From fyang at openjdk.org Thu Dec 1 06:03:17 2022 From: fyang at openjdk.org (Fei Yang) Date: Thu, 1 Dec 2022 06:03:17 GMT Subject: RFR: 8297715: RISC-V: C2: Use single-bit instructions from the Zbs extension In-Reply-To: References: <6sS3mj04mdeRTLoLgicm1C3g0F1nBigke74p1XinQ4U=.9f9f80b2-774e-49d2-8659-225e79b1f4f5@github.com> Message-ID: <3LBcFZb6kNiDNg1YjnZK3SWpdyQX0gTNMm94hItOhGo=.a76d02bc-9bbb-4c7f-99e1-52fb3d001d58@github.com> On Tue, 29 Nov 2022 06:48:25 GMT, Feilong Jiang wrote: >> The single-bit instructions from the Zbs extension provide a mechanism to set, clear, >> invert, or extract a single bit in a register. The bit is specified by its index. >> >> Especially, the single-bit extract (immediate) instruction 'bexti rd, rs1, shamt' [1] performs: >> >> let index = shamt & (XLEN - 1); >> X(rd) = (X(rs1) >> index) & 1; >> >> >> This instruction is a perfect match for following C2 sub-graph when integer immediate 'mask' is power of 2: >> >> Set dst (Conv2B (AndI src mask)) >> >> >> The effect is that we could then optimize C2 JIT code for methods like [2]: >> Before: >> >> lhu R28, [R11, #12] # short, #@loadUS ! Field: com/sun/org/apache/xerces/internal/dom/NodeImpl.flags >> andi R7, R28, #8 #@andI_reg_imm >> snez R10, R7 #@convI2Bool >> >> >> After: >> >> lhu R28, [R11, #12] # short, #@loadUS ! Field: com/sun/org/apache/xerces/internal/dom/NodeImpl.flags >> bexti R10, R28, 3 # >> >> >> Testing: Tier1-3 hotspot & jdk tested with QEMU (JTREG="VM_OPTIONS=-XX:+UnlockExperimentalVMOptions -XX:+UseZbs"). >> >> [1] https://github.com/riscv/riscv-bitmanip/blob/main/bitmanip/insns/bexti.adoc >> >> [2] https://github.com/openjdk/jdk/blob/master/src/java.xml/share/classes/com/sun/org/apache/xerces/internal/dom/NodeImpl.java#L1936 > > Looks good. @feilongjiang @yadongw : Thanks for looking at this. Need a Reviewer then. Maybe @shipilev ? 
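For readers following the quoted description, here is a minimal Java sketch of why `bexti` can stand in for `Conv2B (AndI src mask)` when `mask` is a power of two. It assumes XLEN = 64; the class and the helper names `bexti` and `conv2bAnd` are hypothetical models written for illustration, not HotSpot code.

```java
public class BextiSketch {
    // Models 'bexti rd, rs1, shamt' for an assumed XLEN of 64.
    static long bexti(long rs1, int shamt) {
        int index = shamt & 63;        // index = shamt & (XLEN - 1)
        return (rs1 >>> index) & 1L;   // X(rd) = (X(rs1) >> index) & 1
    }

    // Models the C2 sub-graph Conv2B(AndI(src, mask)):
    // the result is 1 if any masked bit is set, otherwise 0.
    static int conv2bAnd(int src, int mask) {
        return (src & mask) != 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        int flags = 0b1010;
        // With mask = 8 = 1 << 3, both forms test bit 3 of 'flags'.
        System.out.println(conv2bAnd(flags, 8));
        System.out.println(bexti(flags, 3));
    }
}
```

When `mask == 1 << k`, the AND keeps only bit `k`, so "non-zero" and "bit `k` is set" are the same condition, which is what the single `bexti` computes.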
------------- PR: https://git.openjdk.org/jdk/pull/11406 From thartmann at openjdk.org Thu Dec 1 06:30:26 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 1 Dec 2022 06:30:26 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v6] In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 02:31:34 GMT, Yi Yang wrote: >> Hi, can I have a review for this patch? [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) recognized the form of `Phi->CastII->AddI` as additional parallel induction variables. In the following program: >> >> class Test { >> static int dontInline() { >> return 0; >> } >> >> static long test(int val, boolean b) { >> long ret = 0; >> long dArr[] = new long[100]; >> for (int i = 15; 293 > i; ++i) { >> ret = val; >> int j = 1; >> while (++j < 6) { >> int k = (val--); >> for (long l = i; 1 > l; ) { >> if (k != 0) { >> ret += dontInline(); >> } >> } >> if (b) { >> break; >> } >> } >> } >> return ret; >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 1000; i++) { >> test(0, false); >> } >> } >> } >> >> `val` is incorrectly matched with the new parallel IV form: >> ![image](https://user-images.githubusercontent.com/5010047/182059398-fc5204bc-8d95-4e3e-8c66-15776af457b8.png) >> And C2 further replaces it with newly added nodes, which finally leads the crash: >> ![image](https://user-images.githubusercontent.com/5010047/182059498-13148d46-b10f-4e18-b84a-f6b9f626ac7b.png) >> >> I think we can add more constraints to the new form. 
The form of `Phi->CastXX->AddX` appears when using Preconditions.checkIndex, and it would be recognized as additional IV when 1) Phi != phi2, 2) CastXX is controlled by RangeCheck(to reflect changes in Preconditions checkindex intrinsic) > > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > whitespace Performance results look good (no measurable difference). A second review would be good though. ------------- PR: https://git.openjdk.org/jdk/pull/9695 From shade at openjdk.org Thu Dec 1 08:19:21 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 1 Dec 2022 08:19:21 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v6] In-Reply-To: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: > If you look at generated code for the JMH benchmark like: > > > public class ArrayRead { > @Param({"1", "100", "10000", "1000000"}) > int size; > > int[] is; > > @Setup > public void setup() { > is = new int[size]; > for (int c = 0; c < size; c++) { > is[c] = c; > } > } > > @Benchmark > public void test(Blackhole bh) { > for (int i = 0; i < is.length; i++) { > bh.consume(is[i]); > } > } > } > > > ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop. > > This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible. 
>
> We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only "prevent dead code elimination" part, as minimally required by blackhole semantics.
>
> Motivational improvements on the test above:
>
>
> Benchmark        (size)  Mode  Cnt        Score      Error  Units
>
> # Before, full Java blackholes
> ArrayRead.test        1  avgt    9        5.422 ±    0.023  ns/op
> ArrayRead.test      100  avgt    9      460.619 ±    0.421  ns/op
> ArrayRead.test    10000  avgt    9    44697.909 ± 1964.787  ns/op
> ArrayRead.test  1000000  avgt    9  4332723.304 ± 2791.324  ns/op
>
> # Before, compiler blackholes
> ArrayRead.test        1  avgt    9        1.791 ±    0.007  ns/op
> ArrayRead.test      100  avgt    9      114.103 ±    1.677  ns/op
> ArrayRead.test    10000  avgt    9     8528.544 ±   52.010  ns/op
> ArrayRead.test  1000000  avgt    9  1005139.070 ± 2883.011  ns/op
>
> # After, compiler blackholes
> ArrayRead.test        1  avgt    9        1.686 ±    0.006  ns/op ; ~1.1x better
> ArrayRead.test      100  avgt    9       16.249 ±    0.019  ns/op ; ~7.0x better
> ArrayRead.test    10000  avgt    9     1375.265 ±    2.420  ns/op ; ~6.2x better
> ArrayRead.test  1000000  avgt    9   136862.574 ± 1057.100  ns/op ; ~7.3x better
>
>
> `-prof perfasm` shows the reason for these improvements clearly:
>
> Before:
>
>
>         ? 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1
>  7.97%  ? 0x00007f0b54498365: cmp %edx,%r11d
>  1.27%  ? 0x00007f0b54498368: jae 0x00007f0b5449838f
>         ? 0x00007f0b5449836a: shl $0x3,%r10
>  0.03%  ? 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]"
>  7.76%  ? 0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is"
>  0.24%  ? 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1
> 17.48%  ? 0x00007f0b5449837e: inc %r11d ; i++
>  0.17%  ? 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2
> 53.26%  ? 0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check
>  4.84%  ? 0x00007f0b54498388: cmp %edx,%r11d
>  0.31%  ? 0x00007f0b5449838b: jl 0x00007f0b54498360
>
>
> After:
>
>
>
>         ? 0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read
> 19.66%  ?
0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx > 0.14% ? 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx > 22.09% ? 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx > 20.19% ? 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx > 0.04% ? 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx > 24.02% ? 0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8d8: add $0x8,%r10d ; i += 8 > ? 0x00007fa06c49a8dc: cmp %esi,%r10d > 0.07% ? 0x00007fa06c49a8df: jl 0x00007fa06c49a8b0 > > > Additional testing: > - [x] Eyeballing JMH Samples `-prof perfasm` > - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole` > - [x] Linux x86_64 fastdebug, JDK benchmark corpus Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision: - Merge branch 'master' into JDK-8296545-blackhole-effects - Merge branch 'master' into JDK-8296545-blackhole-effects - Add comment in cfgnode.hpp - Blackhole as CFG node - Merge branch 'master' into JDK-8296545-blackhole-effects - Blackhole should be AliasIdxTop - Do not touch memory at all - Fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11041/files - new: https://git.openjdk.org/jdk/pull/11041/files/49a34ed7..fd2aea6b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11041&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11041&range=04-05 Stats: 6319 lines in 275 files changed: 3693 ins; 1403 del; 1223 mod Patch: https://git.openjdk.org/jdk/pull/11041.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11041/head:pull/11041 PR: https://git.openjdk.org/jdk/pull/11041 From chagedorn at openjdk.org Thu Dec 1 08:26:26 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 1 Dec 2022 08:26:26 GMT Subject: RFR: 8290432: C2 
compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v6] In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 02:31:34 GMT, Yi Yang wrote: >> Hi, can I have a review for this patch? [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) recognized the form of `Phi->CastII->AddI` as additional parallel induction variables. In the following program: >> >> class Test { >> static int dontInline() { >> return 0; >> } >> >> static long test(int val, boolean b) { >> long ret = 0; >> long dArr[] = new long[100]; >> for (int i = 15; 293 > i; ++i) { >> ret = val; >> int j = 1; >> while (++j < 6) { >> int k = (val--); >> for (long l = i; 1 > l; ) { >> if (k != 0) { >> ret += dontInline(); >> } >> } >> if (b) { >> break; >> } >> } >> } >> return ret; >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 1000; i++) { >> test(0, false); >> } >> } >> } >> >> `val` is incorrectly matched with the new parallel IV form: >> ![image](https://user-images.githubusercontent.com/5010047/182059398-fc5204bc-8d95-4e3e-8c66-15776af457b8.png) >> And C2 further replaces it with newly added nodes, which finally leads the crash: >> ![image](https://user-images.githubusercontent.com/5010047/182059498-13148d46-b10f-4e18-b84a-f6b9f626ac7b.png) >> >> I think we can add more constraints to the new form. The form of `Phi->CastXX->AddX` appears when using Preconditions.checkIndex, and it would be recognized as additional IV when 1) Phi != phi2, 2) CastXX is controlled by RangeCheck(to reflect changes in Preconditions checkindex intrinsic) > > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > whitespace Do we really need this bailout? 
From your very [first picture](https://user-images.githubusercontent.com/5010047/182059398-fc5204bc-8d95-4e3e-8c66-15776af457b8.png), it does not seem wrong to replace `125 Phi` - the only problem is that `465 CastII` also happens to have `432 CountedLoop` as control input. This leads to the crash because we are removing two outputs of `432 CountedLoop` at once which is not valid and previously unexpected (before [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585)) . Couldn't we instead just update `replace_parallel_iv()` in such a way that it can handle this particular case of removing two outputs at once from the counted loop? ------------- PR: https://git.openjdk.org/jdk/pull/9695 From haosun at openjdk.org Thu Dec 1 09:01:26 2022 From: haosun at openjdk.org (Hao Sun) Date: Thu, 1 Dec 2022 09:01:26 GMT Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 10:00:02 GMT, Andrew Haley wrote: >>> What would probably work better is to idealize `(cmp (cmp3 x y) < 0)` to `(cmpU x y)` >> >> I think this has been done in the original x86 patch. See https://github.com/openjdk/jdk/pull/9068/files#diff-054ecd9354722843f23556a38d2c24546c8a777b58b3442abea2d5e9fe6bb916R851 > >> > What would probably work better is to idealize `(cmp (cmp3 x y) < 0)` to `(cmpU x y)` >> >> I think this has been done in the original x86 patch. See https://github.com/openjdk/jdk/pull/9068/files#diff-054ecd9354722843f23556a38d2c24546c8a777b58b3442abea2d5e9fe6bb916R851 > > Interesting. I didn't see this happening. I'll have another look. Hi @theRealAph , > It seems to me that the enhancement here, if it exists, is in the noise. That may well be because the test is dominated by memory traffic. Yes. I agree. We should use more `compareUnsigned` or use less memory loading in the loop. I will show you the new test data later. 
> Part of the problem is that using `cmp; cset; cneg` doesn't take advantage of branch prediction. And more often than not, the result of a comparison is predictable.

Yes. I agree. I'd like to share my investigation here.

### Motivation of this intrinsic

I checked the two PRs in [JDK-8283726](https://bugs.openjdk.org/browse/JDK-8283726), i.e. the initial x86 intrinsic task.
From the discussions of https://github.com/openjdk/jdk/pull/7975 and https://github.com/openjdk/jdk/pull/9068, I think there are two motivations to introduce this `compareUnsigned` intrinsic.

**motivation-1**:

> the compiler can recognise the pattern x + MIN_VALUE < y + MIN_VALUE and transforms it into x u< y. This transformation is fragile however if one of the arguments is in the form x + con

See the discussion [here](https://github.com/openjdk/jdk/pull/7975#issuecomment-1082464476). I think that's why `bound - 16` is used rather than simply `bound` in [the JMH case](https://github.com/openjdk/jdk/pull/9068/files#diff-0a47e045a0f44094a7ec1fe7a251cc1d99d9fcd9c330cb33a79dd58e7f21b5d6R163).

**motivation-2**:

An optimization can be introduced, i.e. idealizing `(cmp (cmp3 x y) < 0)` to `(cmpU x y)`.

There is a `compareUnsignedIndirect` function in the JMH case to show the performance uplift of this optimization.

### Update the JMH case

I'd like to use the `Integer` case as an illustrative example, and I think `Long` can be handled in a similar way.

Here is my update to the JMH case.

    // Compare between random values.
    // Use big loops
    // Operate with another constant before cmpU3
    @Benchmark
    public void compareUnsignedDirect2(Blackhole bh) {
        int r = 0;
        int inx1 = 0;
        int inx2 = 0;
        int i, e1, e2;
        for (i = 0; i < size; i++) {
            inx1 = intsTiny[i];
            inx2 = intsSmall[i];
            for (i = 0; i < size; i++) {
                inx1 = (inx1 + i) % size;
                inx2 = (inx2 + i) % size;
                e1 = intsBig[inx1];
                e2 = intsBig[inx2];
                r += Integer.compareUnsigned(e1, e2 - 16);
                r -= Integer.compareUnsigned(r, e1 - 16);
                r += Integer.compareUnsigned(r, e2 - 16);
            }
        }
        bh.consume(r);
    }

    @Benchmark
    public void compareUnsignedIndirect2(Blackhole bh) {
        int r = 0;
        int inx1 = 0;
        int inx2 = 0;
        int i, e1, e2;
        for (i = 0; i < size; i++) {
            inx1 = intsTiny[i];
            inx2 = intsSmall[i];
            for (i = 0; i < size; i++) {
                inx1 = (inx1 + i) % size;
                inx2 = (inx2 + i) % size;
                e1 = intsBig[inx1];
                e2 = intsBig[inx2];
                r += (Integer.compareUnsigned(e1, e2 - 16) < 0) ? 1 : 0;
                r -= (Integer.compareUnsigned(r, e1 - 16) < 0) ? 1 : 0;
                r += (Integer.compareUnsigned(r, e2 - 16) < 0) ? 1 : 0;
            }
        }
        bh.consume(r);
    }

Note-1: random values are passed to `compareUnsigned` to evaluate the effect of `branch prediction`. Inevitably, memory load is introduced in the inner loop, i.e. getting the values of `e1` and `e2`.
Note-2: big loops are used. In my evaluation, I passed `10240` to parameter `size`.
Note-3: more `compareUnsigned` operations are done in the inner loop.
Note-4: similarly, we pass `- 16` to `compareUnsigned` and use `Direct/Indirect` versions.

### Evaluation data on x86 and aarch64

Note-1: `before` means `disable intrinsic _compareUnsigned_i`, and `after` means `enable it`.
Note-2: the unit is `us/op`. The smaller, the better.

For case **compareUnsignedDirect2**

on AArch64:

    after : 55.144 ± 8.887 us/op
    before: 51.423 ± 6.891 us/op

on x86:

    after : 70.549 ± 3.729 us/op
    before: 67.781 ± 2.876 us/op

For case **compareUnsignedIndirect2**

on AArch64:

    after : 48.915 ± 11.054 us/op
    before: 52.782 ± 8.264 us/op

on x86:

    after : 68.107 ± 3.768 us/op
    before: 70.231 ±
3.966 us/op

From the evaluation data, we can see:

1) There is a **performance regression** for case **compareUnsignedDirect2** on both aarch64 and x86. I checked the C2 generated code for this case on aarch64, and I don't think `e - 16` can prevent the C2 transformation "x+min_val OP y+min_val -> x uOP y". I think C2 can generate good enough code and our intrinsic didn't win. Perhaps **motivation-1** is not accurate?

2) There are **slight performance uplifts** for case **compareUnsignedIndirect2** on both aarch64 and x86. Hence, the idealization optimization (see **motivation-2**) works. I checked the generated code with this optimization and the sequence `cmp; cset; cneg` is gone.

With the investigation above, I personally don't have a strong reason to introduce this intrinsic to the aarch64 part. WDYT? @theRealAph

Besides, it would be nice if you could take a look at this discussion, @merykitty. Is there anything I missed or misunderstood? Thanks in advance.

-------------

PR: https://git.openjdk.org/jdk/pull/11383

From smonteith at openjdk.org Thu Dec 1 09:11:17 2022
From: smonteith at openjdk.org (Stuart Monteith)
Date: Thu, 1 Dec 2022 09:11:17 GMT
Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v3]
In-Reply-To:
References:
Message-ID:

> The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. There are no scalar instructions that perform the equivalent functions on aarch64, instead the intrinsics can be implemented with vector instructions included in SVE2; expand with BDEP, compress with BEXT.
>
> Only the first lane of each vector will be used, two MOV instructions will move the inputs from GPRs into temporary vector registers, and another to do the reverse for the result. Autovectorization for this functionality is/will be implemented separately.
>
> Running on an SVE2 enabled system, I ran the following benchmarks:
>
> org.openjdk.bench.java.lang.Integers
> org.openjdk.bench.java.lang.Longs
>
> The time for each operation reduced to 56% to 72% of the original run time:
>
>
> Benchmark               Result  error  Unit   % against non-SVE2
> Integers.expand          2.106  0.011  us/op
> Integers.expand-SVE      1.431  0.009  us/op  67.95%
> Longs.expand             2.606  0.006  us/op
> Longs.expand-SVE         1.46   0.003  us/op  56.02%
> Integers.compress        1.982  0.004  us/op
> Integers.compress-SVE    1.427  0.003  us/op  72.00%
> Longs.compress           2.501  0.002  us/op
> Longs.compress-SVE       1.441  0.003  us/op  57.62%
>
>
> These methods can be specifically tested with:
> `make test TEST="jtreg:compiler/intrinsics/TestBitShuffleOpers.java"`

Stuart Monteith has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:

 - Merge branch 'openjdk:master' into JDK-8294194
 - Update src/hotspot/cpu/aarch64/aarch64.ad

   Correct slight formatting error.

   Co-authored-by: Eric Liu
 - 8294194: Create intrinsics compress and expand

   The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. There are no scalar instructions that perform the equivalent functions on aarch64, instead the intrinsics can be implemented with vector instructions included in SVE2; expand with BDEP, compress with BEXT.

   Only the first lane of each vector will be used, two MOV instructions will move the inputs from GPRs into temporary vector registers, and another to do the reverse for the result. Autovectorization for this functionality is/will be implemented separately.
Running on an SVE2 enabled system, I ran the following benchmarks: org.openjdk.bench.java.lang.Integers org.openjdk.bench.java.lang.Longs The time for each operation reduced to 56% to 72% of the original run time: Benchmark Result error Unit % against non-SVE2 Integers.expand 2.106 0.011 us/op Integers.expand-SVE 1.431 0.009 us/op 67.95% Longs.expand 2.606 0.006 us/op Longs.expand-SVE 1.46 0.003 us/op 56.02% Integers.compress 1.982 0.004 us/op Integers.compress-SVE 1.427 0.003 us/op 72.00% Longs.compress 2.501 0.002 us/op Longs.compress-SVE 1.441 0.003 us/op 57.62% ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10537/files - new: https://git.openjdk.org/jdk/pull/10537/files/8b13dabb..a7484586 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10537&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10537&range=01-02 Stats: 319140 lines in 4215 files changed: 161741 ins; 101533 del; 55866 mod Patch: https://git.openjdk.org/jdk/pull/10537.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10537/head:pull/10537 PR: https://git.openjdk.org/jdk/pull/10537 From chagedorn at openjdk.org Thu Dec 1 09:47:07 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 1 Dec 2022 09:47:07 GMT Subject: RFR: 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top Message-ID: ![image](https://user-images.githubusercontent.com/17833009/205015579-1b6ac082-e992-4828-80f7-ff991964179b.png) During CCP, we optimize the type of `348 CastII` in `CastIINode::Value()`: It matches the `CmpI/If` pattern because the current type of `119 Phi` is a constant int: https://github.com/openjdk/jdk/blob/9f24a6f43c6a5e1fa92275e0a87af4f1f0603ba3/src/hotspot/share/opto/castnode.cpp#L213-L215 Later in CCP, the type of `119 Phi` is updated and is no longer a constant but `348 CastII` is not processed anymore during CCP and keeps its wrong too narrow type. 
We apply more loop opts and at some point, the `CastII` is replaced by top because the input type is outside of the wrong type range of the `CastII`. Some data nodes are folded and the graph is left in a broken state and we assert during GCM. I propose to add a `CastII` node back to the CCP worklist if we find such a `Cmp/If` pattern to ensure that the `CastII` type is correctly set during CCP. Thanks, Christian ------------- Commit messages: - 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top Changes: https://git.openjdk.org/jdk/pull/11448/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11448&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297264 Stats: 81 lines in 3 files changed: 81 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11448.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11448/head:pull/11448 PR: https://git.openjdk.org/jdk/pull/11448 From roland at openjdk.org Thu Dec 1 11:00:16 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 1 Dec 2022 11:00:16 GMT Subject: RFR: 8297934: [BACKOUT] Compiler should only use verified interface types for optimization Message-ID: Backout that change due to hard to reproduce failures and some failures with stress options. This includes the backout of JDK-8297556 and JDK-8297343 which are fixes to the initial change. 
------------- Commit messages: - Revert "6312651: Compiler should only use verified interface types for optimization" - Revert "8297343: TestStress*.java fail with "got different traces for the same seed"" - Revert "8297556: Parse::check_interpreter_type fails with assert "must constrain OSR typestate"" Changes: https://git.openjdk.org/jdk/pull/11450/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11450&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297934 Stats: 1672 lines in 22 files changed: 514 ins; 823 del; 335 mod Patch: https://git.openjdk.org/jdk/pull/11450.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11450/head:pull/11450 PR: https://git.openjdk.org/jdk/pull/11450 From yyang at openjdk.org Thu Dec 1 11:30:26 2022 From: yyang at openjdk.org (Yi Yang) Date: Thu, 1 Dec 2022 11:30:26 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v6] In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 08:23:41 GMT, Christian Hagedorn wrote: > it does not seem wrong to replace 125 Phi - the only problem is that 465 CastII also happens to have 432 CountedLoop as control input. If replacing Phi#125 is acceptable, we need to reach a consensus that the form of `Add->CastII->Phi` can be considered as parallel IV, which is not valid before [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585). Prior to that, we only considered variables of the form `X += constant`(`Add->Phi`) to be IV. Do we have such a consensus that the form of `Add->CastII->Phi` is considered as parallel IV? A conservative approach would be a compilation bailout, as this patch does. 
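To make the `X += constant` form concrete, here is a small source-level Java sketch of what treating a variable as a parallel induction variable means. It only illustrates the idea and does not claim to show what `replace_parallel_iv()` emits on the C2 graph; the class name, method names, and the constants 7 and 3 are made up for the example.

```java
public class ParallelIvSketch {
    // 'i' is the primary IV (stride 1); 'j' is updated as j += 3 on every
    // iteration, which is the classic Add->Phi parallel IV shape.
    static int withParallelIv(int n) {
        int j = 7;
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += j;
            j += 3;
        }
        return sum;
    }

    // The same loop with 'j' expressed as a function of 'i':
    // j == 7 + 3 * i at the top of each iteration, so the second
    // loop-carried variable disappears.
    static int rewritten(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += 7 + 3 * i;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(withParallelIv(10) == rewritten(10));
    }
}
```

The rewrite is only valid when the second variable really advances in lockstep with the primary IV, which is why the thread debates whether a `CastII` sitting between the `Phi` and the `Add` should still qualify.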
------------- PR: https://git.openjdk.org/jdk/pull/9695 From thartmann at openjdk.org Thu Dec 1 11:30:43 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 1 Dec 2022 11:30:43 GMT Subject: RFR: 8297934: [BACKOUT] Compiler should only use verified interface types for optimization In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 10:49:53 GMT, Roland Westrelin wrote: > Backout that change due to hard to reproduce failures and some > failures with stress options. This includes the backout of JDK-8297556 > and JDK-8297343 which are fixes to the initial change. Changes requested by thartmann (Reviewer). test/hotspot/jtreg/ProblemList.txt line 73: > 71: compiler/debug/TestStressCM.java 8297343 generic-all > 72: compiler/debug/TestStressIGVNAndCCP.java 8297343 generic-all > 73: This should be removed. ------------- PR: https://git.openjdk.org/jdk/pull/11450 From thartmann at openjdk.org Thu Dec 1 11:44:18 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 1 Dec 2022 11:44:18 GMT Subject: RFR: 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top In-Reply-To: References: Message-ID: <2NbVbHEuu60NbtCHBggX5uU3VPBrfD_NdvT_A2z8qpE=.ce596fb1-16c9-468f-a9e1-66736d9ab522@github.com> On Thu, 1 Dec 2022 09:37:57 GMT, Christian Hagedorn wrote: > ![image](https://user-images.githubusercontent.com/17833009/205015579-1b6ac082-e992-4828-80f7-ff991964179b.png) > > During CCP, we optimize the type of `348 CastII` in `CastIINode::Value()`: It matches the `CmpI/If` pattern because the current type of `119 Phi` is a constant int: > https://github.com/openjdk/jdk/blob/9f24a6f43c6a5e1fa92275e0a87af4f1f0603ba3/src/hotspot/share/opto/castnode.cpp#L213-L215 > > Later in CCP, the type of `119 Phi` is updated and is no longer a constant but `348 CastII` is not processed anymore during CCP and keeps its wrong too narrow type. 
We apply more loop opts and at some point, the `CastII` is replaced by top because the input type is outside of the wrong type range of the `CastII`. Some data nodes are folded and the graph is left in a broken state and we assert during GCM.
>
> I propose to add a `CastII` node back to the CCP worklist if we find such a `Cmp/If` pattern to ensure that the `CastII` type is correctly set during CCP.
>
> Thanks,
> Christian

Looks good to me.

FTR, the verification code I proposed in [JDK-8257197](https://bugs.openjdk.org/browse/JDK-8257197) would catch this.

-------------

Marked as reviewed by thartmann (Reviewer).

PR: https://git.openjdk.org/jdk/pull/11448

From aph at openjdk.org Thu Dec 1 11:46:21 2022
From: aph at openjdk.org (Andrew Haley)
Date: Thu, 1 Dec 2022 11:46:21 GMT
Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long
In-Reply-To:
References:
Message-ID:

On Mon, 28 Nov 2022 02:31:25 GMT, Hao Sun wrote:

> x86 implemented the intrinsics for compareUnsigned() method in Integer and Long. See JDK-8283726. We add the corresponding AArch64 backend support in this patch.
>
> Note-1: minor style issues are fixed for CmpL3 related rules.
>
> Note-2: Jtreg case TestCompareUnsigned.java is updated to cover the matching rules for "comparing reg with imm" case.
>
> Testing: tier1~3 passed on Linux/AArch64 platform with no new failures.
>
> Following is the performance data for the JMH case:
>
>
>                                                        Before          After
> Benchmark                         (size)  Mode  Cnt  Score   Error   Score   Error  Units
> Integers.compareUnsignedDirect       500  avgt    5  0.994 ± 0.001   0.872 ± 0.015  us/op
> Integers.compareUnsignedIndirect     500  avgt    5  0.991 ± 0.001   0.833 ± 0.055  us/op
> Longs.compareUnsignedDirect          500  avgt    5  1.052 ± 0.001   0.974 ± 0.057  us/op
> Longs.compareUnsignedIndirect        500  avgt    5  1.053 ± 0.001   0.916 ±
0.038 us/op

Try this one:

    @Benchmark
    public int compareUnsignedDirect(Blackhole bh) {
        int probe1 = seed, probe2 = seed ^ seed << 5;
        int sum = 0;
        for (int i = 0; i < size; i++) {
            probe1 ^= probe1 << 13;
            sum += Integer.compareUnsigned(probe1, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */;
            probe2 ^= probe2 << 13;
            probe1 ^= probe1 >>> 17;
            sum += Integer.compareUnsigned(probe2, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */;
            sum += Integer.compareUnsigned(probe1, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */;
            probe2 ^= probe2 >>> 17;
            probe1 ^= probe1 << 5;
            sum += Integer.compareUnsigned(probe2, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */;
            sum += Integer.compareUnsigned(probe1, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */;
            probe2 ^= probe2 << 5;
            sum += Integer.compareUnsigned(probe2, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */;
        }
        seed = probe2 + probe1;
        return sum;
    }

Does that help?

-------------

PR: https://git.openjdk.org/jdk/pull/11383

From chagedorn at openjdk.org Thu Dec 1 12:17:09 2022
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Thu, 1 Dec 2022 12:17:09 GMT
Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v6]
In-Reply-To:
References:
Message-ID:

On Mon, 21 Nov 2022 02:31:34 GMT, Yi Yang wrote:

>> Hi, can I have a review for this patch? [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) recognized the form of `Phi->CastII->AddI` as additional parallel induction variables.
In the following program: >> >> class Test { >> static int dontInline() { >> return 0; >> } >> >> static long test(int val, boolean b) { >> long ret = 0; >> long dArr[] = new long[100]; >> for (int i = 15; 293 > i; ++i) { >> ret = val; >> int j = 1; >> while (++j < 6) { >> int k = (val--); >> for (long l = i; 1 > l; ) { >> if (k != 0) { >> ret += dontInline(); >> } >> } >> if (b) { >> break; >> } >> } >> } >> return ret; >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 1000; i++) { >> test(0, false); >> } >> } >> } >> >> `val` is incorrectly matched with the new parallel IV form: >> ![image](https://user-images.githubusercontent.com/5010047/182059398-fc5204bc-8d95-4e3e-8c66-15776af457b8.png) >> And C2 further replaces it with newly added nodes, which finally leads the crash: >> ![image](https://user-images.githubusercontent.com/5010047/182059498-13148d46-b10f-4e18-b84a-f6b9f626ac7b.png) >> >> I think we can add more constraints to the new form. The form of `Phi->CastXX->AddX` appears when using Preconditions.checkIndex, and it would be recognized as additional IV when 1) Phi != phi2, 2) CastXX is controlled by RangeCheck(to reflect changes in Preconditions checkindex intrinsic) > > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > whitespace > If replacing Phi#125 is acceptable, we need to reach a consensus that the form of Add->CastII->Phi can be considered as parallel IV I think it should be safe and we've already done that since JDK-8273585 and haven't seen a crash related to that idea. But given how close we are to the fork, I suggest to go with your bailout fix for JDK 20 which is safer (and performance testing done by Tobias looks good). In this way, we really only optimize the pattern originally intended in JDK-8273585. For the general case of allowing any cast node, I suggest to file an RFE and investigate again if that is possible/correct. 
I have the feeling that it is but it would be better to defer that to JDK 21. What do you think? Thanks, Christian ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9695 From thartmann at openjdk.org Thu Dec 1 12:18:55 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 1 Dec 2022 12:18:55 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v6] In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 02:31:34 GMT, Yi Yang wrote: >> Hi, can I have a review for this patch? [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) recognized the form of `Phi->CastII->AddI` as additional parallel induction variables. In the following program: >> >> class Test { >> static int dontInline() { >> return 0; >> } >> >> static long test(int val, boolean b) { >> long ret = 0; >> long dArr[] = new long[100]; >> for (int i = 15; 293 > i; ++i) { >> ret = val; >> int j = 1; >> while (++j < 6) { >> int k = (val--); >> for (long l = i; 1 > l; ) { >> if (k != 0) { >> ret += dontInline(); >> } >> } >> if (b) { >> break; >> } >> } >> } >> return ret; >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 1000; i++) { >> test(0, false); >> } >> } >> } >> >> `val` is incorrectly matched with the new parallel IV form: >> ![image](https://user-images.githubusercontent.com/5010047/182059398-fc5204bc-8d95-4e3e-8c66-15776af457b8.png) >> And C2 further replaces it with newly added nodes, which finally leads the crash: >> ![image](https://user-images.githubusercontent.com/5010047/182059498-13148d46-b10f-4e18-b84a-f6b9f626ac7b.png) >> >> I think we can add more constraints to the new form. 
The form of `Phi->CastXX->AddX` appears when using Preconditions.checkIndex, and it would be recognized as additional IV when 1) Phi != phi2, 2) CastXX is controlled by RangeCheck(to reflect changes in Preconditions checkindex intrinsic) > > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > whitespace That makes sense, you can re-use/extend [JDK-8297307](https://bugs.openjdk.org/browse/JDK-8297307) for that. Vladimir should also have a look at this again. ------------- PR: https://git.openjdk.org/jdk/pull/9695 From roland at openjdk.org Thu Dec 1 12:22:29 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 1 Dec 2022 12:22:29 GMT Subject: RFR: 8297934: [BACKOUT] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: Message-ID: > Backout that change due to hard to reproduce failures and some > failures with stress options. This includes the backout of JDK-8297556 > and JDK-8297343 which are fixes to the initial change. 
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11450/files - new: https://git.openjdk.org/jdk/pull/11450/files/83fc3f11..b4cf71a3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11450&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11450&range=00-01 Stats: 3 lines in 1 file changed: 0 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11450.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11450/head:pull/11450 PR: https://git.openjdk.org/jdk/pull/11450 From roland at openjdk.org Thu Dec 1 12:22:33 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 1 Dec 2022 12:22:33 GMT Subject: RFR: 8297934: [BACKOUT] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 11:27:38 GMT, Tobias Hartmann wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> review > > test/hotspot/jtreg/ProblemList.txt line 73: > >> 71: compiler/debug/TestStressCM.java 8297343 generic-all >> 72: compiler/debug/TestStressIGVNAndCCP.java 8297343 generic-all >> 73: > > This should be removed. Right. Done. 
------------- PR: https://git.openjdk.org/jdk/pull/11450 From chagedorn at openjdk.org Thu Dec 1 12:23:28 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 1 Dec 2022 12:23:28 GMT Subject: RFR: 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top In-Reply-To: <2NbVbHEuu60NbtCHBggX5uU3VPBrfD_NdvT_A2z8qpE=.ce596fb1-16c9-468f-a9e1-66736d9ab522@github.com> References: <2NbVbHEuu60NbtCHBggX5uU3VPBrfD_NdvT_A2z8qpE=.ce596fb1-16c9-468f-a9e1-66736d9ab522@github.com> Message-ID: <2wtCZMitDCgzB6CfG9M-cEF4Kyo2Vigo17nfoW9m4qE=.93002247-c257-447c-9ce4-d9f2f532b654@github.com> On Thu, 1 Dec 2022 11:42:00 GMT, Tobias Hartmann wrote: > Looks good to me. FTR, the verification code I proposed in [JDK-8257197](https://bugs.openjdk.org/browse/JDK-8257197) would catch this. Thanks Tobias for your review! Yes, that's a good point. I think it would be good to get this verification code in at some point to catch such issues with CCP earlier. ------------- PR: https://git.openjdk.org/jdk/pull/11448 From yyang at openjdk.org Thu Dec 1 12:29:06 2022 From: yyang at openjdk.org (Yi Yang) Date: Thu, 1 Dec 2022 12:29:06 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v6] In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 02:31:34 GMT, Yi Yang wrote: >> Hi, can I have a review for this patch? [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) recognized the form of `Phi->CastII->AddI` as additional parallel induction variables. 
In the following program: >> >> class Test { >> static int dontInline() { >> return 0; >> } >> >> static long test(int val, boolean b) { >> long ret = 0; >> long dArr[] = new long[100]; >> for (int i = 15; 293 > i; ++i) { >> ret = val; >> int j = 1; >> while (++j < 6) { >> int k = (val--); >> for (long l = i; 1 > l; ) { >> if (k != 0) { >> ret += dontInline(); >> } >> } >> if (b) { >> break; >> } >> } >> } >> return ret; >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 1000; i++) { >> test(0, false); >> } >> } >> } >> >> `val` is incorrectly matched with the new parallel IV form: >> ![image](https://user-images.githubusercontent.com/5010047/182059398-fc5204bc-8d95-4e3e-8c66-15776af457b8.png) >> And C2 further replaces it with newly added nodes, which finally leads the crash: >> ![image](https://user-images.githubusercontent.com/5010047/182059498-13148d46-b10f-4e18-b84a-f6b9f626ac7b.png) >> >> I think we can add more constraints to the new form. The form of `Phi->CastXX->AddX` appears when using Preconditions.checkIndex, and it would be recognized as additional IV when 1) Phi != phi2, 2) CastXX is controlled by RangeCheck(to reflect changes in Preconditions checkindex intrinsic) > > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > whitespace > For the general case of allowing any cast node, I suggest to file an RFE and investigate again if that is possible/correct. > > Thanks, Christian I think it's really reasonable. As Tobias said, we can reuse https://bugs.openjdk.org/browse/JDK-8297307 for this purpose. Thanks!
------------- PR: https://git.openjdk.org/jdk/pull/9695 From thartmann at openjdk.org Thu Dec 1 12:40:24 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 1 Dec 2022 12:40:24 GMT Subject: RFR: 8297934: [BACKOUT] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: Message-ID: <77laS-jw0T6AQuU4IsXVkoGEh3TqsFuxKhBKRoLwk8g=.8ea22ac4-5389-47e7-a1b1-21fbabd0091c@github.com> On Thu, 1 Dec 2022 12:22:29 GMT, Roland Westrelin wrote: >> Backout that change due to hard to reproduce failures and some >> failures with stress options. This includes the backout of JDK-8297556 >> and JDK-8297343 which are fixes to the initial change. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11450 From kvn at openjdk.org Thu Dec 1 12:45:21 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 1 Dec 2022 12:45:21 GMT Subject: RFR: 8297934: [BACKOUT] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 12:22:29 GMT, Roland Westrelin wrote: >> Backout that change due to hard to reproduce failures and some >> failures with stress options. This includes the backout of JDK-8297556 >> and JDK-8297343 which are fixes to the initial change. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Good. Waiting testing results. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11450 From rcastanedalo at openjdk.org Thu Dec 1 13:14:04 2022 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 1 Dec 2022 13:14:04 GMT Subject: RFR: 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 09:37:57 GMT, Christian Hagedorn wrote: > ![image](https://user-images.githubusercontent.com/17833009/205015579-1b6ac082-e992-4828-80f7-ff991964179b.png) > > During CCP, we optimize the type of `348 CastII` in `CastIINode::Value()`: It matches the `CmpI/If` pattern because the current type of `119 Phi` is a constant int: > https://github.com/openjdk/jdk/blob/9f24a6f43c6a5e1fa92275e0a87af4f1f0603ba3/src/hotspot/share/opto/castnode.cpp#L213-L215 > > Later in CCP, the type of `119 Phi` is updated and is no longer a constant but `348 CastII` is not processed anymore during CCP and keeps its wrong too narrow type. We apply more loop opts and at some point, the `CastII` is replaced by top because the input type is outside of the wrong type range of the `CastII`. Some data nodes are folded and the graph is left in a broken state and we assert during GCM. > > I propose to add a `CastII` node back to the CCP worklist if we find such a `Cmp/If` pattern to ensure that the `CastII` type is correctly set during CCP. > > Thanks, > Christian Looks good! Just a couple of questions/suggestions. src/hotspot/share/opto/phaseX.cpp line 1967: > 1965: push_if_not_bottom_type(worklist, cast_ii); > 1966: } > 1967: } Would it make sense to assume there is at most one `CastII` output and replace the loop with a call to the auxiliary function `Node* Node::find_out_with(int opcode)`? test/hotspot/jtreg/compiler/c2/TestCastIIWrongTypeCCP.java line 1: > 1: /* This file might fit better under `test/hotspot/jtreg/compiler/ccp`.
test/hotspot/jtreg/compiler/c2/TestCastIIWrongTypeCCP.java line 27: > 25: * @test > 26: * @bug 8297264 > 27: * @summary Test that CastII nodes are added to the CCP worklist if they could or could have been Suggestion: * @summary Test that CastII nodes are added to the CCP worklist if they could have been ------------- Marked as reviewed by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/11448 From fjiang at openjdk.org Thu Dec 1 13:59:12 2022 From: fjiang at openjdk.org (Feilong Jiang) Date: Thu, 1 Dec 2022 13:59:12 GMT Subject: RFR: 8297953: Fix several C2 IR matching tests for RISC-V Message-ID: Fix several IR matching tests that failed on RISC-V. Rotate Node will be matched only when UseZbb is enabled: - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeIntIdealizationTests.java - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeLongIdealizationTests.java RISC-V does not provide float branch instruction, so we do not match CMOVEI for two floating-point comparisons: - test/hotspot/jtreg/compiler/c2/irTests/TestFPComparison.java Testing: - test/hotspot/jtreg/compiler/c2/irTests/TestFPComparison.java -- no tests selected as expected. 
- With `-XX:+UseZbb`: - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeIntIdealizationTests.java -- passed - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeLongIdealizationTests.java -- passed - With `-XX:-UseZbb`: - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeIntIdealizationTests.java -- no tests selected as expected - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeLongIdealizationTests.java -- no tests selected as expected ------------- Commit messages: - fix some irtest failed on riscv Changes: https://git.openjdk.org/jdk/pull/11453/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11453&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297953 Stats: 4 lines in 4 files changed: 2 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11453.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11453/head:pull/11453 PR: https://git.openjdk.org/jdk/pull/11453 From bulasevich at openjdk.org Thu Dec 1 14:04:49 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 1 Dec 2022 14:04:49 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v13] In-Reply-To: References: Message-ID: > The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. 
> > Testing: jtreg hotspot&jdk, Renaissance benchmarks Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: minor api refactoring: start_scope and roll_back instead of position and set_position ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10025/files - new: https://git.openjdk.org/jdk/pull/10025/files/a24683d9..477609da Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=11-12 Stats: 95 lines in 3 files changed: 19 ins; 1 del; 75 mod Patch: https://git.openjdk.org/jdk/pull/10025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10025/head:pull/10025 PR: https://git.openjdk.org/jdk/pull/10025 From fyang at openjdk.org Thu Dec 1 14:08:30 2022 From: fyang at openjdk.org (Fei Yang) Date: Thu, 1 Dec 2022 14:08:30 GMT Subject: RFR: 8297953: Fix several C2 IR matching tests for RISC-V In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 13:50:23 GMT, Feilong Jiang wrote: > Fix several IR matching tests that failed on RISC-V. > > Rotate Node will be matched only when UseZbb is enabled: > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeIntIdealizationTests.java > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeLongIdealizationTests.java > > RISC-V does not provide float branch instruction, so we do not match CMOVEI for two floating-point comparisons: > - test/hotspot/jtreg/compiler/c2/irTests/TestFPComparison.java > > Testing: > - test/hotspot/jtreg/compiler/c2/irTests/TestFPComparison.java -- no tests selected as expected. 
> > - With `-XX:+UseZbb`: > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeIntIdealizationTests.java -- passed > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeLongIdealizationTests.java -- passed > > - With `-XX:-UseZbb`: > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeIntIdealizationTests.java -- no tests selected as expected > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeLongIdealizationTests.java -- no tests selected as expected Looks reasonable to me. Thanks. ------------- Marked as reviewed by fyang (Reviewer). PR: https://git.openjdk.org/jdk/pull/11453 From thartmann at openjdk.org Thu Dec 1 14:19:35 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 1 Dec 2022 14:19:35 GMT Subject: RFR: 8297934: [BACKOUT] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 12:22:29 GMT, Roland Westrelin wrote: >> Backout that change due to hard to reproduce failures and some >> failures with stress options. This includes the backout of JDK-8297556 >> and JDK-8297343 which are fixes to the initial change. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Testing is all clean. ------------- PR: https://git.openjdk.org/jdk/pull/11450 From roland at openjdk.org Thu Dec 1 14:19:37 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 1 Dec 2022 14:19:37 GMT Subject: RFR: 8297934: [BACKOUT] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 14:16:02 GMT, Tobias Hartmann wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> review > > Testing is all clean. @TobiHartmann @vnkozlov thanks for the reviews/testing. 
------------- PR: https://git.openjdk.org/jdk/pull/11450 From roland at openjdk.org Thu Dec 1 14:23:17 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 1 Dec 2022 14:23:17 GMT Subject: Integrated: 8297934: [BACKOUT] Compiler should only use verified interface types for optimization In-Reply-To: References: Message-ID: <4gyfhe5_7w-C1CdWlSeoEIlyRvLYfaw9t9rCTwaiOpw=.ad39bf49-4645-4047-8cf6-52a284ee78e1@github.com> On Thu, 1 Dec 2022 10:49:53 GMT, Roland Westrelin wrote: > Backout that change due to hard to reproduce failures and some > failures with stress options. This includes the backout of JDK-8297556 > and JDK-8297343 which are fixes to the initial change. This pull request has now been integrated. Changeset: 9430f3e6 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/9430f3e65c4900e121858dc111b6f20207e0694f Stats: 1669 lines in 21 files changed: 511 ins; 823 del; 335 mod 8297934: [BACKOUT] Compiler should only use verified interface types for optimization Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.org/jdk/pull/11450 From chagedorn at openjdk.org Thu Dec 1 14:29:27 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 1 Dec 2022 14:29:27 GMT Subject: RFR: 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph Message-ID: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com> The test cases of this bug reveal the same problem with `PhaseIdealLoop::create_new_if_for_predicate()` as in [JDK-8271954](https://bugs.openjdk.org/browse/JDK-8271954) (reusing pinned data nodes for different UCT paths into the UCT phi - see PR description of https://github.com/openjdk/jdk/pull/5185). While JDK-8271954 only fixed the usage of `PhaseIdealLoop::create_new_if_for_predicate()` for one specific case in loop unswitching, we now need this fix for other usages of `PhaseIdealLoop::create_new_if_for_predicate()` as well. 
I've found failing cases for most of the usages but I think we should always do a proper cloning as originally added with JDK-8271954. This is what I propose with this patch. To always do this cloning, I've replaced the `UnswitchingAction` by a `rewire_uncommon_proj_phi_inputs` bool that is false by default but can be set if we should only do a rewiring (we can still do the rewiring for the slow loop as previously done with `UnswitchingAction::SlowLoopRewiring`, also see https://github.com/openjdk/jdk/pull/5185 for more details). However, the current fix does not update ctrl for the slow loop nodes - I've fixed that. I've reused the implementation of `clone_data_nodes_for_fast_loop()` but refactored and split that method into multiple methods to reuse some of them for the ctrl update of the slow loop nodes. Thanks, Christian ------------- Commit messages: - Fix whitespaces - 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph Changes: https://git.openjdk.org/jdk/pull/11452/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11452&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290850 Stats: 552 lines in 3 files changed: 455 ins; 32 del; 65 mod Patch: https://git.openjdk.org/jdk/pull/11452.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11452/head:pull/11452 PR: https://git.openjdk.org/jdk/pull/11452 From chagedorn at openjdk.org Thu Dec 1 14:40:53 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 1 Dec 2022 14:40:53 GMT Subject: RFR: 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top [v2] In-Reply-To: References: Message-ID: > ![image](https://user-images.githubusercontent.com/17833009/205015579-1b6ac082-e992-4828-80f7-ff991964179b.png) > > During CCP, we optimize the type of `348 CastII` in `CastIINode::Value()`: It matches the `CmpI/If` pattern because the current type of `119 Phi` is a constant int: >
https://github.com/openjdk/jdk/blob/9f24a6f43c6a5e1fa92275e0a87af4f1f0603ba3/src/hotspot/share/opto/castnode.cpp#L213-L215 > > Later in CCP, the type of `119 Phi` is updated and is no longer a constant but `348 CastII` is not processed anymore during CCP and keeps its wrong too narrow type. We apply more loop opts and at some point, the `CastII` is replaced by top because the input type is outside of the wrong type range of the `CastII`. Some data nodes are folded and the graph is left in a broken state and we assert during GCM. > > I propose to add a `CastII` node back to the CCP worklist if we find such a `Cmp/If` pattern to ensure that the `CastII` type is correctly set during CCP. > > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/c2/TestCastIIWrongTypeCCP.java Co-authored-by: Roberto Castañeda Lozano ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11448/files - new: https://git.openjdk.org/jdk/pull/11448/files/79e13192..6064af50 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11448&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11448&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11448.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11448/head:pull/11448 PR: https://git.openjdk.org/jdk/pull/11448 From chagedorn at openjdk.org Thu Dec 1 14:44:00 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 1 Dec 2022 14:44:00 GMT Subject: RFR: 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top [v3] In-Reply-To: References: Message-ID: > ![image](https://user-images.githubusercontent.com/17833009/205015579-1b6ac082-e992-4828-80f7-ff991964179b.png) > > During CCP, we optimize the type of `348 CastII` in `CastIINode::Value()`: It matches the `CmpI/If` pattern because the
current type of `119 Phi` is a constant int: > https://github.com/openjdk/jdk/blob/9f24a6f43c6a5e1fa92275e0a87af4f1f0603ba3/src/hotspot/share/opto/castnode.cpp#L213-L215 > > Later in CCP, the type of `119 Phi` is updated and is no longer a constant but `348 CastII` is not processed anymore during CCP and keeps its wrong too narrow type. We apply more loop opts and at some point, the `CastII` is replaced by top because the input type is outside of the wrong type range of the `CastII`. Some data nodes are folded and the graph is left in a broken state and we assert during GCM. > > I propose to add a `CastII` node back to the CCP worklist if we find such a `Cmp/If` pattern to ensure that the `CastII` type is correctly set during CCP. > > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Move test to ccp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11448/files - new: https://git.openjdk.org/jdk/pull/11448/files/6064af50..f9dcf645 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11448&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11448&range=01-02 Stats: 0 lines in 1 file changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11448.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11448/head:pull/11448 PR: https://git.openjdk.org/jdk/pull/11448 From chagedorn at openjdk.org Thu Dec 1 14:44:01 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 1 Dec 2022 14:44:01 GMT Subject: RFR: 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top [v3] In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 14:39:49 GMT, Christian Hagedorn wrote: >> ![image](https://user-images.githubusercontent.com/17833009/205015579-1b6ac082-e992-4828-80f7-ff991964179b.png) >> >> During CCP, we optimize the type of `348 CastII` in `CastIINode::Value()`: It matches the `CmpI/If` 
pattern because the current type of `119 Phi` is a constant int: >> https://github.com/openjdk/jdk/blob/9f24a6f43c6a5e1fa92275e0a87af4f1f0603ba3/src/hotspot/share/opto/castnode.cpp#L213-L215 >> >> Later in CCP, the type of `119 Phi` is updated and is no longer a constant but `348 CastII` is not processed anymore during CCP and keeps its wrong too narrow type. We apply more loop opts and at some point, the `CastII` is replaced by top because the input type is outside of the wrong type range of the `CastII`. Some data nodes are folded and the graph is left in a broken state and we assert during GCM. >> >> I propose to add a `CastII` node back to the CCP worklist if we find such a `Cmp/If` pattern to ensure that the `CastII` type is correctly set during CCP. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Move test to ccp Thanks Roberto for your review! I've updated the patch accordingly. ------------- PR: https://git.openjdk.org/jdk/pull/11448 From chagedorn at openjdk.org Thu Dec 1 14:44:04 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 1 Dec 2022 14:44:04 GMT Subject: RFR: 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top [v3] In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 13:04:14 GMT, Roberto Castañeda Lozano wrote: >> Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: >> >> Move test to ccp > > src/hotspot/share/opto/phaseX.cpp line 1967: > >> 1965: push_if_not_bottom_type(worklist, cast_ii); >> 1966: } >> 1967: } > > Would it make sense to assume there is at most one `CastII` output and replace the loop with a call to the auxiliary function `Node* Node::find_out_with(int opcode)`? Unfortunately, in this test case, we have 2 `CastII` nodes that need to be re-added.
So, I think that `find_out_with()` does not work here in general (even though in this test case, both `CastII` nodes would end up again on the worklist but that is probably not guaranteed in general). > test/hotspot/jtreg/compiler/c2/TestCastIIWrongTypeCCP.java line 1: > >> 1: /* > > This file might fit better under `test/hotspot/jtreg/compiler/ccp`. That's a good point, I'll move it over there. ------------- PR: https://git.openjdk.org/jdk/pull/11448 From chagedorn at openjdk.org Thu Dec 1 14:53:21 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 1 Dec 2022 14:53:21 GMT Subject: RFR: 8297953: Fix several C2 IR matching tests for RISC-V In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 13:50:23 GMT, Feilong Jiang wrote: > Fix several IR matching tests that failed on RISC-V. > > Rotate Node will be matched only when UseZbb is enabled: > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeIntIdealizationTests.java > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeLongIdealizationTests.java > > RISC-V does not provide float branch instruction, so we do not match CMOVEI for two floating-point comparisons: > - test/hotspot/jtreg/compiler/c2/irTests/TestFPComparison.java > > Testing: > - test/hotspot/jtreg/compiler/c2/irTests/TestFPComparison.java -- no tests selected as expected. > > - With `-XX:+UseZbb`: > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeIntIdealizationTests.java -- passed > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeLongIdealizationTests.java -- passed > > - With `-XX:-UseZbb`: > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeIntIdealizationTests.java -- no tests selected as expected > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeLongIdealizationTests.java -- no tests selected as expected Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11453 From chagedorn at openjdk.org Thu Dec 1 14:56:34 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 1 Dec 2022 14:56:34 GMT Subject: RFR: 8297951: C2: Create skeleton predicates for all If nodes in loop predication Message-ID: We currently only create skeleton predicates for `RangeCheck` nodes and not for normal `If` nodes: https://github.com/openjdk/jdk/blob/2cb64a75578ccc15a1dfc8c2843aa11d05ca8aa7/src/hotspot/share/opto/loopPredicate.cpp#L1344-L1346 But it is also possible to create range check predicates in loop predication for `If` nodes if they have the right pattern checked in `PhaseIdealLoop::loop_predication_impl()` and `IdealLoopTree::is_range_check_if()`. This, however, is much more rare. Without skeleton predicates for these `If` nodes, we could run into the same problems already fixed for `RangeCheck` nodes (see [JDK-8193130](https://bugs.openjdk.org/browse/JDK-8193130) and related bugs). This is almost impossible to trigger in practice as it needs a very specific setup and the right optimizations to be applied. But the test case shows such a case where we hit an assert due to a broken memory graph because we are missing skeleton predicates. I therefore propose to always create skeleton predicates for hoisted range checks in loop predication. 
Thanks, Christian ------------- Commit messages: - Fix whitespaces - 8297951: C2: Create skeleton predicates for all If nodes in loop predication Changes: https://git.openjdk.org/jdk/pull/11454/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11454&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297951 Stats: 85 lines in 2 files changed: 78 ins; 2 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/11454.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11454/head:pull/11454 PR: https://git.openjdk.org/jdk/pull/11454 From thartmann at openjdk.org Thu Dec 1 15:00:19 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 1 Dec 2022 15:00:19 GMT Subject: RFR: 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top [v3] In-Reply-To: References: Message-ID: <_ujyUUvEq4MgU9Jg_goa5E4yQa59cLf9RMJ8R_FJOwY=.bdd54178-cc2b-4983-80c8-378ce8b12f5d@github.com> On Thu, 1 Dec 2022 14:44:00 GMT, Christian Hagedorn wrote: >> ![image](https://user-images.githubusercontent.com/17833009/205015579-1b6ac082-e992-4828-80f7-ff991964179b.png) >> >> During CCP, we optimize the type of `348 CastII` in `CastIINode::Value()`: It matches the `CmpI/If` pattern because the current type of `119 Phi` is a constant int: >> https://github.com/openjdk/jdk/blob/9f24a6f43c6a5e1fa92275e0a87af4f1f0603ba3/src/hotspot/share/opto/castnode.cpp#L213-L215 >> >> Later in CCP, the type of `119 Phi` is updated and is no longer a constant but `348 CastII` is not processed anymore during CCP and keeps its wrong too narrow type. We apply more loop opts and at some point, the `CastII` is replaced by top because the input type is outside of the wrong type range of the `CastII`. Some data nodes are folded and the graph is left in a broken state and we assert during GCM. >> >> I propose to add a `CastII` node back to the CCP worklist if we find such a `Cmp/If` pattern to ensure that the `CastII` type is correctly set during CCP. 
>> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Move test to ccp Still looks good. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11448 From bulasevich at openjdk.org Thu Dec 1 15:52:31 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 1 Dec 2022 15:52:31 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v11] In-Reply-To: References: <2D9ynUtu7IxcnyELEChKZf0zpksKpmAWZorKxVJlm40=.c9b41147-c5cf-48dd-a6af-d9c30d2705d6@github.com> Message-ID: <9S_IaUpZLkeswAtj6AJG2wg1YYACsrFNMjBYvGu5FVk=.b0f37f2d-3bbd-4ef7-8512-32b429410068@github.com> On Wed, 16 Nov 2022 22:52:51 GMT, Evgeny Astigeevich wrote: >>> // Roll the stream state back to the marked one. >>> void roll_back(); >> >> get_position() is not about roll back only. See the DebugInformationRecorder, it serializes offsets into the stream. >> >> >> int DebugInformationRecorder::serialize_scope_values(...) { >> ... >> int result = stream()->position(); >> ... >> return result; >> } >> >> DebugToken* DebugInformationRecorder::create_scope_values(...) { >> ... >> return (DebugToken*) (intptr_t) serialize_scope_values(values); >> } >> >> void PhaseOutput::Process_OopMap_Node(MachNode *mach, int current_offset) { >> ... >> DebugToken *locvals = C->debug_info()->create_scope_values(locarray); >> DebugToken *expvals = C->debug_info()->create_scope_values(exparray); >> DebugToken *monvals = C->debug_info()->create_monitor_values(monarray); >> >> C->debug_info()->describe_scope( >> ... >> locvals, >> expvals, >> monvals >> ); >> >> void DebugInformationRecorder::describe_scope(... >> DebugToken* locals, >> DebugToken* expressions, >> DebugToken* monitors) { >> ... 
>> // serialize the locals/expressions/monitors >> stream()->write_int((intptr_t) locals); >> stream()->write_int((intptr_t) expressions); >> stream()->write_int((intptr_t) monitors); > > Thank you for the information. It's very helpful. > > I think we should not simulate `CompressedWriteStream`. > > `DebugInformationRecorder` needs certain operations: > We write debug info into a stream writer: as grouped multiple data and single data. We need to know where bytes of grouped data begin and end. We need to keep offsets of grouped data in the stream. We need to be able to discard last written grouped data. We need to get the number of used bytes. We don't need to know how data stored in a stream. > > Based on the specification, we need a stream writer to provide operations: > > // Start grouped data. > // Return a position (byte offset) in the stream where grouped data begins. > int start_group(); > > // Finish grouped data. > // Return a position (byte offset) in the stream where grouped data ends. > int finish_group(); > > // Revert the stream to the specified position. > void set_position(int pos); > > // Return the number of bytes stored data uses. > int data_size() const; > > > With them we don't have a function which in one implementation is const but in another implementation is with side effects. IMHO, at some point later side effects will cause bugs. > > Possible implementations: > > int start_group() { > complete_current_byte(); // this is renamed align() > return _position; > } > > int finish_group() { > complete_current_byte(); > return _position; > } > > int data_size() const { > if (_position == 0 && _bit_position == 0) > return 0; > > int used_bytes = _position; > if (_bit_position != 0) > ++used_bytes; > return used_bytes; > } > > > In `DebugInformationRecorder` we will need to replace `position()` with `start_group()` and add `finish_group()`. 
> We will need to change `int DebugInformationRecorder::find_sharable_decode_offset(int stream_offset)` to > `int DebugInformationRecorder::find_sharable_decode_offset(int data_begin_offset, int data_end_offset)`. > > If we want `DebugInformationRecorder` to use `CompressedWriteStream` we can use an adapter: > > class CompressedWriteStreamAdapter: public CompressedWriteStream { > public: > ... > int start_group() { > return position(); > } > > int finish_group() { > return position(); > } > > int data_size() const { > return position(); > } > > }; Sorry for not being online for a while :) As you proposed, I refactored the CompressedSparseDataWriteStream interface. I removed position() and set_position(int) methods. I added start_scope() and roll_back(int) instead. Plus I have added is_empty() method (without side effect) to avoid unintentional align from assert. Plus I added data_size() method (without side effect) - it is used to check the amount of data to copy. Please check if it is Ok now. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10025 From duke at openjdk.org Thu Dec 1 16:01:22 2022 From: duke at openjdk.org (Matthijs Bijman) Date: Thu, 1 Dec 2022 16:01:22 GMT Subject: Integrated: 8293294: Remove dead code in Parse::check_interpreter_type In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 14:38:35 GMT, Matthijs Bijman wrote: > A small cleanup in Parse::check_interpreter_type to remove two dead declarations. This pull request has now been integrated. 
Changeset: 4899d782 Author: Matthijs Bijman Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/4899d7829246cf3c082ab3c0df9221853d1520a9 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod 8293294: Remove dead code in Parse::check_interpreter_type Reviewed-by: vlivanov, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/11325 From kvn at openjdk.org Thu Dec 1 16:30:30 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 1 Dec 2022 16:30:30 GMT Subject: RFR: 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node [v2] In-Reply-To: References: <6TevexWCuJcFlfnzo3ZRLWmkV7QTiZCKk_gdFXEWwv0=.0b79374a-3c34-4c20-bc90-0ca8df0acbcf@github.com> Message-ID: On Wed, 30 Nov 2022 10:51:55 GMT, Roland Westrelin wrote: >> I mean we never assign control edge to `Bool` and `Cmp` nodes - they depend only on their inputs. At least I don't know about it. > > `set_ctrl()` doesn't change the control input of the nodes, right? It only updates the current loop opts pass's table of controls and all data nodes are in that table. I'm confused by what could be wrong here. You are right, I forgot that this function only updates the side table. Good. ------------- PR: https://git.openjdk.org/jdk/pull/11391 From kvn at openjdk.org Thu Dec 1 16:34:19 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 1 Dec 2022 16:34:19 GMT Subject: RFR: 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node [v2] In-Reply-To: References: Message-ID: On Wed, 30 Nov 2022 10:19:42 GMT, Roland Westrelin wrote: >> A main loop loses its pre loop. The Opaque1 node for the zero trip >> guard of the main loop is assigned control at a Region through which >> an If is split. As a result, the Opaque1 is cloned and the zero trip >> guard takes a Phi that merges Opaque1 nodes. One of the branches dies >> next and, as a result, the zero trip guard has an Opaque1 as input but >> at the wrong CmpI input. The assert fires next.
>> >> The fix I propose is that if an Opaque1 node that is part of a zero >> trip guard is encountered during split if, rather than split if up or >> down, instead, assign it the control of the zero trip guard's >> control. This way the pattern of the zero trip guard is unaffected and >> split if can proceed. I believe it's safe to assign it a later >> control: >> >> - an Opaque1 can't be shared >> >> - the zero trip guard can't be the If that's being split >> >> As Vladimir noted, this bug used to not reproduce with loop strip >> mining disabled but now always reproduces because the loop >> strip mining nest is always constructed. The reason is that the >> main loop in this test is kept alive by the LSM safepoint. If the >> LSM loop nest is not constructed, the loop is optimized out. I >> filed: >> >> https://bugs.openjdk.org/browse/JDK-8297724 >> >> for this issue. > > Roland Westrelin has updated the pull request incrementally with three additional commits since the last revision: > > - more > - more > - review Looks nice! And Tobias's testing results also looks good so far (only known failures). ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11391 From kvn at openjdk.org Thu Dec 1 16:39:06 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 1 Dec 2022 16:39:06 GMT Subject: RFR: 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top [v3] In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 14:44:00 GMT, Christian Hagedorn wrote: >> ![image](https://user-images.githubusercontent.com/17833009/205015579-1b6ac082-e992-4828-80f7-ff991964179b.png) >> >> During CCP, we optimize the type of `348 CastII` in `CastIINode::Value()`: It matches the `CmpI/If` pattern because the current type of `119 Phi` is a constant int: >> https://github.com/openjdk/jdk/blob/9f24a6f43c6a5e1fa92275e0a87af4f1f0603ba3/src/hotspot/share/opto/castnode.cpp#L213-L215 >> >> Later in CCP, the type of `119 Phi` is updated and is no longer a constant but `348 CastII` is not processed anymore during CCP and keeps its wrong too narrow type. We apply more loop opts and at some point, the `CastII` is replaced by top because the input type is outside of the wrong type range of the `CastII`. Some data nodes are folded and the graph is left in a broken state and we assert during GCM. >> >> I propose to add a `CastII` node back to the CCP worklist if we find such a `Cmp/If` pattern to ensure that the `CastII` type is correctly set during CCP. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Move test to ccp Looks good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11448 From aph at openjdk.org Thu Dec 1 16:54:33 2022 From: aph at openjdk.org (Andrew Haley) Date: Thu, 1 Dec 2022 16:54:33 GMT Subject: RFR: JDK-8297968: Crash in PrintOptoAssembly Message-ID: If PrintOptoAssembly is used in an optimized build, we have a crash in `PhaseChaitin::dump_frame()` due to reading from uninitialized memory in the _parm_regs array. The fix is trivial. ------------- Commit messages: - JDK-8297968: Crash in PrintOptoAssembly Changes: https://git.openjdk.org/jdk/pull/11460/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11460&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297968 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11460.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11460/head:pull/11460 PR: https://git.openjdk.org/jdk/pull/11460 From qamai at openjdk.org Thu Dec 1 17:33:38 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 1 Dec 2022 17:33:38 GMT Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long In-Reply-To: References: Message-ID: On Mon, 28 Nov 2022 02:31:25 GMT, Hao Sun wrote: > x86 implemented the intrinsics for compareUnsigned() method in Integer and Long. See JDK-8283726. We add the corresponding AArch64 backend support in this patch. > > Note-1: minor style issues are fixed for CmpL3 related rules. > > Note-2: Jtreg case TestCompareUnsigned.java is updated to cover the matching rules for "comparing reg with imm" case. > > Testing: tier1~3 passed on Linux/AArch64 platform with no new failures. > > Following is the performance data for the JMH case:
>
>                                                      Before            After
> Benchmark                         (size)  Mode  Cnt  Score   Error    Score   Error  Units
> Integers.compareUnsignedDirect       500  avgt    5  0.994 ± 0.001    0.872 ± 0.015  us/op
> Integers.compareUnsignedIndirect     500  avgt    5  0.991 ± 0.001    0.833 ± 0.055  us/op
> Longs.compareUnsignedDirect          500  avgt    5  1.052 ± 0.001    0.974 ± 0.057  us/op
> Longs.compareUnsignedIndirect        500  avgt    5  1.053 ± 0.001    0.916 ± 0.038  us/op

The motivation for these intrinsics, aside from unsigned comparison being a fairly basic operation, is range checks of a load/store, which have the form `0 <= offset && offset <= length - size`. By transforming this into `0 <= length - size && offset u<= length - size`, the first comparison as well as the computation of `length - size` can be hoisted out of the loop, which leaves a single loop-varying operation. The reason this may not be recognised effectively by the idealiser is that if `size` is a constant, `(length - size) + MIN_VALUE` is folded into `length + (MIN_VALUE - size)`, which breaks the pattern.

If we simply want to throw an exception in out-of-bounds cases, then `Preconditions::checkIndex` may suffice. This, however, may not be adequate if:

- We want to do something else. If the hardware does not support masked load, currently we do a load followed by a blend if the whole vector is in bounds, and fall back out of the intrinsic otherwise.
- The bound is not provably loop-invariant, and not obviously non-negative. This may arise in `ArrayList` accesses, where bound checks are performed against the `size` field, which may need to be reloaded on each iteration and is not obviously non-negative to the compiler.

IMO the direct result of the method is less important, because the contract does not promise anything about the exact return value, and the only thing that can be done with it is to compare it with 0, which will certainly be folded into a `CmpU` node.

@shqking I think your benchmark is not good, as the error is higher than the actual difference between samples, and the cost of the division may swallow everything inside the loop.

Thanks a lot.
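The range-check rewrite described in this message can be sketched in plain Java. This is only an illustration, not code from the patch: the class and method names are hypothetical, and it assumes `length` and `size` are both non-negative so that `length - size` cannot overflow.

```java
public class UnsignedRangeCheck {
    // Original form: two signed comparisons, both of which involve offset.
    static boolean signedCheck(int offset, int length, int size) {
        return 0 <= offset && offset <= length - size;
    }

    // Transformed form: the test on (length - size) does not involve offset,
    // so it can be hoisted out of a loop over offset; only the single
    // unsigned comparison remains loop-varying.
    static boolean unsignedCheck(int offset, int length, int size) {
        int limit = length - size; // assumed not to overflow (see lead-in)
        return 0 <= limit && Integer.compareUnsigned(offset, limit) <= 0;
    }

    public static void main(String[] args) {
        int length = 16;
        int size = 4;
        // Compare both forms over negative, in-range and out-of-range offsets.
        for (int offset = -3; offset <= 20; offset++) {
            if (signedCheck(offset, length, size) != unsignedCheck(offset, length, size)) {
                throw new AssertionError("mismatch at offset " + offset);
            }
        }
        System.out.println("both forms agree");
    }
}
```

A negative `offset` reinterpreted as unsigned is larger than any non-negative `limit`, which is why the single unsigned comparison covers both the lower and the upper bound.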
------------- PR: https://git.openjdk.org/jdk/pull/11383 From kvn at openjdk.org Thu Dec 1 18:49:14 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 1 Dec 2022 18:49:14 GMT Subject: RFR: JDK-8297968: Crash in PrintOptoAssembly In-Reply-To: References: Message-ID: <74iHAuIwpS7cc-JXcDRl-OUfpexc9eICkk0-T2DsSLQ=.b6695e97-7675-4ab8-a9a6-59692750c933@github.com> On Thu, 1 Dec 2022 16:46:16 GMT, Andrew Haley wrote: > If PrintOptoAssembly is used in an optimized build, we have a crash in `PhaseChaitin::dump_frame()` due to reading from uninitialized memory in the _parm_regs array. The fix is trivial. Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11460 From aph at openjdk.org Thu Dec 1 20:36:49 2022 From: aph at openjdk.org (Andrew Haley) Date: Thu, 1 Dec 2022 20:36:49 GMT Subject: Integrated: JDK-8297968: Crash in PrintOptoAssembly In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 16:46:16 GMT, Andrew Haley wrote: > If PrintOptoAssembly is used in an optimized build, we have a crash in `PhaseChaitin::dump_frame()` due to reading from uninitialized memory in the _parm_regs array. The fix is trivial. This pull request has now been integrated. Changeset: c69aa42d Author: Andrew Haley URL: https://git.openjdk.org/jdk/commit/c69aa42d02dba4612998d6ecdc57286774da9d33 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8297968: Crash in PrintOptoAssembly Reviewed-by: kvn ------------- PR: https://git.openjdk.org/jdk/pull/11460 From bulasevich at openjdk.org Fri Dec 2 05:20:17 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 2 Dec 2022 05:20:17 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v14] In-Reply-To: References: Message-ID: > The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). 
Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. > > Testing: jtreg hotspot&jdk, Renaissance benchmarks Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 17 commits: - minor api refactoring: start_scope and roll_back instead of position and set_position - buffer() returns const array - cleanup, rename - warning fix - add test for buffer grow - adding jtreg test for CompressedSparseDataReadStream impl - align java impl to cpp impl - rewrite the SparseDataWriteStream not to use _curr_byte - introduce and call flush() excplicitly, add the gtest - minor renaming. adding encoding examples table - ... and 7 more: https://git.openjdk.org/jdk/compare/b035056d...1d6cdb73 ------------- Changes: https://git.openjdk.org/jdk/pull/10025/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=13 Stats: 547 lines in 12 files changed: 519 ins; 0 del; 28 mod Patch: https://git.openjdk.org/jdk/pull/10025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10025/head:pull/10025 PR: https://git.openjdk.org/jdk/pull/10025 From svkamath at openjdk.org Fri Dec 2 06:47:53 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Fri, 2 Dec 2022 06:47:53 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs Message-ID: Hi All, I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 APIs. Following are the performance numbers of JMH micro Fp16ConversionBenchmark:

Before code changes:
Benchmark | (size) | Mode | Cnt | Score | Error | Units
Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms
Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms
Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms
Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms

After:
Benchmark | (size) | Mode | Cnt | Score | Error | Units
Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms
Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms
Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms
Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms

Kindly review and share your feedback. Thanks. Smita ------------- Commit messages: - Auto vectorize half precision floating point conversion APIs Changes: https://git.openjdk.org/jdk/pull/11471/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11471&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8294588 Stats: 153 lines in 7 files changed: 151 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11471.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11471/head:pull/11471 PR: https://git.openjdk.org/jdk/pull/11471 From svkamath at openjdk.org Fri Dec 2 06:47:53 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Fri, 2 Dec 2022 06:47:53 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs In-Reply-To: References: Message-ID: On Fri, 2 Dec 2022 04:22:39 GMT, Smita Kamath wrote: > Hi All, > > I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 APIs. > Following are the performance numbers of JMH micro Fp16ConversionBenchmark:
> Before code changes:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms
>
> After:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms
>
> Kindly review and share your feedback. > > Thanks. > Smita label /hotspot ------------- PR: https://git.openjdk.org/jdk/pull/11471 From shade at openjdk.org Fri Dec 2 08:13:11 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 2 Dec 2022 08:13:11 GMT Subject: RFR: 8297715: RISC-V: C2: Use single-bit instructions from the Zbs extension In-Reply-To: <6sS3mj04mdeRTLoLgicm1C3g0F1nBigke74p1XinQ4U=.9f9f80b2-774e-49d2-8659-225e79b1f4f5@github.com> References: <6sS3mj04mdeRTLoLgicm1C3g0F1nBigke74p1XinQ4U=.9f9f80b2-774e-49d2-8659-225e79b1f4f5@github.com> Message-ID: On Tue, 29 Nov 2022 03:35:12 GMT, Fei Yang wrote: > The single-bit instructions from the Zbs extension provide a mechanism to set, clear, > invert, or extract a single bit in a register. The bit is specified by its index. > > Especially, the single-bit extract (immediate) instruction 'bexti rd, rs1, shamt' [1] performs: > > let index = shamt & (XLEN - 1); > X(rd) = (X(rs1) >> index) & 1; > > > This instruction is a perfect match for following C2 sub-graph when integer immediate 'mask' is power of 2: > > Set dst (Conv2B (AndI src mask)) > > > The effect is that we could then optimize C2 JIT code for methods like [2]: > Before: > > lhu R28, [R11, #12] # short, #@loadUS !
Field: com/sun/org/apache/xerces/internal/dom/NodeImpl.flags > andi R7, R28, #8 #@andI_reg_imm > snez R10, R7 #@convI2Bool > > > After: > > lhu R28, [R11, #12] # short, #@loadUS ! Field: com/sun/org/apache/xerces/internal/dom/NodeImpl.flags > bexti R10, R28, 3 # > > Note that I am not adding a matching rule for long->bool case as I see C2 compiler currently only catches and builds int->bool conversions [3]. I guess there might be no such use cases for long->bool conversion in the real-world. > > Testing: Tier1-3 hotspot & jdk tested with QEMU (JTREG="VM_OPTIONS=-XX:+UnlockExperimentalVMOptions -XX:+UseZbs"). > > [1] https://github.com/riscv/riscv-bitmanip/blob/main/bitmanip/insns/bexti.adoc > > [2] https://github.com/openjdk/jdk/blob/master/src/java.xml/share/classes/com/sun/org/apache/xerces/internal/dom/NodeImpl.java#L1936 > > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1435 Looks reasonable. ------------- Marked as reviewed by shade (Reviewer). PR: https://git.openjdk.org/jdk/pull/11406 From dzhang at openjdk.org Fri Dec 2 08:22:18 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 2 Dec 2022 08:22:18 GMT Subject: RFR: 8297549: RISC-V: Add support for Vector API vector load const operation In-Reply-To: References: Message-ID: <9-VBLHqlQtMxglD1JHxLMF5fN3_aj3NJkBQrj5pGv5Y=.15fb5f5b-a9fe-4d50-9802-34c84e76c3fe@github.com> On Fri, 25 Nov 2022 10:21:42 GMT, Vladimir Kempik wrote: >> The instruction which is matched `VectorLoadConst` will create index starting from 0 and incremented by 1. In detail, the instruction populates the destination vector by setting the first element to 0 and monotonically incrementing the value by 1 for each subsequent element. >> >> We can add support of `VectorLoadConst` for RISC-V by `vid.v` . It was implemented by referring to RVV v1.0 [1]. >> >> We can use the JMH test from https://github.com/openjdk/jdk/pull/10332. 
Tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. By adding the `-XX:+PrintAssembly`, the compilation log of `floatIndexVector` is as follows: >> >> >> 120 vloadcon V2 # generate iota indices >> 12c vfmul.vv V1, V2, V1 #@vmulF >> 134 vfmv.v.f V2, F8 #@replicateF >> 13c vfadd.vv V1, V2, V1 #@vaddF >> >> The above nodes match the logic of `Compute indexes with "vec + iota * scale"` in https://github.com/openjdk/jdk/pull/10332, which is the operation corresponding to `addIndex` in benchmark: >> https://github.com/openjdk/jdk/blob/d6102110e1b48c065292db83744245a33e269cc2/test/micro/org/openjdk/bench/jdk/incubator/vector/IndexVectorBenchmark.java#L92-L97 >> >> At the same time, the following assembly code will be generated when running the `floatIndexVector` case, with one more instruction than `intIndexVector`: >> >> 0x000000401443cc9c: .4byte 0x10072d7 >> 0x000000401443cca0: .4byte 0x5208a157 >> 0x000000401443cca4: .4byte 0x4a219157 >> >> `0x10072d7`/`0x5208a157` is the machine code for `vsetvli`/`vid.v`, and `0x4a219157` is the additional machine code for `vfcvt.f.x.v`, which is emitted when `is_floating_point_type(bt)` holds: >> >> if (is_floating_point_type(bt)) { >> __ vfcvt_f_x_v(as_VectorRegister($dst$$reg), as_VectorRegister($dst$$reg)); >> } >> >> >> After we implement these nodes, by using `-XX:+UseRVV`, the number of assembly instructions is reduced by about 50% because of the different execution paths with the number of loops, similar to `AddTest` [3]. >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc >> [2] https://github.com/openjdk/jdk/blob/857b0f9b05bc711f3282a0da85fcff131fffab91/test/micro/org/openjdk/bench/jdk/incubator/vector/IndexVectorBenchmark.java >> [3] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md >> >> Please take a look and have some reviews. Thanks a lot.
>> >> ## Testing: >> >> - hotspot and jdk tier1 without new failures (release with UseRVV on QEMU) >> - hotspot, jdk and langtools tier2 without new failures (release with UseRVV on QEMU) >> - test/jdk/jdk/incubator/vector/* (fastdebug/release with UseRVV on QEMU) > > Can you also run whole tier2 please ? @VladimirKempik @RealFYang @zifeihan Thanks for the review! ------------- PR: https://git.openjdk.org/jdk/pull/11344 From fyang at openjdk.org Fri Dec 2 08:27:08 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 2 Dec 2022 08:27:08 GMT Subject: RFR: 8297715: RISC-V: C2: Use single-bit instructions from the Zbs extension In-Reply-To: <6sS3mj04mdeRTLoLgicm1C3g0F1nBigke74p1XinQ4U=.9f9f80b2-774e-49d2-8659-225e79b1f4f5@github.com> References: <6sS3mj04mdeRTLoLgicm1C3g0F1nBigke74p1XinQ4U=.9f9f80b2-774e-49d2-8659-225e79b1f4f5@github.com> Message-ID: On Tue, 29 Nov 2022 03:35:12 GMT, Fei Yang wrote: > The single-bit instructions from the Zbs extension provide a mechanism to set, clear, > invert, or extract a single bit in a register. The bit is specified by its index. > > Especially, the single-bit extract (immediate) instruction 'bexti rd, rs1, shamt' [1] performs: > > let index = shamt & (XLEN - 1); > X(rd) = (X(rs1) >> index) & 1; > > > This instruction is a perfect match for following C2 sub-graph when integer immediate 'mask' is power of 2: > > Set dst (Conv2B (AndI src mask)) > > > The effect is that we could then optimize C2 JIT code for methods like [2]: > Before: > > lhu R28, [R11, #12] # short, #@loadUS ! Field: com/sun/org/apache/xerces/internal/dom/NodeImpl.flags > andi R7, R28, #8 #@andI_reg_imm > snez R10, R7 #@convI2Bool > > > After: > > lhu R28, [R11, #12] # short, #@loadUS ! Field: com/sun/org/apache/xerces/internal/dom/NodeImpl.flags > bexti R10, R28, 3 # > > Note that I am not adding a matching rule for long->bool case as I see C2 compiler currently only catches and builds int->bool conversions [3]. 
I guess there might be no such use cases for long->bool conversion in the real-world. > > Testing: Tier1-3 hotspot & jdk tested with QEMU (JTREG="VM_OPTIONS=-XX:+UnlockExperimentalVMOptions -XX:+UseZbs"). > > [1] https://github.com/riscv/riscv-bitmanip/blob/main/bitmanip/insns/bexti.adoc > > [2] https://github.com/openjdk/jdk/blob/master/src/java.xml/share/classes/com/sun/org/apache/xerces/internal/dom/NodeImpl.java#L1936 > > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1435 Thank you all for the review! ------------- PR: https://git.openjdk.org/jdk/pull/11406 From fyang at openjdk.org Fri Dec 2 08:29:34 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 2 Dec 2022 08:29:34 GMT Subject: Integrated: 8297715: RISC-V: C2: Use single-bit instructions from the Zbs extension In-Reply-To: <6sS3mj04mdeRTLoLgicm1C3g0F1nBigke74p1XinQ4U=.9f9f80b2-774e-49d2-8659-225e79b1f4f5@github.com> References: <6sS3mj04mdeRTLoLgicm1C3g0F1nBigke74p1XinQ4U=.9f9f80b2-774e-49d2-8659-225e79b1f4f5@github.com> Message-ID: On Tue, 29 Nov 2022 03:35:12 GMT, Fei Yang wrote: > The single-bit instructions from the Zbs extension provide a mechanism to set, clear, > invert, or extract a single bit in a register. The bit is specified by its index. > > Especially, the single-bit extract (immediate) instruction 'bexti rd, rs1, shamt' [1] performs: > > let index = shamt & (XLEN - 1); > X(rd) = (X(rs1) >> index) & 1; > > > This instruction is a perfect match for following C2 sub-graph when integer immediate 'mask' is power of 2: > > Set dst (Conv2B (AndI src mask)) > > > The effect is that we could then optimize C2 JIT code for methods like [2]: > Before: > > lhu R28, [R11, #12] # short, #@loadUS ! Field: com/sun/org/apache/xerces/internal/dom/NodeImpl.flags > andi R7, R28, #8 #@andI_reg_imm > snez R10, R7 #@convI2Bool > > > After: > > lhu R28, [R11, #12] # short, #@loadUS ! 
Field: com/sun/org/apache/xerces/internal/dom/NodeImpl.flags > bexti R10, R28, 3 # > > Note that I am not adding a matching rule for long->bool case as I see C2 compiler currently only catches and builds int->bool conversions [3]. I guess there might be no such use cases for long->bool conversion in the real-world. > > Testing: Tier1-3 hotspot & jdk tested with QEMU (JTREG="VM_OPTIONS=-XX:+UnlockExperimentalVMOptions -XX:+UseZbs"). > > [1] https://github.com/riscv/riscv-bitmanip/blob/main/bitmanip/insns/bexti.adoc > > [2] https://github.com/openjdk/jdk/blob/master/src/java.xml/share/classes/com/sun/org/apache/xerces/internal/dom/NodeImpl.java#L1936 > > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1435 This pull request has now been integrated. Changeset: d50015af Author: Fei Yang URL: https://git.openjdk.org/jdk/commit/d50015af99f44909bf71fd2de97546d47cda86d6 Stats: 26 lines in 4 files changed: 24 ins; 0 del; 2 mod 8297715: RISC-V: C2: Use single-bit instructions from the Zbs extension Reviewed-by: fjiang, yadongwang, shade ------------- PR: https://git.openjdk.org/jdk/pull/11406 From dzhang at openjdk.org Fri Dec 2 08:33:04 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 2 Dec 2022 08:33:04 GMT Subject: Integrated: 8297549: RISC-V: Add support for Vector API vector load const operation In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 05:40:12 GMT, Dingli Zhang wrote: > The instruction which is matched `VectorLoadConst` will create index starting from 0 and incremented by 1. In detail, the instruction populates the destination vector by setting the first element to 0 and monotonically incrementing the value by 1 for each subsequent element. > > We can add support of `VectorLoadConst` for RISC-V by `vid.v` . It was implemented by referring to RVV v1.0 [1]. > > We can use the JMH test from https://github.com/openjdk/jdk/pull/10332. 
Tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. By adding the `-XX:+PrintAssembly`, the compilation log of `floatIndexVector` is as follows: > > > 120 vloadcon V2 # generate iota indices > 12c vfmul.vv V1, V2, V1 #@vmulF > 134 vfmv.v.f V2, F8 #@replicateF > 13c vfadd.vv V1, V2, V1 #@vaddF > > The above nodes match the logic of `Compute indexes with "vec + iota * scale"` in https://github.com/openjdk/jdk/pull/10332, which is the operation corresponding to `addIndex` in benchmark: > https://github.com/openjdk/jdk/blob/d6102110e1b48c065292db83744245a33e269cc2/test/micro/org/openjdk/bench/jdk/incubator/vector/IndexVectorBenchmark.java#L92-L97 > > At the same time, the following assembly code will be generated when running the `floatIndexVector` case, with one more instruction than `intIndexVector`: > > 0x000000401443cc9c: .4byte 0x10072d7 > 0x000000401443cca0: .4byte 0x5208a157 > 0x000000401443cca4: .4byte 0x4a219157 > > `0x10072d7`/`0x5208a157` is the machine code for `vsetvli`/`vid.v`, and `0x4a219157` is the additional machine code for `vfcvt.f.x.v`, which is emitted when `is_floating_point_type(bt)` holds: > > if (is_floating_point_type(bt)) { > __ vfcvt_f_x_v(as_VectorRegister($dst$$reg), as_VectorRegister($dst$$reg)); > } > > > After we implement these nodes, by using `-XX:+UseRVV`, the number of assembly instructions is reduced by about 50% because of the different execution paths with the number of loops, similar to `AddTest` [3]. > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/857b0f9b05bc711f3282a0da85fcff131fffab91/test/micro/org/openjdk/bench/jdk/incubator/vector/IndexVectorBenchmark.java > [3] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md > > Please take a look and have some reviews. Thanks a lot.
> > ## Testing: > > - hotspot and jdk tier1 without new failures (release with UseRVV on QEMU) > - hotspot, jdk and langtools tier2 without new failures (release with UseRVV on QEMU) > - test/jdk/jdk/incubator/vector/* (fastdebug/release with UseRVV on QEMU) This pull request has now been integrated. Changeset: 687fd714 Author: Dingli Zhang Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/687fd714bbc390f486272e05452f038bc3631be1 Stats: 18 lines in 1 file changed: 17 ins; 1 del; 0 mod 8297549: RISC-V: Add support for Vector API vector load const operation Reviewed-by: fyang, gcao ------------- PR: https://git.openjdk.org/jdk/pull/11344 From rcastanedalo at openjdk.org Fri Dec 2 08:47:06 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 2 Dec 2022 08:47:06 GMT Subject: RFR: 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top [v3] In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 14:36:05 GMT, Christian Hagedorn wrote: >> src/hotspot/share/opto/phaseX.cpp line 1967: >> >>> 1965: push_if_not_bottom_type(worklist, cast_ii); >>> 1966: } >>> 1967: } >> >> Would it make sense to assume there is at most one `CastII` output and replace the loop with a call to the auxiliary function `Node* Node::find_out_with(int opcode)`? > > Unfortunately, in this test case, we have 2 `CastII` nodes that need to be re-added. So, I think that `find_out_with()` does not work here in general (even though in this test case, both `CastII` nodes would end up again on the worklist but that is probably not guaranteed in general). I see, thanks for the explanation! 
------------- PR: https://git.openjdk.org/jdk/pull/11448 From thartmann at openjdk.org Fri Dec 2 08:50:13 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 2 Dec 2022 08:50:13 GMT Subject: RFR: 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node [v2] In-Reply-To: References: Message-ID: On Wed, 30 Nov 2022 10:19:42 GMT, Roland Westrelin wrote: >> A main loop loses its pre loop. The Opaque1 node for the zero trip >> guard of the main loop is assigned control at a Region through which >> an If is split. As a result, the Opaque1 is cloned and the zero trip >> guard takes a Phi that merges Opaque1 nodes. One of the branches dies >> next and, as a result, the zero trip guard has an Opaque1 as input but >> at the wrong CmpI input. The assert fires next. >> >> The fix I propose is that if an Opaque1 node that is part of a zero >> trip guard is encountered during split if, rather than split if up or >> down, instead, assign it the control of the zero trip guard's >> control. This way the pattern of the zero trip guard is unaffected and >> split if can proceed. I believe it's safe to assign it a later >> control: >> >> - an Opaque1 can't be shared >> >> - the zero trip guard can't be the If that's being split >> >> As Vladimir noted, this bug used to not reproduce with loop strip >> mining disabled but now always reproduces because the loop >> strip mining nest is always constructed. The reason is that the >> main loop in this test is kept alive by the LSM safepoint. If the >> LSM loop nest is not constructed, the loop is optimized out. I >> filed: >> >> https://bugs.openjdk.org/browse/JDK-8297724 >> >> for this issue. > > Roland Westrelin has updated the pull request incrementally with three additional commits since the last revision: > > - more > - more > - review Looks good to me. Testing passed. src/hotspot/share/opto/subnode.cpp line 1456: > 1454: // Do not muck with Opaque1 nodes, as this indicates a loop > 1455: // guard that cannot change shape.
> 1456: if( con->is_Con() && !cmp2->is_Con() && cmp2_op != Op_OpaqueZeroTripGuard && Suggestion: if (con->is_Con() && !cmp2->is_Con() && cmp2_op != Op_OpaqueZeroTripGuard && ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11391 From rcastanedalo at openjdk.org Fri Dec 2 08:50:54 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 2 Dec 2022 08:50:54 GMT Subject: RFR: 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top [v3] In-Reply-To: References: Message-ID: <5blRganvNVE8iNVJ6AT-DP4WpNaVnVw0oyd24TPF2Tc=.3d8ec2f3-5cc6-40ae-bf29-e206f339abc7@github.com> On Thu, 1 Dec 2022 14:44:00 GMT, Christian Hagedorn wrote: >> ![image](https://user-images.githubusercontent.com/17833009/205015579-1b6ac082-e992-4828-80f7-ff991964179b.png) >> >> During CCP, we optimize the type of `348 CastII` in `CastIINode::Value()`: It matches the `CmpI/If` pattern because the current type of `119 Phi` is a constant int: >> https://github.com/openjdk/jdk/blob/9f24a6f43c6a5e1fa92275e0a87af4f1f0603ba3/src/hotspot/share/opto/castnode.cpp#L213-L215 >> >> Later in CCP, the type of `119 Phi` is updated and is no longer a constant but `348 CastII` is not processed anymore during CCP and keeps its wrong too narrow type. We apply more loop opts and at some point, the `CastII` is replaced by top because the input type is outside of the wrong type range of the `CastII`. Some data nodes are folded and the graph is left in a broken state and we assert during GCM. >> >> I propose to add a `CastII` node back to the CCP worklist if we find such a `Cmp/If` pattern to ensure that the `CastII` type is correctly set during CCP. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Move test to ccp Thanks for addressing my feedback, Christian! 
Just one thing, you probably want to update the test package name for consistency. test/hotspot/jtreg/compiler/ccp/TestCastIIWrongTypeCCP.java line 29: > 27: * @summary Test that CastII nodes are added to the CCP worklist if they could have been > 28: * optimized due to a CmpI/If pattern. > 29: * @run main/othervm -Xcomp -XX:CompileCommand=compileonly,compiler.c2.TestCastIIWrongTypeCCP::* Suggestion: * @run main/othervm -Xcomp -XX:CompileCommand=compileonly,compiler.ccp.TestCastIIWrongTypeCCP::* test/hotspot/jtreg/compiler/ccp/TestCastIIWrongTypeCCP.java line 30: > 28: * optimized due to a CmpI/If pattern. > 29: * @run main/othervm -Xcomp -XX:CompileCommand=compileonly,compiler.c2.TestCastIIWrongTypeCCP::* > 30: * compiler.c2.TestCastIIWrongTypeCCP Suggestion: * compiler.ccp.TestCastIIWrongTypeCCP test/hotspot/jtreg/compiler/ccp/TestCastIIWrongTypeCCP.java line 32: > 30: * compiler.c2.TestCastIIWrongTypeCCP > 31: */ > 32: package compiler.c2; Suggestion: package compiler.ccp; ------------- Changes requested by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/11448 From epeter at openjdk.org Fri Dec 2 09:20:44 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 2 Dec 2022 09:20:44 GMT Subject: RFR: 8297642: PhaseIdealLoop::only_has_infinite_loops must detect all loops that never lead to termination Message-ID: **Will hold this back until JDK21**, unless we decide it is a regression-fix for [JDK-8294217](https://bugs.openjdk.org/browse/JDK-8294217). The problem is only a not-quite-correct assert. But the problem is not limited to infinite loops, as the example below shows it can happen with reducible loops. **Background:** We have an assert that checks that `has_loops` is true when it should be. If we have `has_loops == false` even though there are loops, we will not perform loop-opts in `Compile::Optimize`. 
https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/loopnode.cpp#L4285-L4293 Generally, we want to verify, that if we just found loops (`_ltree_root->_child != NULL`) that `has_loops == true`. There are a few cases where we do not care if we miss loop-opts: - We only have infinite loops (`only_has_infinite_loops()`). Infinite loops never terminate anyway, so why make them faster? Plus, a loop is only infinite if it has no loop-exit other than a `NeverBranch` exit, even uncommon traps, loop-limit checks etc are exits. Thus, if a loop does anything interesting, it probably is not such a "true infinite loop". They can be more easily forced to occur by setting `-XX:PerMethodTrapLimit=0`. - We have only exception edges. Note that once we check the assert, we update `has_loops`. So if all loops disappeared, we avoid doing loop-opts henceforth. **Current implementation of PhaseIdealLoop::only_has_infinite_loops** https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/loopnode.cpp#L4183-L4185 We check for loop exits, if there is one the loop should not be infinite. **The Problem** An irreducible loop can have an inner loop, that subsequently loses its exit. It becomes its own irreducible loop, and floats out of the outer loop. Where the outer loop enters into the former inner loop, we now have a loop-exit for the outer loop. The next time we run `build_loop_tree` and check the assert, it can fail, as `PhaseIdealLoop::only_has_infinite_loops` finds that new loop-exit from outer to inner loop. Example: `TestOnlyInfiniteLoops::test_simple` (click on images to see them larger) Nested infinite loop before loop-opts: After `build_loop_tree`, the outer loop is detected as infinite, and `NeverBranch` is inserted. No loop is attached to loop-tree, as we do not attach newly discovered infinite loops. We will set `has_loops == false` after first loop-opts round. 
During IGVN of the first loop-opts round, some edges die. `88 IfTrue` is dominated by `52 IfTrue` (dominator info only becomes present during loop-opts). The outer loop now exits into the inner loop. The second loop-opts round detects the former inner loop as an infinite loop and inserts a NeverBranch. Once we run the assert, we see that we have `has_loops == false`, but `PhaseIdealLoop::only_has_infinite_loops` finds the former outer loop's exit. **Solution** If we ever only have infinite loops, then there will never be a way to get from any of those loops down to Root, except through a NeverBranch exit. So even if such an (outer) infinite loop ever has an exit, that exit cannot ever lead to Root, other than a NeverBranch exit. Thus, it is ok to still consider that loop as "infinite", even though it itself has an exit - that exit will never lead to termination. Thus, I changed the `PhaseIdealLoop::only_has_infinite_loops` to check if any of the loops ever connect down to Root, except through NeverBranch nodes. **Alternative Fix** An alternative idea to my fix here: just replace the infinite loop with an uncommon trap, and if the infinite loop is ever hit, revert to the interpreter. If we do not care to optimize infinite loops, then why even compile them? Advantages of that idea: No need for `NeverBranch`, no need for special-casing infinite loops.
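The check described under **Solution** can be pictured as a reachability query. The following Java sketch is a simplified model and not the HotSpot implementation: the control-flow graph is given as plain successor arrays, and the names `InfiniteLoopSketch`, `onlyInfiniteLoops`, `inLoop` and `isNeverBranch` are hypothetical. It starts from all loop nodes, refuses to walk through NeverBranch nodes, and reports whether Root stays unreachable.

```java
// Simplified model: nodes are 0..n-1, succs[i] lists control successors of i.
final class InfiniteLoopSketch {
    // Returns true if no loop node can reach root without passing
    // through a NeverBranch node, i.e. all loops are "infinite".
    static boolean onlyInfiniteLoops(int[][] succs, boolean[] inLoop,
                                     boolean[] isNeverBranch, int root) {
        int n = succs.length;
        boolean[] seen = new boolean[n];
        int[] stack = new int[n]; // each node is pushed at most once
        int top = 0;
        for (int i = 0; i < n; i++) {
            if (inLoop[i] && !isNeverBranch[i]) {
                seen[i] = true;
                stack[top++] = i;
            }
        }
        while (top > 0) {
            int cur = stack[--top];
            if (cur == root) {
                return false; // a real path to termination exists
            }
            for (int s : succs[cur]) {
                if (isNeverBranch[s] || seen[s]) {
                    continue; // never walk through a NeverBranch exit
                }
                seen[s] = true;
                stack[top++] = s;
            }
        }
        return true;
    }
}
```

In this model, an outer loop that exits into a former inner loop is still treated as infinite as long as the only way on to Root leads through a NeverBranch node, which matches the scenario in `TestOnlyInfiniteLoops::test_simple` as described in the text.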
I'm looking forward to your feedback, Emanuel ------------- Commit messages: - 8297642: PhaseIdealLoop::only_has_infinite_loops must detect all loops that never lead to termination Changes: https://git.openjdk.org/jdk/pull/11473/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11473&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297642 Stats: 184 lines in 3 files changed: 153 ins; 9 del; 22 mod Patch: https://git.openjdk.org/jdk/pull/11473.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11473/head:pull/11473 PR: https://git.openjdk.org/jdk/pull/11473 From thartmann at openjdk.org Fri Dec 2 09:51:05 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 2 Dec 2022 09:51:05 GMT Subject: RFR: 8297951: C2: Create skeleton predicates for all If nodes in loop predication In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 14:17:47 GMT, Christian Hagedorn wrote: > We currently only create skeleton predicates for `RangeCheck` nodes and not for normal `If` nodes: > https://github.com/openjdk/jdk/blob/2cb64a75578ccc15a1dfc8c2843aa11d05ca8aa7/src/hotspot/share/opto/loopPredicate.cpp#L1344-L1346 > > But it is also possible to create range check predicates in loop predication for `If` nodes if they have the right pattern checked in `PhaseIdealLoop::loop_predication_impl()` and `IdealLoopTree::is_range_check_if()`. This, however, is much more rare. > > Without skeleton predicates for these `If` nodes, we could run into the same problems already fixed for `RangeCheck` nodes (see [JDK-8193130](https://bugs.openjdk.org/browse/JDK-8193130) and related bugs). This is almost impossible to trigger in practice as it needs a very specific setup and the right optimizations to be applied. But the test case shows such a case where we hit an assert due to a broken memory graph because we are missing skeleton predicates. > > I therefore propose to always create skeleton predicates for hoisted range checks in loop predication. 
> > Thanks, > Christian Looks reasonable to me. Great that you were able to find a test! ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11454 From bulasevich at openjdk.org Fri Dec 2 10:34:09 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 2 Dec 2022 10:34:09 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v15] In-Reply-To: References: Message-ID: > The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. > > Testing: jtreg hotspot&jdk, Renaissance benchmarks Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 17 commits: - minor api refactoring: start_scope and roll_back instead of position and set_position - buffer() returns const array - cleanup, rename - warning fix - add test for buffer grow - adding jtreg test for CompressedSparseDataReadStream impl - align java impl to cpp impl - rewrite the SparseDataWriteStream not to use _curr_byte - introduce and call flush() excplicitly, add the gtest - minor renaming. adding encoding examples table - ... 
and 7 more: https://git.openjdk.org/jdk/compare/3ce00421...75ae5808 ------------- Changes: https://git.openjdk.org/jdk/pull/10025/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=14 Stats: 547 lines in 12 files changed: 519 ins; 0 del; 28 mod Patch: https://git.openjdk.org/jdk/pull/10025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10025/head:pull/10025 PR: https://git.openjdk.org/jdk/pull/10025 From fjiang at openjdk.org Fri Dec 2 11:28:06 2022 From: fjiang at openjdk.org (Feilong Jiang) Date: Fri, 2 Dec 2022 11:28:06 GMT Subject: RFR: 8297953: Fix several C2 IR matching tests for RISC-V In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 14:05:52 GMT, Fei Yang wrote: >> Fix several IR matching tests that failed on RISC-V. >> >> Rotate Node will be matched only when UseZbb is enabled: >> - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeIntIdealizationTests.java >> - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeLongIdealizationTests.java >> >> RISC-V does not provide float branch instruction, so we do not match CMOVEI for two floating-point comparisons: >> - test/hotspot/jtreg/compiler/c2/irTests/TestFPComparison.java >> >> Testing: >> - test/hotspot/jtreg/compiler/c2/irTests/TestFPComparison.java -- no tests selected as expected. >> >> - With `-XX:+UseZbb`: >> - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeIntIdealizationTests.java -- passed >> - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeLongIdealizationTests.java -- passed >> >> - With `-XX:-UseZbb`: >> - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeIntIdealizationTests.java -- no tests selected as expected >> - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeLongIdealizationTests.java -- no tests selected as expected > > Looks reasonable to me. Thanks. @RealFYang @chhagedorn -- Thanks for looking at this. 
------------- PR: https://git.openjdk.org/jdk/pull/11453 From chagedorn at openjdk.org Fri Dec 2 12:23:13 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 2 Dec 2022 12:23:13 GMT Subject: RFR: 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top [v4] In-Reply-To: References: Message-ID: <54qqrW5YJli-O7QaH_RD0i--5UDxfLGhUn1ri7Wob6U=.31a739d7-0533-43fb-9ba5-4af28486655a@github.com> > ![image](https://user-images.githubusercontent.com/17833009/205015579-1b6ac082-e992-4828-80f7-ff991964179b.png) > > During CCP, we optimize the type of `348 CastII` in `CastIINode::Value()`: It matches the `CmpI/If` pattern because the current type of `119 Phi` is a constant int: > https://github.com/openjdk/jdk/blob/9f24a6f43c6a5e1fa92275e0a87af4f1f0603ba3/src/hotspot/share/opto/castnode.cpp#L213-L215 > > Later in CCP, the type of `119 Phi` is updated and is no longer a constant but `348 CastII` is not processed anymore during CCP and keeps its wrong too narrow type. We apply more loop opts and at some point, the `CastII` is replaced by top because the input type is outside of the wrong type range of the `CastII`. Some data nodes are folded and the graph is left in a broken state and we assert during GCM. > > I propose to add a `CastII` node back to the CCP worklist if we find such a `Cmp/If` pattern to ensure that the `CastII` type is correctly set during CCP. 
> > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Apply suggestions from code review Roberto's review Co-authored-by: Roberto Casta?eda Lozano ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11448/files - new: https://git.openjdk.org/jdk/pull/11448/files/f9dcf645..a55d039c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11448&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11448&range=02-03 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/11448.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11448/head:pull/11448 PR: https://git.openjdk.org/jdk/pull/11448 From chagedorn at openjdk.org Fri Dec 2 12:23:14 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 2 Dec 2022 12:23:14 GMT Subject: RFR: 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top [v3] In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 14:44:00 GMT, Christian Hagedorn wrote: >> ![image](https://user-images.githubusercontent.com/17833009/205015579-1b6ac082-e992-4828-80f7-ff991964179b.png) >> >> During CCP, we optimize the type of `348 CastII` in `CastIINode::Value()`: It matches the `CmpI/If` pattern because the current type of `119 Phi` is a constant int: >> https://github.com/openjdk/jdk/blob/9f24a6f43c6a5e1fa92275e0a87af4f1f0603ba3/src/hotspot/share/opto/castnode.cpp#L213-L215 >> >> Later in CCP, the type of `119 Phi` is updated and is no longer a constant but `348 CastII` is not processed anymore during CCP and keeps its wrong too narrow type. We apply more loop opts and at some point, the `CastII` is replaced by top because the input type is outside of the wrong type range of the `CastII`. Some data nodes are folded and the graph is left in a broken state and we assert during GCM. 
>> >> I propose to add a `CastII` node back to the CCP worklist if we find such a `Cmp/If` pattern to ensure that the `CastII` type is correctly set during CCP. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Move test to ccp Oh right, good catch, thanks! I've updated it with your suggestions. ------------- PR: https://git.openjdk.org/jdk/pull/11448 From richard.reingruber at sap.com Fri Dec 2 12:32:15 2022 From: richard.reingruber at sap.com (Reingruber, Richard) Date: Fri, 2 Dec 2022 12:32:15 +0000 Subject: C2, ThreadLocalNode, and Loom In-Reply-To: <666c7f0a-eaf8-7ccb-7233-fb5fe20d0ac8@redhat.com> References: <666c7f0a-eaf8-7ccb-7233-fb5fe20d0ac8@redhat.com> Message-ID: Hi Aleksey, > Any other ideas? Maybe the metadata section of a compiled frame could be extended with a slot for the JavaThread pointer which is then used for spilling? It could then be fix when thawing frames. Richard. From: hotspot-compiler-dev on behalf of Aleksey Shipilev Date: Thursday, 24. November 2022 at 18:25 To: hotspot compiler Subject: C2, ThreadLocalNode, and Loom Hi, In Loom x86_32 port, I am following up on the remaining C2 bugs. I believe one bug can be summarized as follows. C2 models thread-locals with ThreadLocalNode (TLN). TLN is effectively a constant node: its only input is root, and it hashes like a normal node. This was logically sound for decades, because the code never switched the threads. Therefore, code is free to treat thread address as constant. In Loom, this guarantee no longer holds. If we ever store the "old" value of TLN node somewhere and reuse it past the potential yield-resume point, we end up using the *wrong* thread. How is this not a problem we saw before? On most architectures, we have the dedicated thread register, and TLN matches to it. That dedicated thread register holds the true current thread already. 
AFAICS, most TLN uses go straight to various AddP-s, so we are reasonably safe that no naked TLS addresses are stored, and the majority (all?) uses reference that thread register. (I am not sure what protects us from accidentally "caching" the thread register into an ad hoc one. It would make little sense from a performance/compiler standpoint, but I cannot yet see what theoretically prevents it in C2 code.) On x86_32, however, TLN matches to a full MacroAssembler::get_thread call, and there storing the thread address into an ad hoc register is a normal thing to do. Reusing that register over the continuation switch points visibly breaks x86_32. This usually manifests as heap corruption because multiple threads stomp over foreign TLABs, or as a failure in runtime GC code. Current failures in the Loom x86_32 port, for example, can be easily reproduced by adding a simple assert in any G1 runtime method that pulls the (wrong) thread (mis)loaded from the C2 barrier: // G1 pre write barrier slowpath JRT_LEAF(void, G1BarrierSetRuntime::write_ref_field_pre_entry(oopDesc* orig, JavaThread* thread)) assert(thread == JavaThread::current(), "write_ref_field_pre_entry sanity"); So, while this manifests on x86_32, I think this is a symptom of a larger problem with assuming TLN const-ness. At this point, I am trying to come up with solutions: 1) Dodge the problem in x86_32 and then pretend all arches have dedicated thread registers Sacrifice one x86_32 register for the thread address. This would likely penalize performance a little bit, because x86_32 does not have lots of general purpose registers to begin with. We can probably try and go to FS/GS instead of carrying the address in the register; but I don't know how much work that would entail. This feels like a cowardly way out, and it would still break any future arch that does not have dedicated thread registers. And, it would break if we ever replace any ThreadLocalNode with the call to Thread::current().
2) Remodel ThreadLocalNode as non-constant What partially solves the problem: saying that ThreadLocalNode::hash() is NO_HASH. AFAICS, this successfully prevents collapsing TLNs in the same compilation. This still does not solve the case where a single TLN gets yanked to the earliest block and its value cached in the register. AFAIU, we only want to make sure that TLN is reloaded after the potential continuation yield, which also serves as the point of return. Since continuation yields are modeled as calls, and calls produce both control and memory, we might need to hook up TLN to either control or memory. I tried to hook up the current control to every TLN node [1]. It works with a few wrinkles, but the patch shows there are ripple effects throughout C2 code, and it sometimes breaks the graph. Some pattern matching code (for example AddP matching code in EA) also asserts, probably assuming that TLNs have no inputs. I suspect other places might have implicit dependencies like these as well. This would be the inevitable consequence for any patch that changes ThreadLocalNode inputs/outputs. 3) Some other easy way out I am overlooking? Any other ideas? -- Thanks, -Aleksey [1] https://cr.openjdk.java.net/~shade/loom/x86_32/tln-ctrl-1.patch
From chagedorn at openjdk.org Fri Dec 2 12:38:07 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 2 Dec 2022 12:38:07 GMT Subject: RFR: 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node [v2] In-Reply-To: References: Message-ID: On Wed, 30 Nov 2022 10:19:42 GMT, Roland Westrelin wrote: >> A main loop loses its pre loop. The Opaque1 node for the zero trip >> guard of the main loop is assigned control at a Region through which >> an If is split. As a result, the Opaque1 is cloned and the zero trip >> guard takes a Phi that merges Opaque1 nodes. One of the branches dies >> next and, as a result, the zero trip guard has an Opaque1 as input but >> at the wrong CmpI input. The assert fires next. >> >> The fix I propose is that if an Opaque1 node that is part of a zero >> trip guard is encountered during split if, rather than split if up or >> down, instead, assign it the control of the zero trip guard's >> control. This way the pattern of the zero trip guard is unaffected and >> split if can proceed. I believe it's safe to assign it a later >> control: >> >> - an Opaque1 can't be shared >> >> - the zero trip guard can't be the If that's being split >> >> As Vladimir noted, this bug used to not reproduce with loop strip >> mining disabled but now always reproduces because the loop >> strip mining nest is always constructed. The reason is that the >> main loop in this test is kept alive by the LSM safepoint. If the >> LSM loop nest is not constructed, the loop is optimized out. I >> filed: >> >> https://bugs.openjdk.org/browse/JDK-8297724 >> >> for this issue. > > Roland Westrelin has updated the pull request incrementally with three additional commits since the last revision: > > - more > - more > - review That looks reasonable to me. Good idea to introduce a new opaque node type! ------------- Marked as reviewed by chagedorn (Reviewer).
PR: https://git.openjdk.org/jdk/pull/11391 From rcastanedalo at openjdk.org Fri Dec 2 12:51:11 2022 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 2 Dec 2022 12:51:11 GMT Subject: RFR: 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top [v4] In-Reply-To: <54qqrW5YJli-O7QaH_RD0i--5UDxfLGhUn1ri7Wob6U=.31a739d7-0533-43fb-9ba5-4af28486655a@github.com> References: <54qqrW5YJli-O7QaH_RD0i--5UDxfLGhUn1ri7Wob6U=.31a739d7-0533-43fb-9ba5-4af28486655a@github.com> Message-ID: On Fri, 2 Dec 2022 12:23:13 GMT, Christian Hagedorn wrote: >> ![image](https://user-images.githubusercontent.com/17833009/205015579-1b6ac082-e992-4828-80f7-ff991964179b.png) >> >> During CCP, we optimize the type of `348 CastII` in `CastIINode::Value()`: It matches the `CmpI/If` pattern because the current type of `119 Phi` is a constant int: >> https://github.com/openjdk/jdk/blob/9f24a6f43c6a5e1fa92275e0a87af4f1f0603ba3/src/hotspot/share/opto/castnode.cpp#L213-L215 >> >> Later in CCP, the type of `119 Phi` is updated and is no longer a constant but `348 CastII` is not processed anymore during CCP and keeps its wrong too narrow type. We apply more loop opts and at some point, the `CastII` is replaced by top because the input type is outside of the wrong type range of the `CastII`. Some data nodes are folded and the graph is left in a broken state and we assert during GCM. >> >> I propose to add a `CastII` node back to the CCP worklist if we find such a `Cmp/If` pattern to ensure that the `CastII` type is correctly set during CCP. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Apply suggestions from code review > > Roberto's review > > Co-authored-by: Roberto Castañeda Lozano Marked as reviewed by rcastanedalo (Reviewer).
------------- PR: https://git.openjdk.org/jdk/pull/11448 From fjiang at openjdk.org Fri Dec 2 12:52:04 2022 From: fjiang at openjdk.org (Feilong Jiang) Date: Fri, 2 Dec 2022 12:52:04 GMT Subject: Integrated: 8297953: Fix several C2 IR matching tests for RISC-V In-Reply-To: References: Message-ID: <3szEl9Z46ihHsQVo6VR7p_e1xLo2luEoKBRD6QAylzU=.a0815418-ba4e-4f04-9dd0-cb64e6ff61f4@github.com> On Thu, 1 Dec 2022 13:50:23 GMT, Feilong Jiang wrote: > Fix several IR matching tests that failed on RISC-V. > > Rotate Node will be matched only when UseZbb is enabled: > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeIntIdealizationTests.java > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeLongIdealizationTests.java > > RISC-V does not provide float branch instruction, so we do not match CMOVEI for two floating-point comparisons: > - test/hotspot/jtreg/compiler/c2/irTests/TestFPComparison.java > > Testing: > - test/hotspot/jtreg/compiler/c2/irTests/TestFPComparison.java -- no tests selected as expected. > > - With `-XX:+UseZbb`: > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeIntIdealizationTests.java -- passed > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeLongIdealizationTests.java -- passed > > - With `-XX:-UseZbb`: > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeIntIdealizationTests.java -- no tests selected as expected > - test/hotspot/jtreg/compiler/c2/irTests/RotateLeftNodeLongIdealizationTests.java -- no tests selected as expected This pull request has now been integrated. 
Changeset: 227364d5 Author: Feilong Jiang Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/227364d5927f94764fdb84f7d0b4c88c8dc25d89 Stats: 4 lines in 4 files changed: 2 ins; 0 del; 2 mod 8297953: Fix several C2 IR matching tests for RISC-V Reviewed-by: fyang, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/11453 From chagedorn at openjdk.org Fri Dec 2 12:56:15 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 2 Dec 2022 12:56:15 GMT Subject: RFR: 8297951: C2: Create skeleton predicates for all If nodes in loop predication In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 14:17:47 GMT, Christian Hagedorn wrote: > We currently only create skeleton predicates for `RangeCheck` nodes and not for normal `If` nodes: > https://github.com/openjdk/jdk/blob/2cb64a75578ccc15a1dfc8c2843aa11d05ca8aa7/src/hotspot/share/opto/loopPredicate.cpp#L1344-L1346 > > But it is also possible to create range check predicates in loop predication for `If` nodes if they have the right pattern checked in `PhaseIdealLoop::loop_predication_impl()` and `IdealLoopTree::is_range_check_if()`. This, however, is much more rare. > > Without skeleton predicates for these `If` nodes, we could run into the same problems already fixed for `RangeCheck` nodes (see [JDK-8193130](https://bugs.openjdk.org/browse/JDK-8193130) and related bugs). This is almost impossible to trigger in practice as it needs a very specific setup and the right optimizations to be applied. But the test case shows such a case where we hit an assert due to a broken memory graph because we are missing skeleton predicates. > > I therefore propose to always create skeleton predicates for hoisted range checks in loop predication. > > Thanks, > Christian Thanks Tobias for your review! 
------------- PR: https://git.openjdk.org/jdk/pull/11454 From chagedorn at openjdk.org Fri Dec 2 12:58:23 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 2 Dec 2022 12:58:23 GMT Subject: RFR: 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top [v4] In-Reply-To: <54qqrW5YJli-O7QaH_RD0i--5UDxfLGhUn1ri7Wob6U=.31a739d7-0533-43fb-9ba5-4af28486655a@github.com> References: <54qqrW5YJli-O7QaH_RD0i--5UDxfLGhUn1ri7Wob6U=.31a739d7-0533-43fb-9ba5-4af28486655a@github.com> Message-ID: On Fri, 2 Dec 2022 12:23:13 GMT, Christian Hagedorn wrote: >> ![image](https://user-images.githubusercontent.com/17833009/205015579-1b6ac082-e992-4828-80f7-ff991964179b.png) >> >> During CCP, we optimize the type of `348 CastII` in `CastIINode::Value()`: It matches the `CmpI/If` pattern because the current type of `119 Phi` is a constant int: >> https://github.com/openjdk/jdk/blob/9f24a6f43c6a5e1fa92275e0a87af4f1f0603ba3/src/hotspot/share/opto/castnode.cpp#L213-L215 >> >> Later in CCP, the type of `119 Phi` is updated and is no longer a constant but `348 CastII` is not processed anymore during CCP and keeps its wrong too narrow type. We apply more loop opts and at some point, the `CastII` is replaced by top because the input type is outside of the wrong type range of the `CastII`. Some data nodes are folded and the graph is left in a broken state and we assert during GCM. >> >> I propose to add a `CastII` node back to the CCP worklist if we find such a `Cmp/If` pattern to ensure that the `CastII` type is correctly set during CCP. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Apply suggestions from code review > > Roberto's review > > Co-authored-by: Roberto Castañeda Lozano Thanks Roberto, Tobias, and Vladimir for your reviews!
------------- PR: https://git.openjdk.org/jdk/pull/11448 From epeter at openjdk.org Fri Dec 2 13:55:14 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 2 Dec 2022 13:55:14 GMT Subject: RFR: 8296389: C2: PhaseCFG::convert_NeverBranch_to_Goto must handle both orders of successors Message-ID: **Targeted for JDK21**, since this is not a new regression, but rather an old bug. P3 because it creates a `SIGSEGV` in product builds. The code in `PhaseCFG::convert_NeverBranch_to_Goto` looks like it is ready to have `idx == 1`, but it is not. We would read `succ` from `_succs[1]`. https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L626 Then overwrite `_succs[0]` with `succ`, and shorten the array. https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L635-L636 And finally attempt to read `dead` from `_succs[0]`, where the dead block used to be, but was just overwritten. https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L645 **Solution** Read `dead` before overwriting it. I also made it more robust by going via the projections, and not assuming that the projections and successors are ordered equally (though that is probably guaranteed by the matching traversal). **Why did we never hit this bug before?** Normal case: during matching, "succ" projection is added as output of NeverBranch before the "dead" projection leading to Halt. Thus, the outputs of NeverBranch are normally [[ "succ", "dead" ]], hence `idx == 0`. Details: During DFS, usually we go from Halt to NeverBranch. Then via Region/Loop, take backedge, and find the "succ" edge. We already have its inputs (NeverBranch), thus we can now post-visit the live edge, and attach it to the NeverBranch first. Later, once we have processed the whole infinite loop, we post-visit out of NeverBranch to the "dead" projection edge, which we attach second.
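The read-after-overwrite described under **Solution** can be sketched in a few lines of Java. This is only a hypothetical model, not the HotSpot code: the class and method names are made up, successors are plain strings, and the assumption that `dead` lives at index `1 - idx` is inferred from the description above rather than taken from `block.cpp`.

```java
// Hypothetical model of the successor-array rewrite; names are illustrative, not HotSpot code.
final class NeverBranchSketch {

    // Problematic order: slot 0 is overwritten first, "dead" is read afterwards.
    static String deadReadAfterOverwrite(String[] succs, int idx) {
        String succ = succs[idx];
        succs[0] = succ;          // slot 0 now holds the live successor
        return succs[1 - idx];    // assumed position of the dead block
    }

    // Safer order: "dead" is read before slot 0 is overwritten.
    static String deadReadBeforeOverwrite(String[] succs, int idx) {
        String succ = succs[idx];
        String dead = succs[1 - idx];
        succs[0] = succ;
        return dead;
    }

    public static void main(String[] args) {
        // Normal order [succ, dead], idx == 0: both variants find the dead block.
        System.out.println(deadReadAfterOverwrite(new String[] {"succ", "dead"}, 0));
        System.out.println(deadReadBeforeOverwrite(new String[] {"succ", "dead"}, 0));
        // Rare order [dead, succ], idx == 1: only the second variant still sees "dead".
        System.out.println(deadReadAfterOverwrite(new String[] {"dead", "succ"}, 1));
        System.out.println(deadReadBeforeOverwrite(new String[] {"dead", "succ"}, 1));
    }
}
```

In this model the two orders only differ when `idx == 1`, which matches the observation that the normal `[[ "succ", "dead" ]]` ordering never exposed the problem.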
Rare case: "dead" projection is first attached to NeverBranch, and "succ" projection is added second. We have [[ "dead", "succ" ]], hence `idx == 1`. We have a peeled infinite loop. The NeverBranch of the peeled iteration is first visited via the "dead" projection from HaltNode. Since the peeled iteration has no backedge, we do not visit the "succ" projection yet, but instead attach "dead" projection to HaltNode already once we are done visiting everything above. Later, we come from the peeled loop's NeverBranch exit, to the "succ" projection of the peeled iteration's NeverBranch, and attach the "succ" projection. ![image](https://user-images.githubusercontent.com/32593061/205299027-0e8e1d46-a49c-48c6-81b4-dfe83d8236ec.png) ------------- Commit messages: - replace tabs with spaces - 8296389: C2: PhaseCFG::convert_NeverBranch_to_Goto must handle both orders of successors Changes: https://git.openjdk.org/jdk/pull/11481/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11481&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296389 Stats: 171 lines in 3 files changed: 167 ins; 1 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/11481.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11481/head:pull/11481 PR: https://git.openjdk.org/jdk/pull/11481 From jbhateja at openjdk.org Fri Dec 2 18:53:15 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 2 Dec 2022 18:53:15 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs In-Reply-To: References: Message-ID: <-4JtTu022_n66y7cn4H6Rz1jJmAPINR8bKfKC1B0zBw=.d236cb41-b96e-4573-ab8b-f4bd7fa8c507@github.com> On Fri, 2 Dec 2022 04:22:39 GMT, Smita Kamath wrote: > Hi All, > > I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. 
> Following are the performance numbers of JMH micro Fp16ConversionBenchmark:
> Before code changes:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms
>
> After:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms
>
> Kindly review and share your feedback.
>
> Thanks.
> Smita

test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java line 53:

> 51: }
> 52:
> 53: @Test

New IR node checking annotations missing.

------------- PR: https://git.openjdk.org/jdk/pull/11471 From jbhateja at openjdk.org Fri Dec 2 18:57:16 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 2 Dec 2022 18:57:16 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs In-Reply-To: References: Message-ID: On Fri, 2 Dec 2022 04:22:39 GMT, Smita Kamath wrote:

> Hi All,
>
> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's.
> Following are the performance numbers of JMH micro Fp16ConversionBenchmark:
> Before code changes:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms
>
> After:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms
>
> Kindly review and share your feedback.
>
> Thanks.
> Smita

src/hotspot/share/opto/vectornode.hpp line 1630:

> 1628: };
> 1629:
> 1630: class HF2FVNode : public VectorNode {

You may use same naming convention as used for other vector casting IR nodes VectorCastH2F and F2H

------------- PR: https://git.openjdk.org/jdk/pull/11471 From jbhateja at openjdk.org Fri Dec 2 19:24:08 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 2 Dec 2022 19:24:08 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs In-Reply-To: References: Message-ID: On Fri, 2 Dec 2022 04:22:39 GMT, Smita Kamath wrote:

> Hi All,
>
> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's.
> Following are the performance numbers of JMH micro Fp16ConversionBenchmark:
> Before code changes:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms
>
> After:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms
>
> Kindly review and share your feedback.
>
> Thanks.
> Smita

src/hotspot/cpu/x86/x86.ad line 3684:

> 3682: %}
> 3683:
> 3684: instruct vconvF2HF(vec dst, vec src) %{

We do have a destination memory flavour of VCVTPS2PH, adding a memory pattern will fold subsequent store in one instruction.

------------- PR: https://git.openjdk.org/jdk/pull/11471 From sviswanathan at openjdk.org Fri Dec 2 21:40:17 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 2 Dec 2022 21:40:17 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs In-Reply-To: References: Message-ID: On Fri, 2 Dec 2022 04:22:39 GMT, Smita Kamath wrote:

> Hi All,
>
> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's.
> Following are the performance numbers of JMH micro Fp16ConversionBenchmark:
> Before code changes:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms
>
> After:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms
>
> Kindly review and share your feedback.
>
> Thanks.
> Smita

src/hotspot/cpu/x86/x86.ad line 1688:

> 1686: case Op_HF2FV:
> 1687: case Op_F2HFV:
> 1688: if (!VM_Version::supports_f16c() && !VM_Version::supports_avx512vl()) {

We need different check for vector flavors (HF2FV/F2HV) vs the scalar flavors (ConvF2HF/ConvHF2F). The check needed for vector flavors is:

    if (!VM_Version::supports_f16c() && !VM_Version::supports_avx512()) {
      return false;
    }

Also in vm_version_x86.cpp, the F16C features should be disabled when UseAVX is set to 0, i.e. the following

    if (UseAVX < 1) {
      _features &= ~CPU_AVX;
      _features &= ~CPU_VZEROUPPER;
    }

should be updated to:

    if (UseAVX < 1) {
      _features &= ~CPU_AVX;
      _features &= ~CPU_VZEROUPPER;
      _features &= ~CPU_F16C;
    }

src/hotspot/cpu/x86/x86.ad line 2002:

> 2000: return false;
> 2001: }
> 2002: break;

This can be removed as match_rule_supported() has previously happened.

src/hotspot/cpu/x86/x86.ad line 3710:

> 3708: int src_size = Matcher::vector_length_in_bytes(this, $src);
> 3709: int dst_size = src_size * 2;
> 3710: int vlen_enc = vector_length_encoding(dst_size);

This could now be changed to:

    int vlen_enc = Matcher::vector_length_encoding(this);

------------- PR: https://git.openjdk.org/jdk/pull/11471 From dcubed at openjdk.org Fri Dec 2 22:30:07 2022 From: dcubed at openjdk.org (Daniel D.
Daugherty) Date: Fri, 2 Dec 2022 22:30:07 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 22:55:40 GMT, Daniel D. Daugherty wrote: > Misc stress testing related fixes: > > [JDK-8295424](https://bugs.openjdk.org/browse/JDK-8295424) adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest > [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) disable TestRedirectLinks.java in slowdebug mode > [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) disable Fuzz.java in slowdebug mode My jdk-20+26 stress run #1 includes these patches (as have stress runs for several earlier jdk-20 promotions) and there's no signs of issues with these patches on either linux-x64 or macosx-aarch64 stress testing. ------------- PR: https://git.openjdk.org/jdk/pull/11278 From fyang at openjdk.org Sat Dec 3 05:44:31 2022 From: fyang at openjdk.org (Fei Yang) Date: Sat, 3 Dec 2022 05:44:31 GMT Subject: RFR: 8298055: AArch64: fastdebug build fails after JDK-8247645 Message-ID: Please review this trivial change fixing a fastdebug build failure due to warnings being treated as errors after JDK-8247645. Testing: fastdebug build fine with this change on linux-aarch64 platform. 
------------- Commit messages: - 8298055: AArch64: fastdebug build fails after JDK-8247645 Changes: https://git.openjdk.org/jdk/pull/11496/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11496&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298055 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11496.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11496/head:pull/11496 PR: https://git.openjdk.org/jdk/pull/11496 From jiefu at openjdk.org Sat Dec 3 08:48:53 2022 From: jiefu at openjdk.org (Jie Fu) Date: Sat, 3 Dec 2022 08:48:53 GMT Subject: RFR: 8298055: AArch64: fastdebug build fails after JDK-8247645 In-Reply-To: References: Message-ID: On Sat, 3 Dec 2022 05:35:51 GMT, Fei Yang wrote: > Please review this trivial change fixing a fastdebug build failure due to warnings being treated as errors after JDK-8247645. > > Testing: fastdebug build fine with this change on linux-aarch64 platform. src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2438: > 2436: default: > 2437: ShouldNotReachHere(); > 2438: Rm = 0; // unreachable Instead of adding this line, how about initiating `Rm` at declaration like `int Rm = 0;`? ------------- PR: https://git.openjdk.org/jdk/pull/11496 From aph at openjdk.org Sat Dec 3 10:23:01 2022 From: aph at openjdk.org (Andrew Haley) Date: Sat, 3 Dec 2022 10:23:01 GMT Subject: RFR: 8298055: AArch64: fastdebug build fails after JDK-8247645 In-Reply-To: References: Message-ID: On Sat, 3 Dec 2022 05:35:51 GMT, Fei Yang wrote: > Please review this trivial change fixing a fastdebug build failure due to warnings being treated as errors after JDK-8247645. > > Testing: fastdebug build fine with this change on linux-aarch64 platform. Marked as reviewed by aph (Reviewer). 
------------- PR: https://git.openjdk.org/jdk/pull/11496 From aph at openjdk.org Sat Dec 3 10:23:02 2022 From: aph at openjdk.org (Andrew Haley) Date: Sat, 3 Dec 2022 10:23:02 GMT Subject: RFR: 8298055: AArch64: fastdebug build fails after JDK-8247645 In-Reply-To: References: Message-ID: On Sat, 3 Dec 2022 08:46:48 GMT, Jie Fu wrote: >> Please review this trivial change fixing a fastdebug build failure due to warnings being treated as errors after JDK-8247645. >> >> Testing: fastdebug build fine with this change on linux-aarch64 platform. > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2438: > >> 2436: default: >> 2437: ShouldNotReachHere(); >> 2438: Rm = 0; // unreachable > > Instead of adding this line, how about initiating `Rm` at declaration like `int Rm = 0;`? This is how it's done elsewhere in the port. Sucks, but we're stuck with it until we find a better solution for warnings. ------------- PR: https://git.openjdk.org/jdk/pull/11501 From dcubed at openjdk.org Sun Dec 4 16:37:49 2022 From: dcubed at openjdk.org (Daniel D.
Daugherty) Date: Sun, 4 Dec 2022 16:37:49 GMT Subject: RFR: 8298068: ProblemList tests failing due to JDK-8297235 In-Reply-To: <5_3Upe1YQsdb_x-98v_3CcDPNSxPXR4EM8bN5IyHEow=.9f7b8ff4-6d19-422f-a575-22935d606f62@github.com> References: <5_3Upe1YQsdb_x-98v_3CcDPNSxPXR4EM8bN5IyHEow=.9f7b8ff4-6d19-422f-a575-22935d606f62@github.com> Message-ID: On Sun, 4 Dec 2022 16:26:06 GMT, Alexander Zvegintsev wrote: >> Batch of ProblemListings to reduce the noise in the JDK20 CI: >> >> [JDK-8298068](https://bugs.openjdk.org/browse/JDK-8298068) ProblemList tests failing due to JDK-8297235 >> [JDK-8298070](https://bugs.openjdk.org/browse/JDK-8298070) ProblemList jdk/internal/vm/Continuation/Fuzz.java#default with ZGC on X64 >> [JDK-8298071](https://bugs.openjdk.org/browse/JDK-8298071) ProblemList tests failing due to JDK-8298059 >> [JDK-8298072](https://bugs.openjdk.org/browse/JDK-8298072) ProblemList compiler/c1/TestPrintC1Statistics.java in Xcomp mode on linux-aarch64 > > Marked as reviewed by azvegint (Reviewer). @azvegint - Thanks for the lightning fast review. Especially for a Sunday AM. You reviewed before the bots were done with all their tweaking... ------------- PR: https://git.openjdk.org/jdk/pull/11501 From dcubed at openjdk.org Sun Dec 4 16:40:56 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Sun, 4 Dec 2022 16:40:56 GMT Subject: Integrated: 8298068: ProblemList tests failing due to JDK-8297235 In-Reply-To: References: Message-ID: <6c_ydji5kqiZTYV9xa6PmZf1OMnpXjelRgOuErhP7Sg=.c6b7256b-2839-4991-817a-bb373e42b381@github.com> On Sun, 4 Dec 2022 16:15:56 GMT, Daniel D. 
Daugherty wrote: > Batch of ProblemListings to reduce the noise in the JDK20 CI: > > [JDK-8298068](https://bugs.openjdk.org/browse/JDK-8298068) ProblemList tests failing due to JDK-8297235 > [JDK-8298070](https://bugs.openjdk.org/browse/JDK-8298070) ProblemList jdk/internal/vm/Continuation/Fuzz.java#default with ZGC on X64 > [JDK-8298071](https://bugs.openjdk.org/browse/JDK-8298071) ProblemList tests failing due to JDK-8298059 > [JDK-8298072](https://bugs.openjdk.org/browse/JDK-8298072) ProblemList compiler/c1/TestPrintC1Statistics.java in Xcomp mode on linux-aarch64 This pull request has now been integrated. Changeset: 87572d43 Author: Daniel D. Daugherty URL: https://git.openjdk.org/jdk/commit/87572d43befd7d868489ba0a2cfefad5cd605ef3 Stats: 73 lines in 3 files changed: 72 ins; 0 del; 1 mod 8298068: ProblemList tests failing due to JDK-8297235 8298070: ProblemList jdk/internal/vm/Continuation/Fuzz.java#default with ZGC on X64 8298071: ProblemList tests failing due to JDK-8298059 8298072: ProblemList compiler/c1/TestPrintC1Statistics.java in Xcomp mode on linux-aarch64 Reviewed-by: azvegint ------------- PR: https://git.openjdk.org/jdk/pull/11501 From haosun at openjdk.org Mon Dec 5 00:40:30 2022 From: haosun at openjdk.org (Hao Sun) Date: Mon, 5 Dec 2022 00:40:30 GMT Subject: RFR: 8298055: AArch64: fastdebug build fails after JDK-8247645 In-Reply-To: References: Message-ID: On Sat, 3 Dec 2022 05:35:51 GMT, Fei Yang wrote: > Please review this trivial change fixing a fastdebug build failure due to warnings being treated as errors after JDK-8247645. > > Testing: fastdebug build fine with this change on linux-aarch64 platform. LGTM. (I'm not a Reviewer) ------------- Marked as reviewed by haosun (Author). 
PR: https://git.openjdk.org/jdk/pull/11496 From pli at openjdk.org Mon Dec 5 01:58:58 2022 From: pli at openjdk.org (Pengfei Li) Date: Mon, 5 Dec 2022 01:58:58 GMT Subject: RFR: 8297689: Fix incorrect result of Short.reverseBytes() call in loops In-Reply-To: References: Message-ID: <7m440EBF_ieeg_F2949KCwkMZ1Eicv0EppR7l6in--M=.874eb130-b6ff-46fe-90c5-15c848f9c369@github.com> On Wed, 30 Nov 2022 07:20:11 GMT, Pengfei Li wrote: > Recently, we find calling `Short.reverseBytes()` in loops may generate incorrect result if the code is compiled by C2. Below is a simple case to reproduce. > > > class Foo { > static final int SIZE = 50; > static int a[] = new int[SIZE]; > > static void test() { > for (int i = 0; i < SIZE; i++) { > a[i] = Short.reverseBytes((short) a[i]); > } > } > > public static void main(String[] args) throws Exception { > Class.forName("java.lang.Short"); > a[25] = 16; > test(); > System.out.println(a[25]); > } > } > > // $ java -Xint Foo > // 4096 > // $ java -Xcomp -XX:-TieredCompilation -XX:CompileOnly=Foo.test Foo > // 268435456 > > > In this case, the `reverseBytes()` call is intrinsified and transformed into a `ReverseBytesS` node. But then C2 compiler incorrectly vectorizes it into `ReverseBytesV` with int type. C2 `Op_ReverseBytes*` has short, char, int and long versions. Their behaviors are different for different data sizes. In superword, subword operation itself doesn't have precise data size info. Instead, the data size info comes from memory operations in its use-def chain. Hence, vectorization of `reverseBytes()` is valid only if the data size is consistent with the type size of the caller's class. But current C2 compiler code lacks fine-grained type checks for `ReverseBytes*` in vector transformation. It results in `reverseBytes()` call from Short or Character class with int load/store gets vectorized incorrectly in above case. > > To fix the issue, this patch adds more checks in `VectorNode::opcode()`. 
T_BYTE is a special case for `Op_ReverseBytes*`. As the Java Byte class doesn't have `reverseBytes()` method so there's no `Op_ReverseBytesB`. But T_BYTE may still appear in VectorAPI calls. In this patch we still use `Op_ReverseBytesI` for T_BYTE to ensure vector intrinsification succeeds. > > Tested with hotspot::hotspot_all_no_apps, jdk tier1~3 and langtools tier1 on x86 and AArch64, no issue is found. @jatin-bhateja Do you have any comments on this change? ------------- PR: https://git.openjdk.org/jdk/pull/11427 From fyang at openjdk.org Mon Dec 5 03:42:17 2022 From: fyang at openjdk.org (Fei Yang) Date: Mon, 5 Dec 2022 03:42:17 GMT Subject: RFR: 8298055: AArch64: fastdebug build fails after JDK-8247645 In-Reply-To: References: Message-ID: On Sat, 3 Dec 2022 05:35:51 GMT, Fei Yang wrote: > Please review this trivial change fixing a fastdebug build failure due to warnings being treated as errors after JDK-8247645. > > Testing: fastdebug build fine with this change on linux-aarch64 platform. Thanks all for the review. ------------- PR: https://git.openjdk.org/jdk/pull/11496 From fyang at openjdk.org Mon Dec 5 03:43:58 2022 From: fyang at openjdk.org (Fei Yang) Date: Mon, 5 Dec 2022 03:43:58 GMT Subject: Integrated: 8298055: AArch64: fastdebug build fails after JDK-8247645 In-Reply-To: References: Message-ID: On Sat, 3 Dec 2022 05:35:51 GMT, Fei Yang wrote: > Please review this trivial change fixing a fastdebug build failure due to warnings being treated as errors after JDK-8247645. > > Testing: fastdebug build fine with this change on linux-aarch64 platform. This pull request has now been integrated. 
Changeset: b49fd920 Author: Fei Yang URL: https://git.openjdk.org/jdk/commit/b49fd920b6690a8b828c85e45c10e5c4c54d2022 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8298055: AArch64: fastdebug build fails after JDK-8247645 Reviewed-by: aph, haosun ------------- PR: https://git.openjdk.org/jdk/pull/11496 From haosun at openjdk.org Mon Dec 5 06:15:54 2022 From: haosun at openjdk.org (Hao Sun) Date: Mon, 5 Dec 2022 06:15:54 GMT Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 11:43:50 GMT, Andrew Haley wrote: > Try this one: > > ``` > @Benchmark > public int compareUnsignedDirect(Blackhole bh) { > int probe1 = seed, probe2 = seed ^ seed << 5; > int sum = 0; > for (int i = 0; i < size; i++) { > probe1 ^= probe1 << 13; > sum += Integer.compareUnsigned(probe1, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */; probe2 ^= probe2 << 13; > probe1 ^= probe1 >>> 17; sum += Integer.compareUnsigned(probe2, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */; > sum += Integer.compareUnsigned(probe1, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */; probe2 ^= probe2 >>> 17; > probe1 ^= probe1 << 5; sum += Integer.compareUnsigned(probe2, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */; > sum += Integer.compareUnsigned(probe1, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */; probe2 ^= probe2 << 5; > sum += Integer.compareUnsigned(probe2, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */; > } > seed = probe2 + probe1; > return sum; > } > ``` > > Does that help? Thanks for your comment. 
Regarding your example, I made minor updates, shown below:

    diff --git a/test/micro/org/openjdk/bench/java/lang/Integers.java b/test/micro/org/openjdk/bench/java/lang/Integers.java
    index 43ceb5d18d2..5ecbee26cab 100644
    --- a/test/micro/org/openjdk/bench/java/lang/Integers.java
    +++ b/test/micro/org/openjdk/bench/java/lang/Integers.java
    @@ -167,6 +167,52 @@ public class Integers {
             }
         }

    +    @Benchmark
    +    public void compareUnsignedIndirect3(Blackhole bh) {
    +        int seed = intsBig[0];
    +        int probe1 = seed;
    +        int probe2 = seed ^ seed << 5;
    +        int r = 0;
    +        for (int i = 0; i < size; i++) {
    +            probe1 ^= probe1 << 13;
    +            r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE) <= 0 ? 1 : 0;
    +            probe2 ^= probe2 << 13;
    +            probe1 ^= probe1 >>> 17;
    +            r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE) <= 0 ? 1 : 0;
    +            r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE) <= 0 ? 1 : 0;
    +            probe2 ^= probe2 >>> 17;
    +            probe1 ^= probe1 << 5;
    +            r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE) <= 0 ? 1 : 0;
    +            r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE) <= 0 ? 1 : 0;
    +            probe2 ^= probe2 << 5;
    +            r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE) <= 0 ? 1 : 0;
    +        }
    +        bh.consume(r);
    +    }
    +
    +    @Benchmark
    +    public void compareUnsignedDirect3(Blackhole bh) {
    +        int seed = intsBig[0];
    +        int probe1 = seed;
    +        int probe2 = seed ^ seed << 5;
    +        int r = 0;
    +        for (int i = 0; i < size; i++) {
    +            probe1 ^= probe1 << 13;
    +            r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE);
    +            probe2 ^= probe2 << 13;
    +            probe1 ^= probe1 >>> 17;
    +            r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE);
    +            r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE);
    +            probe2 ^= probe2 >>> 17;
    +            probe1 ^= probe1 << 5;
    +            r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE);
    +            r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE);
    +            probe2 ^= probe2 << 5;
    +            r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE);
    +        }
    +        bh.consume(r);
    +    }
    +
         @Benchmark
         public void reverseBytes() {
             for (int i = 0; i < size; i++) {

Here is the performance data of `compareUnsignedDirect3` on aarch64.

    Integer.MAX_VALUE
                               Before           After            Unit
    compareUnsignedDirect3     2.986 ± 0.019    1.858 ± 0.008    us/op

We can see that using this intrinsic can introduce some performance uplift.

Regarding `Integer.compareUnsigned(XX, YY)`, we should use random values for both `XX` and `YY` in order to expose the advantage of branch prediction. Hence, I made the following two versions of updates.

**Version-2**: Using another constant value for `YY`, e.g., `-1`.

    @@ -198,17 +198,17 @@ public class Integers {
             int r = 0;
             for (int i = 0; i < size; i++) {
                 probe1 ^= probe1 << 13;
    -            r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE);
    +            r += Integer.compareUnsigned(probe1, -1);
                 probe2 ^= probe2 << 13;
                 probe1 ^= probe1 >>> 17;
    -            r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE);
    -            r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE);
    +            r += Integer.compareUnsigned(probe2, -1);
    +            r += Integer.compareUnsigned(probe1, -1);
                 probe2 ^= probe2 >>> 17;
                 probe1 ^= probe1 << 5;
    -            r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE);
    -            r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE);
    +            r += Integer.compareUnsigned(probe2, -1);
    +            r += Integer.compareUnsigned(probe1, -1);
                 probe2 ^= probe2 << 5;
    -            r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE);
    +            r += Integer.compareUnsigned(probe2, -1);
             }
             bh.consume(r);
         }

**Version-3**: Using some random value for `YY`, e.g., `probe1, probe2, r`.

    @@ -198,17 +198,17 @@ public class Integers {
             int r = 0;
             for (int i = 0; i < size; i++) {
                 probe1 ^= probe1 << 13;
    -            r += Integer.compareUnsigned(probe1, -1);
    +            r += Integer.compareUnsigned(probe1, r);
                 probe2 ^= probe2 << 13;
                 probe1 ^= probe1 >>> 17;
    -            r += Integer.compareUnsigned(probe2, -1);
    -            r += Integer.compareUnsigned(probe1, -1);
    +            r += Integer.compareUnsigned(probe2, probe1);
    +            r += Integer.compareUnsigned(probe1, r);
                 probe2 ^= probe2 >>> 17;
                 probe1 ^= probe1 << 5;
    -            r += Integer.compareUnsigned(probe2, -1);
    -            r += Integer.compareUnsigned(probe1, -1);
    +            r += Integer.compareUnsigned(probe2, probe1);
    +            r += Integer.compareUnsigned(probe1, r);
                 probe2 ^= probe2 << 5;
    -            r += Integer.compareUnsigned(probe2, -1);
    +            r += Integer.compareUnsigned(probe2, probe1);
             }
             bh.consume(r);
         }

Here is all the performance data on aarch64.

    Integer.MAX_VALUE (version-1)
    --------------------------------------------------------------------
                               Before           After            Unit
    compareUnsignedDirect2     2.986 ± 0.019    1.858 ± 0.008    us/op

    -1 (version-2)
    --------------------------------------------------------------------
                               Before           After            Unit
    compareUnsignedDirect2     1.384 ± 0.002    1.858 ± 0.002    us/op

    probe1,probe2,r (version-3)
    --------------------------------------------------------------------
                               Before           After            Unit
    compareUnsignedDirect2     2.488 ± 0.076    2.517 ± 0.013    us/op

Note that for **version-1 and version-2**:
1) **After** column: the data is nearly the same when using the intrinsic, as our `cmp; cset; cneg` sequence is predictable.
2) **Before** column: the data differs a lot. IMO, the advantage of branch prediction is most utilized for **version-2**, since `-1` will be treated as the largest unsigned value, and most comparisons can be predicted.

Note that for **version-3** (i.e. both `XX` and `YY` are random values):
1) **Before** column: the data may vary a bit for different runs. Sometimes it is slightly bigger than the **After** column, that is, using the intrinsic can introduce a performance uplift. Sometimes it is slightly smaller than the **After** column, that is, using the intrinsic can introduce a performance regression.
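The point about `-1` being treated as the largest unsigned value can be checked with a small standalone Java sketch. The class name, the helper `countLessOrEqual`, and the sample values are made up for illustration; only `Integer.compare` and `Integer.compareUnsigned` are real JDK methods.

```java
// Illustrative sketch: how the constant second operand skews Integer.compareUnsigned results.
final class CompareUnsignedSketch {

    // Counts how many samples are <= y when both sides are interpreted as unsigned ints.
    static int countLessOrEqual(int[] xs, int y) {
        int n = 0;
        for (int x : xs) {
            if (Integer.compareUnsigned(x, y) <= 0) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, Integer.MAX_VALUE, Integer.MIN_VALUE, -2, -1};

        // Signed view: -1 is smaller than 1. Unsigned view: -1 is 0xFFFFFFFF, the maximum.
        System.out.println(Integer.compare(-1, 1) < 0);
        System.out.println(Integer.compareUnsigned(-1, 1) > 0);

        // Against -1 every sample compares <= 0, so the outcome never changes.
        System.out.println(countLessOrEqual(samples, -1));
        // Against Integer.MAX_VALUE only the non-negative half of the samples does.
        System.out.println(countLessOrEqual(samples, Integer.MAX_VALUE));
    }
}
```

With a second operand of `-1` the comparison outcome is the same for every input, which is consistent with the non-intrinsic (branching) code being easy to predict in version-2, while `Integer.MAX_VALUE` splits the inputs roughly in half.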
------------- PR: https://git.openjdk.org/jdk/pull/11383 From chagedorn at openjdk.org Mon Dec 5 07:13:00 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 5 Dec 2022 07:13:00 GMT Subject: Integrated: 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 09:37:57 GMT, Christian Hagedorn wrote: > ![image](https://user-images.githubusercontent.com/17833009/205015579-1b6ac082-e992-4828-80f7-ff991964179b.png) > > During CCP, we optimize the type of `348 CastII` in `CastIINode::Value()`: It matches the `CmpI/If` pattern because the current type of `119 Phi` is a constant int: > https://github.com/openjdk/jdk/blob/9f24a6f43c6a5e1fa92275e0a87af4f1f0603ba3/src/hotspot/share/opto/castnode.cpp#L213-L215 > > Later in CCP, the type of `119 Phi` is updated and is no longer a constant but `348 CastII` is not processed anymore during CCP and keeps its wrong too narrow type. We apply more loop opts and at some point, the `CastII` is replaced by top because the input type is outside of the wrong type range of the `CastII`. Some data nodes are folded and the graph is left in a broken state and we assert during GCM. > > I propose to add a `CastII` node back to the CCP worklist if we find such a `Cmp/If` pattern to ensure that the `CastII` type is correctly set during CCP. > > Thanks, > Christian This pull request has now been integrated. 
Changeset: a5739239 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/a57392390b0abe5db496775efcc369bafdf420f1 Stats: 81 lines in 3 files changed: 81 ins; 0 del; 0 mod 8297264: C2: Cast node is not processed again in CCP and keeps a wrong too narrow type which is later replaced by top Reviewed-by: thartmann, rcastanedalo, kvn ------------- PR: https://git.openjdk.org/jdk/pull/11448 From shade at openjdk.org Mon Dec 5 07:27:27 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 5 Dec 2022 07:27:27 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v7] In-Reply-To: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: <2KN5cNitfcJDVq8qoZZuLAC5iI_H7q9wUnRXaIG7yS4=.febb959b-b25e-4bc2-b430-fa055eb4e3b5@github.com> > If you look at generated code for the JMH benchmark like: > > > public class ArrayRead { > @Param({"1", "100", "10000", "1000000"}) > int size; > > int[] is; > > @Setup > public void setup() { > is = new int[size]; > for (int c = 0; c < size; c++) { > is[c] = c; > } > } > > @Benchmark > public void test(Blackhole bh) { > for (int i = 0; i < is.length; i++) { > bh.consume(is[i]); > } > } > } > > > ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop. > > This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible. 
> > We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only "prevent dead code elimination" part, as minimally required by blackhole semantics. > > Motivational improvements on the test above: > > > Benchmark (size) Mode Cnt Score Error Units > > # Before, full Java blackholes > ArrayRead.test 1 avgt 9 5.422 ? 0.023 ns/op > ArrayRead.test 100 avgt 9 460.619 ? 0.421 ns/op > ArrayRead.test 10000 avgt 9 44697.909 ? 1964.787 ns/op > ArrayRead.test 1000000 avgt 9 4332723.304 ? 2791.324 ns/op > > # Before, compiler blackholes > ArrayRead.test 1 avgt 9 1.791 ? 0.007 ns/op > ArrayRead.test 100 avgt 9 114.103 ? 1.677 ns/op > ArrayRead.test 10000 avgt 9 8528.544 ? 52.010 ns/op > ArrayRead.test 1000000 avgt 9 1005139.070 ? 2883.011 ns/op > > # After, compiler blackholes > ArrayRead.test 1 avgt 9 1.686 ? 0.006 ns/op ; ~1.1x better > ArrayRead.test 100 avgt 9 16.249 ? 0.019 ns/op ; ~7.0x better > ArrayRead.test 10000 avgt 9 1375.265 ? 2.420 ns/op ; ~6.2x better > ArrayRead.test 1000000 avgt 9 136862.574 ? 1057.100 ns/op ; ~7.3x better > > > `-prof perfasm` shows the reason for these improvements clearly: > > Before: > > > ? 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1 > 7.97% ? 0x00007f0b54498365: cmp %edx,%r11d > 1.27% ? 0x00007f0b54498368: jae 0x00007f0b5449838f > ? 0x00007f0b5449836a: shl $0x3,%r10 > 0.03% ? 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]" > 7.76% ? 0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is" > 0.24% ? 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1 > 17.48% ? 0x00007f0b5449837e: inc %r11d ; i++ > 0.17% ? 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2 > 53.26% ? 0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check > 4.84% ? 0x00007f0b54498388: cmp %edx,%r11d > 0.31% ? 0x00007f0b5449838b: jl 0x00007f0b54498360 > > > After: > > > > ? 0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read > 19.66% ? 
0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx > 0.14% ? 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx > 22.09% ? 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx > 20.19% ? 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx > 0.04% ? 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx > 24.02% ? 0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8d8: add $0x8,%r10d ; i += 8 > ? 0x00007fa06c49a8dc: cmp %esi,%r10d > 0.07% ? 0x00007fa06c49a8df: jl 0x00007fa06c49a8b0 > > > Additional testing: > - [x] Eyeballing JMH Samples `-prof perfasm` > - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole` > - [x] Linux x86_64 fastdebug, JDK benchmark corpus Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: - Merge branch 'master' into JDK-8296545-blackhole-effects - Merge branch 'master' into JDK-8296545-blackhole-effects - Merge branch 'master' into JDK-8296545-blackhole-effects - Add comment in cfgnode.hpp - Blackhole as CFG node - Merge branch 'master' into JDK-8296545-blackhole-effects - Blackhole should be AliasIdxTop - Do not touch memory at all - Fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11041/files - new: https://git.openjdk.org/jdk/pull/11041/files/fd2aea6b..e847a8b5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11041&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11041&range=05-06 Stats: 19010 lines in 309 files changed: 4666 ins; 13406 del; 938 mod Patch: https://git.openjdk.org/jdk/pull/11041.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11041/head:pull/11041 PR: https://git.openjdk.org/jdk/pull/11041 From haosun at openjdk.org Mon Dec 5 07:34:11 2022 From: haosun at openjdk.org (Hao Sun) Date: Mon, 5 Dec 2022 
07:34:11 GMT Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 17:31:36 GMT, Quan Anh Mai wrote: > The motivation for these intrinsics, aside from unsigned comparison being a fairly basic operation, is for range checks of a load/store, which has the form of `0 <= offset && offset <= length - size`, by transforming this into `0 <= length - size && offset u<= length - size`, the first comparison as well as the computation of `length - size` can be hoisted out of the loop, which results in a single operation being loop varying. The reason this may not be recognised effectively by the idealiser is that if `size` is a constant, `(length - size) + MIN_VALUE` being folded into `length + (MIN_VALUE - size)`, breaks the pattern. > @merykitty Thanks for your explanation. But I'm afraid I didn't fully get your point. I think `offset u<= length - size` will be matched with `CmpU` node, rather than `CmpU3` node, right? > @shqking I think your benchmark is not good, as the error is higher than the actual difference between samples, and the cost of the division may swallow everything inside the loop. > How about the case as suggested by aph, i.e. shown in the previous comment? Thanks~ ------------- PR: https://git.openjdk.org/jdk/pull/11383 From fyang at openjdk.org Mon Dec 5 08:49:33 2022 From: fyang at openjdk.org (Fei Yang) Date: Mon, 5 Dec 2022 08:49:33 GMT Subject: RFR: 8298088: RISC-V: Make Address a discriminated union internally Message-ID: The RISC-V port has the same issue as: https://bugs.openjdk.org/browse/JDK-8297830 It looks to me that the fix for the AArch64 port is a nice refactoring work. This fixes this issue for the RISC-V port with a similar approach. Testing: Tier1 tested with release build on linux-riscv64 unmatched board. Run non-trivial benchmark workloads with fastdebug builds. 
------------- Commit messages: - 8298088: RISC-V: Make Address a discriminated union internally Changes: https://git.openjdk.org/jdk/pull/11505/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11505&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298088 Stats: 142 lines in 2 files changed: 93 ins; 11 del; 38 mod Patch: https://git.openjdk.org/jdk/pull/11505.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11505/head:pull/11505 PR: https://git.openjdk.org/jdk/pull/11505 From qamai at openjdk.org Mon Dec 5 09:17:05 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 5 Dec 2022 09:17:05 GMT Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long In-Reply-To: References: Message-ID: On Mon, 5 Dec 2022 07:30:08 GMT, Hao Sun wrote: >> The motivation for these intrinsics, aside from unsigned comparison being a fairly basic operation, is for range checks of a load/store, which has the form of `0 <= offset && offset <= length - size`, by transforming this into `0 <= length - size && offset u<= length - size`, the first comparison as well as the computation of `length - size` can be hoisted out of the loop, which results in a single operation being loop varying. The reason this may not be recognised effectively by the idealiser is that if `size` is a constant, `(length - size) + MIN_VALUE` being folded into `length + (MIN_VALUE - size)`, breaks the pattern. >> >> If we simply want to throw an exception in out-of-bound cases, then `Precondition::checkIndex` may suffice. This however may not be adequate if: >> >> - We want to do something else. If the hardware does not support masked load, currently we do a load followed by a blend if the whole vector is inbound and fall back out of intrinsic otherwise. >> - The bound is not provably loop-invariant, and not obviously non-negative. 
This may arise in `ArrayList` accesses, where bound checks are performed against the `size` field, which may need to be reloaded on each iteration and not obviously nonnegative to the compiler. >> >> IMO the direct result of the method is less important, because the contract does not have any promise with respect to the exact return value, and the only thing that can be done with it is to compare it with 0, which will certainly be folded into a `CmpU` node. >> >> @shqking I think your benchmark is not good, as the error is higher than the actual difference between samples, and the cost of the division may swallow everything inside the loop. >> >> Thanks a lot. > >> The motivation for these intrinsics, aside from unsigned comparison being a fairly basic operation, is for range checks of a load/store, which has the form of `0 <= offset && offset <= length - size`, by transforming this into `0 <= length - size && offset u<= length - size`, the first comparison as well as the computation of `length - size` can be hoisted out of the loop, which results in a single operation being loop varying. The reason this may not be recognised effectively by the idealiser is that if `size` is a constant, `(length - size) + MIN_VALUE` being folded into `length + (MIN_VALUE - size)`, breaks the pattern. >> > @merykitty Thanks for your explanation. > But I'm afraid I didn't fully get your point. > I think `offset u<= length - size` will be matched with `CmpU` node, rather than `CmpU3` node, right? > >> @shqking I think your benchmark is not good, as the error is higher than the actual difference between samples, and the cost of the division may swallow everything inside the loop. >> > How about the case as suggested by aph, i.e. shown in the previous comment? > > Thanks~ @shqking > But I'm afraid I didn't fully get your point. I think offset u<= length - size will be matched with CmpU node, rather than CmpU3 node, right? 
`CmpU3` serves the same purpose as `CmpL3` or `CmpD3`, that is, to be an intermediate node that is elided immediately. For example, this piece of code:

    static int square(long x, long y) {
        return x > y ? 1 : 0;
    }

gets compiled into:

    static int square(long, long);
         0: lload_0
         1: lload_2
         2: lcmp
         3: ifle          10
         6: iconst_1
         7: goto          11
        10: iconst_0
        11: ireturn

`lcmp` is parsed into a `CmpL3`, so the comparison part is parsed as `CmpI (CmpL3 x y) 0`, which is then transformed into `CmpL x y`.

Similarly, `Integer.compareUnsigned(x, y) > 0` is parsed as `CmpI (CmpU3 x y) 0`, which is then transformed into `CmpU x y`.

> How about the case as suggested by aph, i.e. shown in the previous comment?

I think it is better, but I still think measuring the performance of the `CmpU3` node is quite irrelevant.

Thanks.

-------------

PR: https://git.openjdk.org/jdk/pull/11383

From aph at openjdk.org Mon Dec 5 11:28:43 2022
From: aph at openjdk.org (Andrew Haley)
Date: Mon, 5 Dec 2022 11:28:43 GMT
Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long
In-Reply-To:
References:
Message-ID:

On Mon, 28 Nov 2022 02:31:25 GMT, Hao Sun wrote:

> x86 implemented the intrinsics for compareUnsigned() method in Integer and Long. See JDK-8283726. We add the corresponding AArch64 backend support in this patch.
>
> Note-1: minor style issues are fixed for CmpL3 related rules.
>
> Note-2: Jtreg case TestCompareUnsigned.java is updated to cover the matching rules for "comparing reg with imm" case.
>
> Testing: tier1~3 passed on Linux/AArch64 platform with no new failures.
>
> Following is the performance data for the JMH case:
>
>
>                                                       Before           After
> Benchmark                         (size)  Mode  Cnt   Score   Error    Score   Error   Units
> Integers.compareUnsignedDirect       500  avgt    5   0.994 ± 0.001    0.872 ± 0.015   us/op
> Integers.compareUnsignedIndirect     500  avgt    5   0.991 ± 0.001    0.833 ± 0.055   us/op
> Longs.compareUnsignedDirect          500  avgt    5   1.052 ± 0.001    0.974 ± 0.057   us/op
> Longs.compareUnsignedIndirect        500  avgt    5   1.053 ± 0.001    0.916 ± 0.038   us/op

The problem as I see it: the intrinsic results in worse performance for the 3-way case if the result is highly predictable.
And, in many cases such as bounds checking, the result will surely be highly predictable.
Therefore, on average, using this intrinsic for the 3-way case may not improve things and could make them worse.

Having said all of that, I believe that directly using the 3-way result may be so rare that we don't need to care about it. All that we really should optimize is the `Integer.compareUnsigned(x,y) cmp 0`. As long as that doesn't regress, I'm happy.

I won't be working for the next few days, so please don't wait for any more replies from me.

-------------

PR: https://git.openjdk.org/jdk/pull/11383

From haosun at openjdk.org Mon Dec 5 11:44:09 2022
From: haosun at openjdk.org (Hao Sun)
Date: Mon, 5 Dec 2022 11:44:09 GMT
Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long
In-Reply-To:
References:
Message-ID:

On Mon, 5 Dec 2022 09:13:16 GMT, Quan Anh Mai wrote:

>>> The motivation for these intrinsics, aside from unsigned comparison being a fairly basic operation, is for range checks of a load/store, which has the form of `0 <= offset && offset <= length - size`, by transforming this into `0 <= length - size && offset u<= length - size`, the first comparison as well as the computation of `length - size` can be hoisted out of the loop, which results in a single operation being loop varying. The reason this may not be recognised effectively by the idealiser is that if `size` is a constant, `(length - size) + MIN_VALUE` being folded into `length + (MIN_VALUE - size)`, breaks the pattern.
>>>
>> @merykitty Thanks for your explanation.
>> But I'm afraid I didn't fully get your point.
>> I think `offset u<= length - size` will be matched with `CmpU` node, rather than `CmpU3` node, right?
>> >>> @shqking I think your benchmark is not good, as the error is higher than the actual difference between samples, and the cost of the division may swallow everything inside the loop. >>> >> How about the case as suggested by aph, i.e. shown in the previous comment? >> >> Thanks~ > > @shqking > >> But I'm afraid I didn't fully get your point. > I think offset u<= length - size will be matched with CmpU node, rather than CmpU3 node, right? > > `CmpU3` serves the same purpose as `CmpL3` or `CmpD3`, that is to be an intermediate node that gets elided immediately. For example, this piece of code: > > static int square(long x, long y) { > return x > y ? 1 : 0; > } > > gets compiled into: > > static int square(long, long); > 0: lload_0 > 1: lload_2 > 2: lcmp > 3: ifle 10 > 6: iconst_1 > 7: goto 11 > 10: iconst_0 > 11: ireturn > > `lcmp` is parsed into a `CmpL3`, so the comparison part is parsed as `CmpI (CmpL3 x y) 0`, which is then transformed into `CmpL x y`. > > Similarly, `Integer.compareUnsigned(x, y) > 0` is parsed as `CmpI (CmpU3 x y) 0`, which is then transformed into `CmpU x y`. > >> How about the case as suggested by aph, i.e. shown in the previous comment? > > I think it is better but I still think measuring the performance of `CmpU3` node is quite irrelevent. > > Thanks. @merykitty Thanks a lot for your prompt explanation. Here is my understanding of node `CmpU3`. 1) Generation: As an intrinsic, `Integer.compareUnsigned()` will be compiled into `CmpU3` node. The corresponding JMH test case is `compareUnsignedDirect()` function. 2) Optimization: Similar to `CmpL3`, `CmpF3` and `CmpD3` nodes, idealization optimization is conducted, transforming `CmpI (CmpU3 x y) 0` into `CmpU x y`. As you mentioned, in this case `CmpU3` node works as an intermediate node. The corresponding JMH test case is `compareUnsignedIndirect()` function. Here list two quotes from your previous comments. 
> IMO the direct result of the method is less important, because the contract does not have any promise with respect to the exact return value, and the only thing that can be done with it is to compare it with 0, which will certainly be folded into a CmpU node.
>
> I think it is better but I still think measuring the performance of CmpU3 node is quite irrelevant.
>

I guess your point is that it's irrelevant to evaluate the performance of the `CmpU3` node mainly because **in most scenarios, the result of `Integer.compareUnsigned(x, y)` would be compared with 0**, and `CmpU3` will finally be optimized out.

If my understanding is correct, may I ask how we can draw this conclusion: from observation, or from statistics on how the `Integer.compareUnsigned(x, y)` method is used in existing popular Java applications?

Besides, I believe you may already understand what we (aph and I) are concerned about, but I'd like to describe it again in case I didn't make myself clear.
Our discussion was actually about the performance of the `CmpU3` node, because on AArch64 a `cmp; cset; cneg` sequence would be generated, which has a predictable cost, in contrast to the C2-generated branchy code, where the advantage of branch prediction can be utilized.
From the performance data shown previously, I'm afraid it's hard to say that the intrinsified code is always better than the C2-compiled one.

Thanks.

-------------

PR: https://git.openjdk.org/jdk/pull/11383

From haosun at openjdk.org Mon Dec 5 11:54:59 2022
From: haosun at openjdk.org (Hao Sun)
Date: Mon, 5 Dec 2022 11:54:59 GMT
Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long
In-Reply-To:
References:
Message-ID:

On Mon, 5 Dec 2022 11:25:04 GMT, Andrew Haley wrote:

>> x86 implemented the intrinsics for compareUnsigned() method in Integer and Long. See JDK-8283726. We add the corresponding AArch64 backend support in this patch.
>>
>> Note-1: minor style issues are fixed for CmpL3 related rules.
>> >> Note-2: Jtreg case TestCompareUnsigned.java is updated to cover the matching rules for "comparing reg with imm" case. >> >> Testing: tier1~3 passed on Linux/AArch64 platform with no new failures. >> >> Following is the performance data for the JMH case: >> >> >> Before After >> Benchmark (size) Mode Cnt Score Error Score Error Units >> Integers.compareUnsignedDirect 500 avgt 5 0.994 ? 0.001 0.872 ? 0.015 us/op >> Integers.compareUnsignedIndirect 500 avgt 5 0.991 ? 0.001 0.833 ? 0.055 us/op >> Longs.compareUnsignedDirect 500 avgt 5 1.052 ? 0.001 0.974 ? 0.057 us/op >> Longs.compareUnsignedIndirect 500 avgt 5 1.053 ? 0.001 0.916 ? 0.038 us/op > > The problem as I see it: the intrinsic results in worse performance for the 3-way case if the result is highly predictable. > And, in many cases such as bounds checking, the result will surely be highly predictable. > Therefore, on average, using this intrinsic for the 3-way case may not improve things and could make them worse. > > Having said all of that, I believe that directly using the 3-way result may be so rare that we don't need to care about it. All that we really should optimize is the `Integer.compareUnsigned(x,y) cmp 0`. As long as that doesn't regress, I'm happy. > > I won't be working for the next few days, so please don't wait for any more replies from me. Thanks for your comment. @theRealAph I didn't see your comment until I sent out my reply to @merykitty 's comment. :) > The problem as I see it: the intrinsic results in worse performance for the 3-way case if the result is highly predictable. And, in many cases such as bounds checking, the result will surely be highly predictable. Therefore, on average, using this intrinsic for the 3-way case may not improve things and could make them worse. > Yes. It seems so. > Having said all of that, I believe that directly using the 3-way result may be so rare that we don't need to care about it. 
All that we really should optimize is the `Integer.compareUnsigned(x,y) cmp 0`. > Yes. I think it's also @merykitty 's point. ------------- PR: https://git.openjdk.org/jdk/pull/11383 From shade at openjdk.org Mon Dec 5 12:03:09 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 5 Dec 2022 12:03:09 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v7] In-Reply-To: <2KN5cNitfcJDVq8qoZZuLAC5iI_H7q9wUnRXaIG7yS4=.febb959b-b25e-4bc2-b430-fa055eb4e3b5@github.com> References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> <2KN5cNitfcJDVq8qoZZuLAC5iI_H7q9wUnRXaIG7yS4=.febb959b-b25e-4bc2-b430-fa055eb4e3b5@github.com> Message-ID: On Mon, 5 Dec 2022 07:27:27 GMT, Aleksey Shipilev wrote: >> If you look at generated code for the JMH benchmark like: >> >> >> public class ArrayRead { >> @Param({"1", "100", "10000", "1000000"}) >> int size; >> >> int[] is; >> >> @Setup >> public void setup() { >> is = new int[size]; >> for (int c = 0; c < size; c++) { >> is[c] = c; >> } >> } >> >> @Benchmark >> public void test(Blackhole bh) { >> for (int i = 0; i < is.length; i++) { >> bh.consume(is[i]); >> } >> } >> } >> >> >> ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop. >> >> This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible. 
>> >> We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only "prevent dead code elimination" part, as minimally required by blackhole semantics. >> >> Motivational improvements on the test above: >> >> >> Benchmark (size) Mode Cnt Score Error Units >> >> # Before, full Java blackholes >> ArrayRead.test 1 avgt 9 5.422 ? 0.023 ns/op >> ArrayRead.test 100 avgt 9 460.619 ? 0.421 ns/op >> ArrayRead.test 10000 avgt 9 44697.909 ? 1964.787 ns/op >> ArrayRead.test 1000000 avgt 9 4332723.304 ? 2791.324 ns/op >> >> # Before, compiler blackholes >> ArrayRead.test 1 avgt 9 1.791 ? 0.007 ns/op >> ArrayRead.test 100 avgt 9 114.103 ? 1.677 ns/op >> ArrayRead.test 10000 avgt 9 8528.544 ? 52.010 ns/op >> ArrayRead.test 1000000 avgt 9 1005139.070 ? 2883.011 ns/op >> >> # After, compiler blackholes >> ArrayRead.test 1 avgt 9 1.686 ? 0.006 ns/op ; ~1.1x better >> ArrayRead.test 100 avgt 9 16.249 ? 0.019 ns/op ; ~7.0x better >> ArrayRead.test 10000 avgt 9 1375.265 ? 2.420 ns/op ; ~6.2x better >> ArrayRead.test 1000000 avgt 9 136862.574 ? 1057.100 ns/op ; ~7.3x better >> >> >> `-prof perfasm` shows the reason for these improvements clearly: >> >> Before: >> >> >> ? 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1 >> 7.97% ? 0x00007f0b54498365: cmp %edx,%r11d >> 1.27% ? 0x00007f0b54498368: jae 0x00007f0b5449838f >> ? 0x00007f0b5449836a: shl $0x3,%r10 >> 0.03% ? 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]" >> 7.76% ? 0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is" >> 0.24% ? 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1 >> 17.48% ? 0x00007f0b5449837e: inc %r11d ; i++ >> 0.17% ? 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2 >> 53.26% ? 0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check >> 4.84% ? 0x00007f0b54498388: cmp %edx,%r11d >> 0.31% ? 0x00007f0b5449838b: jl 0x00007f0b54498360 >> >> >> After: >> >> >> >> ? 
0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read >> 19.66% ? 0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx >> 0.14% ? 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx >> 22.09% ? 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx >> 0.21% ? 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx >> 20.19% ? 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx >> 0.04% ? 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx >> 24.02% ? 0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx >> 0.21% ? 0x00007fa06c49a8d8: add $0x8,%r10d ; i += 8 >> ? 0x00007fa06c49a8dc: cmp %esi,%r10d >> 0.07% ? 0x00007fa06c49a8df: jl 0x00007fa06c49a8b0 >> >> >> Additional testing: >> - [x] Eyeballing JMH Samples `-prof perfasm` >> - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole` >> - [x] Linux x86_64 fastdebug, JDK benchmark corpus > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: > > - Merge branch 'master' into JDK-8296545-blackhole-effects > - Merge branch 'master' into JDK-8296545-blackhole-effects > - Merge branch 'master' into JDK-8296545-blackhole-effects > - Add comment in cfgnode.hpp > - Blackhole as CFG node > - Merge branch 'master' into JDK-8296545-blackhole-effects > - Blackhole should be AliasIdxTop > - Do not touch memory at all > - Fix Experiments look fine. I am integrating to get this to JDK 20. 
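
As a source-level illustration of what "hoisted out of the loop" means in the quoted description, here is a minimal Java sketch. It is not part of the patch and does not use JMH; the class and method names are hypothetical. The first method is the shape as written, where the field `is` and `is.length` are named on every iteration. The second is the hand-hoisted shape, where both are read once before the loop. The sketch makes no claim about what C2 generates for either method; it only shows the two shapes the description contrasts.

```java
public class HoistingSketch {
    // Field holding the array, like `is` in the quoted benchmark.
    int[] is = {3, 1, 4, 1, 5, 9, 2, 6};

    // As written in source: the loop condition and the body both go
    // through the field on every iteration.
    long sumRereading() {
        long sum = 0;
        for (int i = 0; i < is.length; i++) {
            sum += is[i];
        }
        return sum;
    }

    // Hand-hoisted shape: the field and the array length are read once,
    // before the loop, and the loop only touches locals.
    long sumHoisted() {
        int[] local = is;
        int n = local.length;
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += local[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        HoistingSketch s = new HoistingSketch();
        // Both shapes compute the same value; only where the loads happen differs.
        System.out.println(s.sumRereading() + " " + s.sumHoisted());
    }
}
```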
------------- PR: https://git.openjdk.org/jdk/pull/11041 From shade at openjdk.org Mon Dec 5 12:03:10 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 5 Dec 2022 12:03:10 GMT Subject: Integrated: 8296545: C2 Blackholes should allow load optimizations In-Reply-To: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: On Tue, 8 Nov 2022 15:48:01 GMT, Aleksey Shipilev wrote: > If you look at generated code for the JMH benchmark like: > > > public class ArrayRead { > @Param({"1", "100", "10000", "1000000"}) > int size; > > int[] is; > > @Setup > public void setup() { > is = new int[size]; > for (int c = 0; c < size; c++) { > is[c] = c; > } > } > > @Benchmark > public void test(Blackhole bh) { > for (int i = 0; i < is.length; i++) { > bh.consume(is[i]); > } > } > } > > > ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop. > > This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible. > > We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only "prevent dead code elimination" part, as minimally required by blackhole semantics. > > Motivational improvements on the test above: > > > Benchmark (size) Mode Cnt Score Error Units > > # Before, full Java blackholes > ArrayRead.test 1 avgt 9 5.422 ? 
0.023 ns/op > ArrayRead.test 100 avgt 9 460.619 ? 0.421 ns/op > ArrayRead.test 10000 avgt 9 44697.909 ? 1964.787 ns/op > ArrayRead.test 1000000 avgt 9 4332723.304 ? 2791.324 ns/op > > # Before, compiler blackholes > ArrayRead.test 1 avgt 9 1.791 ? 0.007 ns/op > ArrayRead.test 100 avgt 9 114.103 ? 1.677 ns/op > ArrayRead.test 10000 avgt 9 8528.544 ? 52.010 ns/op > ArrayRead.test 1000000 avgt 9 1005139.070 ? 2883.011 ns/op > > # After, compiler blackholes > ArrayRead.test 1 avgt 9 1.686 ? 0.006 ns/op ; ~1.1x better > ArrayRead.test 100 avgt 9 16.249 ? 0.019 ns/op ; ~7.0x better > ArrayRead.test 10000 avgt 9 1375.265 ? 2.420 ns/op ; ~6.2x better > ArrayRead.test 1000000 avgt 9 136862.574 ? 1057.100 ns/op ; ~7.3x better > > > `-prof perfasm` shows the reason for these improvements clearly: > > Before: > > > ? 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1 > 7.97% ? 0x00007f0b54498365: cmp %edx,%r11d > 1.27% ? 0x00007f0b54498368: jae 0x00007f0b5449838f > ? 0x00007f0b5449836a: shl $0x3,%r10 > 0.03% ? 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]" > 7.76% ? 0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is" > 0.24% ? 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1 > 17.48% ? 0x00007f0b5449837e: inc %r11d ; i++ > 0.17% ? 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2 > 53.26% ? 0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check > 4.84% ? 0x00007f0b54498388: cmp %edx,%r11d > 0.31% ? 0x00007f0b5449838b: jl 0x00007f0b54498360 > > > After: > > > > ? 0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read > 19.66% ? 0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx > 0.14% ? 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx > 22.09% ? 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx > 20.19% ? 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx > 0.04% ? 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx > 24.02% ? 
0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx
>   0.21%   ?  0x00007fa06c49a8d8: add $0x8,%r10d               ; i += 8
>           ?  0x00007fa06c49a8dc: cmp %esi,%r10d
>   0.07%   ?  0x00007fa06c49a8df: jl 0x00007fa06c49a8b0
> 
> 
> Additional testing:
>  - [x] Eyeballing JMH Samples `-prof perfasm`
>  - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole`
>  - [x] Linux x86_64 fastdebug, JDK benchmark corpus

This pull request has now been integrated.

Changeset: eab0ada3
Author: Aleksey Shipilev
URL: https://git.openjdk.org/jdk/commit/eab0ada3a16a432fdfd1f0b8fceca149c725451b
Stats: 210 lines in 7 files changed: 167 ins; 42 del; 1 mod

8296545: C2 Blackholes should allow load optimizations

Reviewed-by: kvn, vlivanov

-------------

PR: https://git.openjdk.org/jdk/pull/11041

From haosun at openjdk.org Mon Dec 5 12:06:24 2022
From: haosun at openjdk.org (Hao Sun)
Date: Mon, 5 Dec 2022 12:06:24 GMT
Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long [v2]
In-Reply-To:
References:
Message-ID:

> x86 implemented the intrinsics for compareUnsigned() method in Integer and Long. See JDK-8283726. We add the corresponding AArch64 backend support in this patch.
> 
> Note-1: minor style issues are fixed for CmpL3 related rules.
> 
> Note-2: Jtreg case TestCompareUnsigned.java is updated to cover the matching rules for "comparing reg with imm" case.
> 
> Testing: tier1~3 passed on Linux/AArch64 platform with no new failures.
> 
> Following is the performance data for the JMH case:
> 
> 
>                                                        Before           After
> Benchmark                         (size)  Mode  Cnt  Score   Error    Score   Error  Units
> Integers.compareUnsignedDirect       500  avgt    5  0.994 ± 0.001    0.872 ± 0.015  us/op
> Integers.compareUnsignedIndirect     500  avgt    5  0.991 ± 0.001    0.833 ± 0.055  us/op
> Longs.compareUnsignedDirect          500  avgt    5  1.052 ± 0.001    0.974 ± 0.057  us/op
> Longs.compareUnsignedIndirect        500  avgt    5  1.053 ± 0.001    0.916 ± 0.038  us/op

Hao Sun has updated the pull request incrementally with one additional commit since the last revision:

  immIAddSub is always positive

  As commented by aph, "immIAddSub" is always positive and we needn't check the signedness.

  Besides, more "comparing reg with imm" test cases are added.

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/11383/files
  - new: https://git.openjdk.org/jdk/pull/11383/files/4f727748..ef39db22

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=11383&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11383&range=00-01

  Stats: 81 lines in 2 files changed: 54 ins; 15 del; 12 mod
  Patch: https://git.openjdk.org/jdk/pull/11383.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/11383/head:pull/11383

PR: https://git.openjdk.org/jdk/pull/11383

From eosterlund at openjdk.org Mon Dec 5 13:12:13 2022
From: eosterlund at openjdk.org (Erik Österlund)
Date: Mon, 5 Dec 2022 13:12:13 GMT
Subject: RFR: 8297235: ZGC: assert(regs[i] != regs[j]) failed: Multiple uses of register: rax
In-Reply-To:
References:
Message-ID:

On Tue, 29 Nov 2022 09:50:11 GMT, Axel Boldt-Christmas wrote:

> Tests java/util/stream/test/org/openjdk/tests/java/util/* with -XX:+UseZGC -Xcomp -XX:-TieredCompilation crashes with `assert(regs[i] != regs[j]) failed: Multiple uses of register: rax`. More specifically compilation of java.util.concurrent.ForkJoinTask::awaitDone.
> 
> The reason seems to be that the compare value and the memory input ends up sharing a register. (Uses Unsafe CAS which CAS an object reference into a field of that object, `oldval: rax` and `mem: [rax+offset]`). The Z load barrier stub dispatch implementation require that the reference and reference address occupy distinct registers. In the loadP nodes this is established by marking all but the memory TEMP which results in no sharing.
> 
> This is not possible for the CompareAndSwapP / CompareAndExchangeP nodes as the compare value is an input node.
> > The solution proposed here is less than ideal as it makes the CAS nodes require one extra TEMP register, which in the common case is unused. This puts unnecessary extra strain on the register allocation. The problem is that there is no way currently (that I can find) to express in .ad that a memory input must not share registers with a specific other input. > > There is an alternative solution for this specific crash which does not use a second TEMP register (see commit: cfd5ced4e97e986fc10c5a8721b543cd3101c58a). It accomplish this by using the same trick that the aarch64 Z CAS node uses which is to specify the memory as indirect which results in the address being LEA into a register. However from what I can see this does not guarantee that the address and the reference does not share a register (`oldval: rax` and `mem: [rax]`). So it is theoretically broken, (and so is the aarch64 implementation). > > It is unclear to me if there is ever a way for C2 to generation a CAS which compares the address of the field with its content. > > I call on anyone with more knowledge about `adlc` and `C2` for feedback. And specifically I want to open up a discussion with these points: > * Is there some other way of expressing in the .ad file that a memory input should not share some register? > * If not, is this a worthwhile RFE? As it seems to be a patterned used at least in other places in Z. > * Will the indirect input ever share a register with oldval and/or are the aarch64/riscv implementations broken because of this? How about ppc? > > Testing: linux-x64 zgc tagged tests tier 1-7 and some specific crashing tests with `-XX:+UseZGC -Xcomp -XX:-TieredCompilation` (in: java/util/stream/, java/util/concurrent/) I think that changing memory to indirect in the mach node matching solves the conjoint register problem. The address then becomes a field, and the type of the expected value is an oop. For them to be the same register would be seemingly impossible. 
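
For readers following the register-sharing discussion, here is a minimal Java sketch of the access shape described in the quoted report: a compare-and-set on a field of an object where the expected value is that same object, so base address and compare value come from one reference. The class and method names (`SelfCasDemo`, `Node`, `casSelf`) are hypothetical, and the use of `VarHandle` instead of `Unsafe` is an assumption made to keep the example self-contained. It only illustrates the pattern and is not claimed to reproduce the assert.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class SelfCasDemo {
    // Hypothetical holder type; 'ref' may point back at the holder itself.
    static final class Node {
        volatile Object ref;
    }

    private static final VarHandle REF;
    static {
        try {
            REF = MethodHandles.lookup().findVarHandle(Node.class, "ref", Object.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // CAS on a field of 'n' with 'n' itself as the expected value:
    // the memory operand is based on the same reference as the compare value.
    static boolean casSelf(Node n, Object newValue) {
        return REF.compareAndSet(n, n, newValue);
    }

    public static void main(String[] args) {
        Node n = new Node();
        n.ref = n;                                // field initially refers to its holder
        System.out.println(casSelf(n, "done"));   // field held 'n', so the CAS succeeds
        System.out.println(casSelf(n, "again"));  // field now holds "done", so the CAS fails
    }
}
```

Whether C2 actually assigns both operands to one register depends on matching and register allocation, which this sketch does not control.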
------------- PR: https://git.openjdk.org/jdk/pull/11410 From xlinzheng at openjdk.org Mon Dec 5 13:24:59 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Mon, 5 Dec 2022 13:24:59 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 12:07:45 GMT, Roman Kennke wrote: >> Hi Roman, >> >> I felt there was something still vague to me, so I took another look into this issue earlier today and found another interesting thing. >> >> It seems there are two issues reflected by this PR, but of course, this PR is only doing refactoring work... awesome. >> >> The other issue is, it appears to me that [1] and [2] both lack a `cb->stubs()->maybe_expand_to_ensure_remaining();` before the `align()`s. After adding the expansion logic before the two places, failures are gone (on RISC-V). >> >> So, in summary, there are two issues here (certainly, not related to this PR - this PR just interestingly triggers and lets us spot them): >> 1. `output()->_stub_list._stubs` is always a `0` value. >> 2. the missing `cb->stubs()->maybe_expand_to_ensure_remaining();` before `align()` in the shared trampoline logic, as above-mentioned. >> >> It appears to me that we already have got the expansion logic for the two stubs [3], and the size is `2048` - enough big value to cover the sizes of the stubs. >> >> I would like to humbly suggest some solutions to it: >> 1. A quick fix is to remove the `C2CodeStubList::measure_code_size()` for it always returns a `0` now (sorry for saying this), or I guess we can use some other approaches to calculate the correct node counts of the two kinds of stubs. >> 2. I guess I might need to file another PR to solve the missing expansion logic in shared trampolines. >> >> I would like to hear what you think. 
>> 
>> Best,
>> Xiaolin
>> 
>> [1] https://github.com/openjdk/jdk/blob/43d1173605128126dda0dc39ffc376b84065cc65/src/hotspot/cpu/aarch64/codeBuffer_aarch64.cpp#L55
>> [2] https://github.com/openjdk/jdk/blob/43d1173605128126dda0dc39ffc376b84065cc65/src/hotspot/cpu/riscv/codeBuffer_riscv.cpp#L56
>> [3] https://github.com/openjdk/jdk/pull/11188/files#diff-96c31ff7167c1300458cf557427ee89af5250035ecbc2f189817c793a328a502R74
> 
> I think I understand now. You are right - when code size is 'measured' we don't have any stubs, yet. That is because the stubs only get generated while all the other assembly code is emitted, i.e. after code buffers are generated. This problem is pre-existing and C2SafepointPollStub got that part wrong before.
> 
> However, we *do* call maybe_expand_to_ensure_remaining() before align(), that happens in C2CodeStubList::emit() before each stub gets emitted. I changed that part now to try expansion only with the amount of code that each stub requires instead of some maximum size. I'm also checking that each stub generates as much code as it reports it would. I am not sure how useful that is, tbh. But it helps to implement the size() methods (which you need to do now in RISCV). Start with implementing them to return 0, do a build, and change the value to what the check reports.
> 
> The only other way to improve the situation is if we would first emit the whole method into the code buffer, and then measure and create a new buffer only for the stubs. It would be very small, and I don't know if it's worth the effort or if that is possible at all. WDYT?
> Please let me know if that fixes your problem!

Hi Roman @rkennke,

I think there are no blocking issues and we are safe to move forward now; also apologies for the long thread in this PR that one may need to scroll down to see the latest messages.

I previously rebased #11414 on this PR last week and native build / hotspot tier1~4 on RISC-V/AArch64 platforms seemed all passed.
Would you then mind simply merging with the latest master when you are available so that we could help to test easily by re-fetching this PR? Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/11188 From fjiang at openjdk.org Mon Dec 5 13:43:48 2022 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 5 Dec 2022 13:43:48 GMT Subject: RFR: 8298088: RISC-V: Make Address a discriminated union internally In-Reply-To: References: Message-ID: On Mon, 5 Dec 2022 08:42:22 GMT, Fei Yang wrote: > The RISC-V port has the same issue as: https://bugs.openjdk.org/browse/JDK-8297830 > It looks to me that the fix for the AArch64 port is a nice refactoring work. > This fixes this issue for the RISC-V port with a similar approach. > > Testing: > Tier1 tested with release build on linux-riscv64 unmatched board. > Run non-trivial benchmark workloads with fastdebug builds. Nice cleanup! With one comment: src/hotspot/cpu/riscv/assembler_riscv.hpp line 33: > 31: #include "assembler_riscv.inline.hpp" > 32: #include "metaprogramming/enableIf.hpp" > 33: I think this blank line can be removed. ------------- Marked as reviewed by fjiang (Author). PR: https://git.openjdk.org/jdk/pull/11505 From fyang at openjdk.org Mon Dec 5 13:57:12 2022 From: fyang at openjdk.org (Fei Yang) Date: Mon, 5 Dec 2022 13:57:12 GMT Subject: RFR: 8298088: RISC-V: Make Address a discriminated union internally [v2] In-Reply-To: References: Message-ID: > The RISC-V port has the same issue as: https://bugs.openjdk.org/browse/JDK-8297830 > It looks to me that the fix for the AArch64 port is a nice refactoring work. > This fixes this issue for the RISC-V port with a similar approach. > > Testing: > Tier1 tested with release build on linux-riscv64 unmatched board. > Run non-trivial benchmark workloads with fastdebug builds. 
Fei Yang has updated the pull request incrementally with one additional commit since the last revision: Review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11505/files - new: https://git.openjdk.org/jdk/pull/11505/files/ab1b8401..3cbfe94f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11505&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11505&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11505.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11505/head:pull/11505 PR: https://git.openjdk.org/jdk/pull/11505 From fyang at openjdk.org Mon Dec 5 13:57:14 2022 From: fyang at openjdk.org (Fei Yang) Date: Mon, 5 Dec 2022 13:57:14 GMT Subject: RFR: 8298088: RISC-V: Make Address a discriminated union internally [v2] In-Reply-To: References: Message-ID: On Mon, 5 Dec 2022 13:24:46 GMT, Feilong Jiang wrote: >> Fei Yang has updated the pull request incrementally with one additional commit since the last revision: >> >> Review > > src/hotspot/cpu/riscv/assembler_riscv.hpp line 33: > >> 31: #include "assembler_riscv.inline.hpp" >> 32: #include "metaprogramming/enableIf.hpp" >> 33: > > I think this blank line can be removed. Fixed. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/11505 From jbhateja at openjdk.org Mon Dec 5 14:50:17 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 5 Dec 2022 14:50:17 GMT Subject: RFR: 8297689: Fix incorrect result of Short.reverseBytes() call in loops In-Reply-To: References: Message-ID: On Wed, 30 Nov 2022 07:20:11 GMT, Pengfei Li wrote: > Recently, we find calling `Short.reverseBytes()` in loops may generate incorrect result if the code is compiled by C2. Below is a simple case to reproduce. 
> > > class Foo { > static final int SIZE = 50; > static int a[] = new int[SIZE]; > > static void test() { > for (int i = 0; i < SIZE; i++) { > a[i] = Short.reverseBytes((short) a[i]); > } > } > > public static void main(String[] args) throws Exception { > Class.forName("java.lang.Short"); > a[25] = 16; > test(); > System.out.println(a[25]); > } > } > > // $ java -Xint Foo > // 4096 > // $ java -Xcomp -XX:-TieredCompilation -XX:CompileOnly=Foo.test Foo > // 268435456 > > > In this case, the `reverseBytes()` call is intrinsified and transformed into a `ReverseBytesS` node. But then C2 compiler incorrectly vectorizes it into `ReverseBytesV` with int type. C2 `Op_ReverseBytes*` has short, char, int and long versions. Their behaviors are different for different data sizes. In superword, subword operation itself doesn't have precise data size info. Instead, the data size info comes from memory operations in its use-def chain. Hence, vectorization of `reverseBytes()` is valid only if the data size is consistent with the type size of the caller's class. But current C2 compiler code lacks fine-grained type checks for `ReverseBytes*` in vector transformation. It results in `reverseBytes()` call from Short or Character class with int load/store gets vectorized incorrectly in above case. > > To fix the issue, this patch adds more checks in `VectorNode::opcode()`. T_BYTE is a special case for `Op_ReverseBytes*`. As the Java Byte class doesn't have `reverseBytes()` method so there's no `Op_ReverseBytesB`. But T_BYTE may still appear in VectorAPI calls. In this patch we still use `Op_ReverseBytesI` for T_BYTE to ensure vector intrinsification succeeds. > > Tested with hotspot::hotspot_all_no_apps, jdk tier1~3 and langtools tier1 on x86 and AArch64, no issue is found. Marked as reviewed by jbhateja (Reviewer). 
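
As a scalar reference for the two results quoted in the reproducer, the following sketch contrasts a 16-bit byte swap with a 32-bit byte swap of the same value. The class and method names (`ReverseBytesWidthDemo`, `swapAsShort`, `swapAsInt`) are hypothetical. It only uses the standard `Short.reverseBytes` and `Integer.reverseBytes` methods and says nothing about how C2 picks the vector opcode.

```java
public class ReverseBytesWidthDemo {
    // 16-bit swap: narrow to short, swap the two bytes, widen back to int.
    static int swapAsShort(int v) {
        return Short.reverseBytes((short) v);
    }

    // 32-bit swap: reverse all four bytes of the int.
    static int swapAsInt(int v) {
        return Integer.reverseBytes(v);
    }

    public static void main(String[] args) {
        // 16 is 0x0010 as a short; swapping two bytes gives 0x1000.
        System.out.println(swapAsShort(16)); // 4096
        // 16 is 0x00000010 as an int; swapping four bytes gives 0x10000000.
        System.out.println(swapAsInt(16));   // 268435456
    }
}
```

The second value matches the incorrect `-Xcomp` output in the report, which is consistent with the description that the short operation was vectorized with int element size.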
------------- PR: https://git.openjdk.org/jdk/pull/11427 From jbhateja at openjdk.org Mon Dec 5 14:50:17 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 5 Dec 2022 14:50:17 GMT Subject: RFR: 8297689: Fix incorrect result of Short.reverseBytes() call in loops In-Reply-To: References: Message-ID: On Mon, 5 Dec 2022 14:45:49 GMT, Jatin Bhateja wrote: >> Recently, we find calling `Short.reverseBytes()` in loops may generate incorrect result if the code is compiled by C2. Below is a simple case to reproduce. >> >> >> class Foo { >> static final int SIZE = 50; >> static int a[] = new int[SIZE]; >> >> static void test() { >> for (int i = 0; i < SIZE; i++) { >> a[i] = Short.reverseBytes((short) a[i]); >> } >> } >> >> public static void main(String[] args) throws Exception { >> Class.forName("java.lang.Short"); >> a[25] = 16; >> test(); >> System.out.println(a[25]); >> } >> } >> >> // $ java -Xint Foo >> // 4096 >> // $ java -Xcomp -XX:-TieredCompilation -XX:CompileOnly=Foo.test Foo >> // 268435456 >> >> >> In this case, the `reverseBytes()` call is intrinsified and transformed into a `ReverseBytesS` node. But then C2 compiler incorrectly vectorizes it into `ReverseBytesV` with int type. C2 `Op_ReverseBytes*` has short, char, int and long versions. Their behaviors are different for different data sizes. In superword, subword operation itself doesn't have precise data size info. Instead, the data size info comes from memory operations in its use-def chain. Hence, vectorization of `reverseBytes()` is valid only if the data size is consistent with the type size of the caller's class. But current C2 compiler code lacks fine-grained type checks for `ReverseBytes*` in vector transformation. It results in `reverseBytes()` call from Short or Character class with int load/store gets vectorized incorrectly in above case. >> >> To fix the issue, this patch adds more checks in `VectorNode::opcode()`. T_BYTE is a special case for `Op_ReverseBytes*`. 
As the Java Byte class doesn't have `reverseBytes()` method so there's no `Op_ReverseBytesB`. But T_BYTE may still appear in VectorAPI calls. In this patch we still use `Op_ReverseBytesI` for T_BYTE to ensure vector intrinsification succeeds. >> >> Tested with hotspot::hotspot_all_no_apps, jdk tier1~3 and langtools tier1 on x86 and AArch64, no issue is found. > > Marked as reviewed by jbhateja (Reviewer). > @jatin-bhateja Do you have any comments on this change? LGTM. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/11427 From kvn at openjdk.org Mon Dec 5 17:52:22 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 5 Dec 2022 17:52:22 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v12] In-Reply-To: References: Message-ID: <0tTammu5FPtjISlW9e1AwN-psRldp2qqHcfq5_1XkTA=.a7f06e26-a0c5-41f6-ab55-c374aa5780f9@github.com> On Thu, 24 Nov 2022 17:00:42 GMT, Martin Doerr wrote: >> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Add regression test. I tentatively approve this fix (we coming close to JDK 20 fork) with possible further improvements in next releases.. I submitted our testing. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10933 From kvn at openjdk.org Mon Dec 5 19:50:32 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 5 Dec 2022 19:50:32 GMT Subject: RFR: 8297172: Fix some issues of auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` In-Reply-To: <3dmE8J0CDjMIZSJyayrid5vRkD48AD9g6zaXr0M4mWo=.9c9a050c-7406-4a7f-a3c0-98aeb80d7590@github.com> References: <3dmE8J0CDjMIZSJyayrid5vRkD48AD9g6zaXr0M4mWo=.9c9a050c-7406-4a7f-a3c0-98aeb80d7590@github.com> Message-ID: On Tue, 29 Nov 2022 02:22:35 GMT, Fei Gao wrote: > Background: > > Java API[1] for `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` returns int type, while Vector API[2] for them returns long type. Currently, to support auto-vectorization of Java API and Vector API at the same time, some vector platforms, namely aarch64 and x86, provides two types of vector nodes taking long type: One produces long vector type for vector API, and the other one produces int vector type by casting long-type result from the first one. > > We can move the casting work for auto-vectorization of Java API to the mid-end so that we can unify the vector implementation in the backend, reducing extra code. The patch does the refactoring and also fixes several issues below. > > 1. Refine the auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` > > In the patch, during the stage of generating vector node for the candidate pack, to implement the complete behavior of these Java APIs, superword will make two consecutive vector nodes: the first one, the same as Vector API, does the real execution to produce long-type result, and the second one casts the result to int vector type. > > For those platforms, which have supported correctly vectorizing these java APIs before, the patch has no real impact on final generated assembly code and, consequently, has no performance regression. > > 2. 
Fix the IR check failure of `compiler/vectorization/TestPopCountVectorLong.java` on 128-bit sve platform > > These Java APIs take a long type and produce an int type, like conversion nodes between different data sizes do. In superword, the alignment of their input nodes is different from their own. It results in that these APIs can't be vectorized when > `-XX:MaxVectorSize=16`. So, the IR check for vector nodes in `compiler/vectorization/TestPopCountVectorLong.java` would fail. To fix the issue of alignment, the patch corrects their related alignment, just like it did for conversion nodes between different data sizes. After the patch, these Java APIs can be vectorized on 128-bit platforms, as long as the auto-vectorization is profitable. > > 3. Fix the incorrect vectorization of `numberOfTrailingZeros/numberOfLeadingZeros()` in aarch64 platforms with more than 128 bits > > Although `Long.NumberOfLeadingZeros/NumberOfTrailingZeros()` can be vectorized on sve platforms when > `-XX:MaxVectorSize=32` or `-XX:MaxVectorSize=64` even before the patch, aarch64 backend didn't provide special vector implementation for Java API and thus the generated code is not correct, like: > > LOOP: > sxtw x13, w12 > add x14, x15, x13, uxtx #3 > add x17, x14, #0x10 > ld1d {z16.d}, p7/z, [x17] > // Incorrectly use integer rbit/clz insn for long type vector > *rbit z16.s, p7/m, z16.s > *clz z16.s, p7/m, z16.s > add x13, x16, x13, uxtx #2 > str q16, [x13, #16] > ... > add w12, w12, #0x20 > cmp w12, w3 > b.lt LOOP > > > It causes a runtime failure of the testcase `compiler/vectorization/TestNumberOfContinuousZeros.java` added in the patch. 
After the refactoring, the testcase can pass and the code is corrected: > > LOOP: > sxtw x13, w12 > add x14, x15, x13, uxtx #3 > add x17, x14, #0x10 > ld1d {z16.d}, p7/z, [x17] > // Compute with long vector type and convert to int vector type > *rbit z16.d, p7/m, z16.d > *clz z16.d, p7/m, z16.d > *mov z24.d, #0 > *uzp1 z25.s, z16.s, z24.s > add x13, x16, x13, uxtx #2 > str q25, [x13, #16] > ... > add w12, w12, #0x20 > cmp w12, w3 > b.lt LOOP > > > 4. Fix an assertion failure on x86 avx2 platform > > Before, on x86 avx2 platform, there is an assertion failure when C2 tries to vectorize the loops like: > > // long[] ia; > // int[] ic; > for (int i = 0; i < LENGTH; ++i) { > ic[i] = Long.numberOfLeadingZeros(ia[i]); > } > > > X86 backend supports vectorizing `numberOfLeadingZeros()` on avx2 platform, but it uses `evpmovqd()` to do casting for `CountLeadingZerosV`[3], which can only be used when `UseAVX > 2`[4]. After the refactoring, the failure can be fixed naturally. > > Tier 1~3 passed with no new failures on Linux AArch64/X86 platform. > > [1] https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#bitCount(long) > https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfTrailingZeros(long) > https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfLeadingZeros(long) > [2] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/LongVector.java#L687 > [3] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/hotspot/cpu/x86/x86.ad#L9418 > [4] https://github.com/openjdk/jdk/blob/fc616588c1bf731150a9d9b80033bb589bcb231f/src/hotspot/cpu/x86/assembler_x86.cpp#L2239 Very nicely done. I suggest to wait approval from @TobiHartmann after his testing is finished. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11405 From kvn at openjdk.org Mon Dec 5 19:50:34 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 5 Dec 2022 19:50:34 GMT Subject: RFR: 8297951: C2: Create skeleton predicates for all If nodes in loop predication In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 14:17:47 GMT, Christian Hagedorn wrote: > We currently only create skeleton predicates for `RangeCheck` nodes and not for normal `If` nodes: > https://github.com/openjdk/jdk/blob/2cb64a75578ccc15a1dfc8c2843aa11d05ca8aa7/src/hotspot/share/opto/loopPredicate.cpp#L1344-L1346 > > But it is also possible to create range check predicates in loop predication for `If` nodes if they have the right pattern checked in `PhaseIdealLoop::loop_predication_impl()` and `IdealLoopTree::is_range_check_if()`. This, however, is much more rare. > > Without skeleton predicates for these `If` nodes, we could run into the same problems already fixed for `RangeCheck` nodes (see [JDK-8193130](https://bugs.openjdk.org/browse/JDK-8193130) and related bugs). This is almost impossible to trigger in practice as it needs a very specific setup and the right optimizations to be applied. But the test case shows such a case where we hit an assert due to a broken memory graph because we are missing skeleton predicates. > > I therefore propose to always create skeleton predicates for hoisted range checks in loop predication. > > Thanks, > Christian Make sense. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11454 From dlong at openjdk.org Mon Dec 5 20:36:59 2022 From: dlong at openjdk.org (Dean Long) Date: Mon, 5 Dec 2022 20:36:59 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v12] In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 17:00:42 GMT, Martin Doerr wrote: >> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). 
It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one.
> 
> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Add regression test.

Marked as reviewed by dlong (Reviewer).

-------------

PR: https://git.openjdk.org/jdk/pull/10933

From eosterlund at openjdk.org Mon Dec 5 20:46:43 2022
From: eosterlund at openjdk.org (Erik Österlund)
Date: Mon, 5 Dec 2022 20:46:43 GMT
Subject: RFR: 8297036: Generalize C2 stub mechanism [v6]
In-Reply-To:
References:
Message-ID:

On Wed, 30 Nov 2022 17:51:31 GMT, Roman Kennke wrote:

>> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism.
>> 
>> Testing:
>>  - [x] tier1 (x86_64, x86_32, aarch64)
>>  - [x] tier2 (x86_64, x86_32, aarch64)
>>  - [x] tier3 (x86_64, x86_32, aarch64)
> 
> Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:
> 
>   More RISCV fixes

Looks good. Thanks for taking care of this!

-------------

Marked as reviewed by eosterlund (Reviewer).
PR: https://git.openjdk.org/jdk/pull/11188 From mdoerr at openjdk.org Mon Dec 5 21:13:28 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Mon, 5 Dec 2022 21:13:28 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v12] In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 17:00:42 GMT, Martin Doerr wrote: >> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Add regression test. Thanks for the reviews and all the helpful comments and ideas! I'm planning to integrate before JDK 20 rampdown once the tests have passed. Our internal testing didn't find any further problems. ------------- PR: https://git.openjdk.org/jdk/pull/10933 From yadongwang at openjdk.org Tue Dec 6 01:28:05 2022 From: yadongwang at openjdk.org (Yadong Wang) Date: Tue, 6 Dec 2022 01:28:05 GMT Subject: RFR: 8298088: RISC-V: Make Address a discriminated union internally [v2] In-Reply-To: References: Message-ID: On Mon, 5 Dec 2022 13:57:12 GMT, Fei Yang wrote: >> The RISC-V port has the same issue as: https://bugs.openjdk.org/browse/JDK-8297830 >> It looks to me that the fix for the AArch64 port is a nice refactoring work. >> This fixes this issue for the RISC-V port with a similar approach. >> >> Testing: >> Tier1 tested with release build on linux-riscv64 unmatched board. >> Run non-trivial benchmark workloads with fastdebug builds. > > Fei Yang has updated the pull request incrementally with one additional commit since the last revision: > > Review lgtm ------------- Marked as reviewed by yadongwang (Author). 
PR: https://git.openjdk.org/jdk/pull/11505 From kvn at openjdk.org Tue Dec 6 01:44:00 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 6 Dec 2022 01:44:00 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v12] In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 17:00:42 GMT, Martin Doerr wrote: >> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Add regression test. New test failed on all platforms when run with '-ea -esa -XX:CompileThreshold=100 -XX:-TieredCompilation' ----------System.out:(11/726)---------- CompileCommand: compileonly null.* bool compileonly = true [0.427s][warning][codecache] CodeHeap 'non-profiled nmethods' is full. Compiler has been disabled. [0.427s][warning][codecache] Try increasing the code heap size using -XX:NonProfiledCodeHeapSize= CodeHeap 'non-profiled nmethods': size=11236Kb used=11203Kb max_used=11203Kb free=33Kb bounds [0x00007ff0bd7f9000, 0x00007ff0be2f2000, 0x00007ff0be2f2000] CodeHeap 'non-nmethods': size=5148Kb used=1630Kb max_used=1630Kb free=3517Kb bounds [0x00007ff0bd2f2000, 0x00007ff0bd562000, 0x00007ff0bd7f9000] total_blobs=932 nmethods=51 adapters=711 compilation: disabled (not enough contiguous free space left) stopped_count=1, restarted_count=0 full_count=1 ----------System.err:(14/1222)---------- Java HotSpot(TM) 64-Bit Server VM warning: CodeHeap 'non-profiled nmethods' is full. Compiler has been disabled. 
Java HotSpot(TM) 64-Bit Server VM warning: Try increasing the code heap size using -XX:NonProfiledCodeHeapSize=
java.lang.NullPointerException: Cannot invoke "java.lang.management.MemoryPoolMXBean.getUsage()" because "" is null
	at compiler.codecache.MHIntrinsicAllocFailureTest.fillCodeCacheSegment(MHIntrinsicAllocFailureTest.java:60)
	at compiler.codecache.MHIntrinsicAllocFailureTest.main(MHIntrinsicAllocFailureTest.java:70)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
	at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:125)
	at java.base/java.lang.Thread.run(Thread.java:1599)

-------------

PR: https://git.openjdk.org/jdk/pull/10933

From svkamath at openjdk.org Tue Dec 6 02:09:59 2022
From: svkamath at openjdk.org (Smita Kamath)
Date: Tue, 6 Dec 2022 02:09:59 GMT
Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v2]
In-Reply-To:
References:
Message-ID:

> Hi All,
> 
> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's.
> Following are the performance numbers of JMH micro Fp16ConversionBenchmark:
> Before code changes:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms
> 
> After:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms
> 
> Kindly review and share your feedback.
> 
> Thanks.
> Smita

Smita Kamath has updated the pull request incrementally with one additional commit since the last revision:

  Updated code as per review comments

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/11471/files
  - new: https://git.openjdk.org/jdk/pull/11471/files/a102491c..8e7f884d

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=11471&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11471&range=00-01

  Stats: 100 lines in 10 files changed: 75 ins; 12 del; 13 mod
  Patch: https://git.openjdk.org/jdk/pull/11471.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/11471/head:pull/11471

PR: https://git.openjdk.org/jdk/pull/11471

From fyang at openjdk.org Tue Dec 6 02:37:05 2022
From: fyang at openjdk.org (Fei Yang)
Date: Tue, 6 Dec 2022 02:37:05 GMT
Subject: RFR: 8298088: RISC-V: Make Address a discriminated union internally [v2]
In-Reply-To:
References:
Message-ID:

On Mon, 5 Dec 2022 13:41:30 GMT, Feilong Jiang wrote:

>> Fei Yang has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Review
> 
> Nice cleanup! With one comment:

@feilongjiang @yadongw : Thanks for the review!
@shipilev : Want to take a look?
------------- PR: https://git.openjdk.org/jdk/pull/11505 From haosun at openjdk.org Tue Dec 6 02:46:08 2022 From: haosun at openjdk.org (Hao Sun) Date: Tue, 6 Dec 2022 02:46:08 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v6] In-Reply-To: References: Message-ID: <5SS6Tlhb-ZIwwl0XUcUxabrzv_sc2tWqUTebg6a0BzI=.90cdc150-a4ef-411a-bf17-d9dd9a3b30cc@github.com> On Wed, 30 Nov 2022 17:51:31 GMT, Roman Kennke wrote: >> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. >> >> Testing: >> - [x] tier1 (x86_64, x86_32, aarch64) >> - [x] tier2 (x86_64, x86_32, aarch64) >> - [x] tier3 (x86_64, x86_32, aarch64) > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > More RISCV fixes Kindly remind that the Oracle copyright notice should be updated to 2022 in the following files. src/hotspot/cpu/riscv/c2_CodeStubs_riscv.cpp src/hotspot/share/opto/c2_MacroAssembler.hpp src/hotspot/share/opto/output.hpp ------------- PR: https://git.openjdk.org/jdk/pull/11188 From kvn at openjdk.org Tue Dec 6 03:07:15 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 6 Dec 2022 03:07:15 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v6] In-Reply-To: References: Message-ID: On Wed, 30 Nov 2022 17:51:31 GMT, Roman Kennke wrote: >> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). 
I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. >> >> Testing: >> - [x] tier1 (x86_64, x86_32, aarch64) >> - [x] tier2 (x86_64, x86_32, aarch64) >> - [x] tier3 (x86_64, x86_32, aarch64) > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > More RISCV fixes Changes looks reasonable. My main concern is hardcoded `size()` values. Debug and product VMs may emit different code. I suggest to add an assert into `emit()` to check that size matches. ------------- Changes requested by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11188 From fyang at openjdk.org Tue Dec 6 03:46:50 2022 From: fyang at openjdk.org (Fei Yang) Date: Tue, 6 Dec 2022 03:46:50 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v6] In-Reply-To: References: Message-ID: On Wed, 30 Nov 2022 17:51:31 GMT, Roman Kennke wrote: >> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. 
>> >> Testing: >> - [x] tier1 (x86_64, x86_32, aarch64) >> - [x] tier2 (x86_64, x86_32, aarch64) >> - [x] tier3 (x86_64, x86_32, aarch64) > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > More RISCV fixes src/hotspot/share/opto/c2_CodeStubs.cpp line 52: > 50: > 51: DEBUG_ONLY(int actual_size = cb.insts_size() - size_before;) > 52: assert(size == actual_size, "Expected stub size (%d) must match actual stub size (%d)", size, actual_size); Hi, I have the same concern as @vnkozlov here. For AArch64, I see C2SafepointPollStub::emit() calls MacroAssembler::far_jump() and the size of code emitted will depend on entry.target() [1]. I guess it might be better to let C2SafepointPollStub::size() return a maximum possible size and assert here that the actual size <= that maximum value. [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L714 ------------- PR: https://git.openjdk.org/jdk/pull/11188 From haosun at openjdk.org Tue Dec 6 04:28:04 2022 From: haosun at openjdk.org (Hao Sun) Date: Tue, 6 Dec 2022 04:28:04 GMT Subject: RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v13] In-Reply-To: <955ScdreoJQ7PG5cXUmly_giKjOJx8ouU8oy1DX_GEA=.7c59dbbb-4a3b-4f35-a951-4cf0aaa6a047@github.com> References: <955ScdreoJQ7PG5cXUmly_giKjOJx8ouU8oy1DX_GEA=.7c59dbbb-4a3b-4f35-a951-4cf0aaa6a047@github.com> Message-ID: On Tue, 29 Nov 2022 14:38:57 GMT, Quan Anh Mai wrote: >> This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and do a `CmpI` to get back the flags. 
As a result, `VectorMask::alltrue` is compiled into machine code:
>>
>>     vptest xmm0, xmm1
>>     jb if_true
>>     if_false:
>>
>> instead of:
>>
>>     vptest xmm0, xmm1
>>     setb r10
>>     movzbl r10
>>     testl r10
>>     jne if_true
>>     if_false:
>>
>> The results of `jdk.incubator.vector.ArrayMismatchBenchmark` show noticeable improvements:
>>
>> Benchmark | Prefix | Size | Mode | Cnt | Score (Before) | Error (Before) | Score (After) | Error (After) | Units | Change
>> ArrayMismatchBenchmark.mismatchVectorByte | 0.5 | 9 | thrpt | 10 | 217345.383 | ± 8316.444 | 222279.381 | ± 2660.983 | ops/ms | +2.3%
>> ArrayMismatchBenchmark.mismatchVectorByte | 0.5 | 257 | thrpt | 10 | 113918.406 | ± 1618.836 | 116268.691 | ± 1291.899 | ops/ms | +2.1%
>> ArrayMismatchBenchmark.mismatchVectorByte | 0.5 | 100000 | thrpt | 10 | 702.066 | ± 72.862 | 797.806 | ± 16.429 | ops/ms | +13.6%
>> ArrayMismatchBenchmark.mismatchVectorByte | 1.0 | 9 | thrpt | 10 | 146096.564 | ± 2401.258 | 145338.910 | ± 687.453 | ops/ms | -0.5%
>> ArrayMismatchBenchmark.mismatchVectorByte | 1.0 | 257 | thrpt | 10 | 60598.181 | ± 1259.397 | 69041.519 | ± 1073.156 | ops/ms | +13.9%
>> ArrayMismatchBenchmark.mismatchVectorByte | 1.0 | 100000 | thrpt | 10 | 316.814 | ± 10.975 | 408.770 | ± 5.281 | ops/ms | +29.0%
>> ArrayMismatchBenchmark.mismatchVectorDouble | 0.5 | 9 | thrpt | 10 | 195674.549 | ± 1200.166 | 188482.433 | ± 1872.076 | ops/ms | -3.7%
>> ArrayMismatchBenchmark.mismatchVectorDouble | 0.5 | 257 | thrpt | 10 | 44357.169 | ± 473.013 | 42293.411 | ± 2838.255 | ops/ms | -4.7%
>> ArrayMismatchBenchmark.mismatchVectorDouble | 0.5 | 100000 | thrpt | 10 | 68.199 | ± 5.410 | 67.628 | ± 3.241 | ops/ms | -0.8%
>> ArrayMismatchBenchmark.mismatchVectorDouble | 1.0 | 9 | thrpt | 10 | 107722.450 | ± 1677.607 | 111060.400 | ± 982.230 | ops/ms | +3.1%
>> ArrayMismatchBenchmark.mismatchVectorDouble | 1.0 | 257 | thrpt | 10 | 16692.645 | ± 1002.599 | 21440.506 | ± 1618.266 | ops/ms | +28.4%
>> ArrayMismatchBenchmark.mismatchVectorDouble | 1.0 | 100000 | thrpt | 10 | 32.984 | ± 0.548 | 33.202 | ± 2.365 | ops/ms | +0.7%
>> ArrayMismatchBenchmark.mismatchVectorInt | 0.5 | 9 | thrpt | 10 | 335458.217 | ± 3154.842 | 379944.254 | ± 5703.134 | ops/ms | +13.3%
>> ArrayMismatchBenchmark.mismatchVectorInt | 0.5 | 257 | thrpt | 10 | 58505.302 | ± 786.312 | 56721.368 | ± 2497.052 | ops/ms | -3.0%
>> ArrayMismatchBenchmark.mismatchVectorInt | 0.5 | 100000 | thrpt | 10 | 133.037 | ± 11.415 | 139.537 | ± 4.667 | ops/ms | +4.9%
>> ArrayMismatchBenchmark.mismatchVectorInt | 1.0 | 9 | thrpt | 10 | 117943.802 | ± 2281.349 | 112409.365 | ± 2110.055 | ops/ms | -4.7%
>> ArrayMismatchBenchmark.mismatchVectorInt | 1.0 | 257 | thrpt | 10 | 27060.015 | ± 795.619 | 33756.613 | ± 826.533 | ops/ms | +24.7%
>> ArrayMismatchBenchmark.mismatchVectorInt | 1.0 | 100000 | thrpt | 10 | 57.558 | ± 8.927 | 66.951 | ± 4.381 | ops/ms | +16.3%
>> ArrayMismatchBenchmark.mismatchVectorLong | 0.5 | 9 | thrpt | 10 | 182963.715 | ± 1042.497 | 182438.405 | ± 2120.832 | ops/ms | -0.3%
>> ArrayMismatchBenchmark.mismatchVectorLong | 0.5 | 257 | thrpt | 10 | 36672.215 | ± 614.821 | 35397.398 | ± 1609.235 | ops/ms | -3.5%
>> ArrayMismatchBenchmark.mismatchVectorLong | 0.5 | 100000 | thrpt | 10 | 66.438 | ± 2.142 | 65.427 | ± 2.270 | ops/ms | -1.5%
>> ArrayMismatchBenchmark.mismatchVectorLong | 1.0 | 9 | thrpt | 10 | 110393.047 | ± 497.853 | 115165.845 | ± 5381.674 | ops/ms | +4.3%
>> ArrayMismatchBenchmark.mismatchVectorLong | 1.0 | 257 | thrpt | 10 | 14720.765 | ± 661.350 | 19871.096 | ± 201.464 | ops/ms | +35.0%
>> ArrayMismatchBenchmark.mismatchVectorLong | 1.0 | 100000 | thrpt | 10 | 30.760 | ± 0.821 | 31.933 | ± 1.352 | ops/ms | +3.8%
>>
>> I have not been able to conduct thorough testing on AVX512 and AArch64 so any help would be invaluable. Thank you very much.
>
> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 30 commits:
>
>  - Merge branch 'master' into improveVTest
>  - Merge branch 'master' into improveVTest
>  - redundant casts
>  - remove untaken code paths
>  - Merge branch 'master' into improveVTest
>  - Merge branch 'master' into improveVTest
>  - Merge branch 'master' into improveVTest
>  - fix merge problems
>  - Merge branch 'master' into improveVTest
>  - refactor x86
>  - ... and 20 more: https://git.openjdk.org/jdk/compare/2f83b5c4...1fec3d30

I'm running some tests on AArch64 platform (both Neon and SVE).
test/hotspot/jtreg/compiler/vectorapi/TestVectorTest.java line 31: > 29: /* > 30: * @test > 31: * @bug 8278471 A copy-paste error here? Suggestion: * @bug 8292289 ------------- PR: https://git.openjdk.org/jdk/pull/9855 From xlinzheng at openjdk.org Tue Dec 6 06:56:02 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Tue, 6 Dec 2022 06:56:02 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v6] In-Reply-To: References: Message-ID: <-zstMaSHxifINdQRN7JaO5pUFyxfKnbWLKzKgq_myiw=.dd331269-17d1-4436-afde-0fbefcec5f24@github.com> On Wed, 30 Nov 2022 17:51:31 GMT, Roman Kennke wrote: >> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. >> >> Testing: >> - [x] tier1 (x86_64, x86_32, aarch64) >> - [x] tier2 (x86_64, x86_32, aarch64) >> - [x] tier3 (x86_64, x86_32, aarch64) > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > More RISCV fixes The `<=` plan sounds nice to me. It seems that the `size()`s are only for expanding temp code buffers, so I think a bigger value could help, and if we can loosen them as "no longer fixed-length stubs", it would help to remove some adjustments to force making the stubs fixed-length in the RISC-V backend. Besides, we have no need to further adjust the AArch64 backend as well - the 20 for C2EntryBarrierStub is the max size it can emit. Made a simple version due to the review comments, and I'd be pleased to cover the RISC-V part. 
[riscv-11188-2.txt](https://github.com/openjdk/jdk/files/10163502/riscv-11188-2.txt) A review of the same diff with colors: https://github.com/zhengxiaolinX/jdk/commits/pull/11188-4 ------------- PR: https://git.openjdk.org/jdk/pull/11188 From chagedorn at openjdk.org Tue Dec 6 07:19:53 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 6 Dec 2022 07:19:53 GMT Subject: RFR: 8297951: C2: Create skeleton predicates for all If nodes in loop predication In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 14:17:47 GMT, Christian Hagedorn wrote: > We currently only create skeleton predicates for `RangeCheck` nodes and not for normal `If` nodes: > https://github.com/openjdk/jdk/blob/2cb64a75578ccc15a1dfc8c2843aa11d05ca8aa7/src/hotspot/share/opto/loopPredicate.cpp#L1344-L1346 > > But it is also possible to create range check predicates in loop predication for `If` nodes if they have the right pattern checked in `PhaseIdealLoop::loop_predication_impl()` and `IdealLoopTree::is_range_check_if()`. This, however, is much more rare. > > Without skeleton predicates for these `If` nodes, we could run into the same problems already fixed for `RangeCheck` nodes (see [JDK-8193130](https://bugs.openjdk.org/browse/JDK-8193130) and related bugs). This is almost impossible to trigger in practice as it needs a very specific setup and the right optimizations to be applied. But the test case shows such a case where we hit an assert due to a broken memory graph because we are missing skeleton predicates. > > I therefore propose to always create skeleton predicates for hoisted range checks in loop predication. > > Thanks, > Christian Thank you Vladimir for your review! 
------------- PR: https://git.openjdk.org/jdk/pull/11454 From chagedorn at openjdk.org Tue Dec 6 07:22:02 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 6 Dec 2022 07:22:02 GMT Subject: Integrated: 8297951: C2: Create skeleton predicates for all If nodes in loop predication In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 14:17:47 GMT, Christian Hagedorn wrote: > We currently only create skeleton predicates for `RangeCheck` nodes and not for normal `If` nodes: > https://github.com/openjdk/jdk/blob/2cb64a75578ccc15a1dfc8c2843aa11d05ca8aa7/src/hotspot/share/opto/loopPredicate.cpp#L1344-L1346 > > But it is also possible to create range check predicates in loop predication for `If` nodes if they have the right pattern checked in `PhaseIdealLoop::loop_predication_impl()` and `IdealLoopTree::is_range_check_if()`. This, however, is much more rare. > > Without skeleton predicates for these `If` nodes, we could run into the same problems already fixed for `RangeCheck` nodes (see [JDK-8193130](https://bugs.openjdk.org/browse/JDK-8193130) and related bugs). This is almost impossible to trigger in practice as it needs a very specific setup and the right optimizations to be applied. But the test case shows such a case where we hit an assert due to a broken memory graph because we are missing skeleton predicates. > > I therefore propose to always create skeleton predicates for hoisted range checks in loop predication. > > Thanks, > Christian This pull request has now been integrated. 
Changeset: 0bd04a65
Author: Christian Hagedorn
URL: https://git.openjdk.org/jdk/commit/0bd04a658963c1126faa776cb8a96c23beb5e3e6
Stats: 85 lines in 2 files changed: 78 ins; 2 del; 5 mod

8297951: C2: Create skeleton predicates for all If nodes in loop predication

Reviewed-by: thartmann, kvn

-------------

PR: https://git.openjdk.org/jdk/pull/11454

From epeter at openjdk.org Tue Dec 6 08:11:09 2022
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 6 Dec 2022 08:11:09 GMT
Subject: RFR: 8257197: Add additional verification code to PhaseCCP
Message-ID:

**Targeted for JDK-21.**

We have had many bugs that could be traced back to missing optimizations during PhaseCCP. Often the problem is that a node `x` has an input (of input of input) `y` which is modified, but `y` does not notify `x` (does not push it to the worklist). For one, this is a missed opportunity to optimize, but it can also lead to failed assumptions later: often we assume that all that can be optimized is already optimized.

This verification helps us debug faster, and can also help when `Value` optimizations suddenly do further-reaching traversals, which would require further-reaching notifications.

Sadly, the verification is not total: we have some exceptions. Especially for `LoadNode`, which performs a walk up the memory inputs, which can go arbitrarily far, to look for related `StoreNode`s. Notification would thus have to put very many nodes on the worklist. The question is if the potential additional optimization is worth the compile-time. If we decide yes, then we might want to implement a listener-style notification: when a node visits inputs during `Value`, it could subscribe to all (or the relevant) visited input-nodes for future updates. Currently, we just mostly do fixed 1-hop or 2-hop notification of output nodes.
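The difference between fixed-hop and listener-style notification can be sketched with a toy worklist model. This is only an illustration, not HotSpot code: `ListenerWorklistSketch`, `Node`, `evaluate`, `propagate`, and `enqueueDependents` are hypothetical names invented for this example, and the integer `value` merely stands in for a node's type. Node `x` reads its value from `y` two inputs up, through an intermediate node `m` whose own value does not depend on `y`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model (hypothetical, not HotSpot): nodes are re-evaluated from a worklist.
final class ListenerWorklistSketch {
    static final class Node {
        final Node input;   // single input; null for a source node
        final int hops;     // how many inputs up evaluate() looks
        int value;          // stands in for the node's current type
        final Set<Node> outputs = new LinkedHashSet<>();     // direct users (1 hop)
        final Set<Node> subscribers = new LinkedHashSet<>(); // nodes that visited this node

        Node(Node input, int hops, int value) {
            this.input = input;
            this.hops = hops;
            this.value = value;
            if (input != null) input.outputs.add(this);
        }

        // Walks up to `hops` inputs and returns the value found there.
        // If `subscribe` is set, registers this node on every visited input.
        int evaluate(boolean subscribe) {
            Node n = this;
            for (int i = 0; i < hops && n.input != null; i++) {
                n = n.input;
                if (subscribe) n.subscribers.add(this);
            }
            return n == this ? value : n.value;
        }
    }

    // Fixed 1-hop notification pushes direct outputs only;
    // listener style additionally pushes every subscriber.
    static void enqueueDependents(Node n, boolean listenerStyle, Deque<Node> worklist) {
        worklist.addAll(n.outputs);
        if (listenerStyle) worklist.addAll(n.subscribers);
    }

    // Re-evaluates dependents of `changed` until nothing changes any more.
    static void propagate(Node changed, boolean listenerStyle) {
        Deque<Node> worklist = new ArrayDeque<>();
        enqueueDependents(changed, listenerStyle, worklist);
        while (!worklist.isEmpty()) {
            Node n = worklist.poll();
            int v = n.evaluate(listenerStyle);
            if (v != n.value) {
                n.value = v;
                enqueueDependents(n, listenerStyle, worklist);
            }
        }
    }

    // Builds y <- m <- x, updates y, and returns what x ends up with.
    static int run(boolean listenerStyle) {
        Node y = new Node(null, 0, 1);
        Node m = new Node(y, 0, 5);  // value does not depend on y
        Node x = new Node(m, 2, 0);  // looks through m at y
        x.value = x.evaluate(listenerStyle);
        y.value = 10;                // y is modified
        propagate(y, listenerStyle);
        return x.value;
    }

    public static void main(String[] args) {
        System.out.println("fixed 1-hop: x = " + run(false)); // x keeps the stale value 1
        System.out.println("listener:    x = " + run(true));  // x is updated to 10
    }
}
```

With fixed 1-hop notification only `m` is pushed when `y` changes; `m` does not change, so `x` is never revisited. With subscriptions, `x` is pushed directly. The cost is the extra bookkeeping per visited input, which is the compile-time trade-off raised above.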
FYI: I plan to do a similar verification, and a refactoring of `PhaseCCP::push_child_nodes_to_worklist` and `PhaseIterGVN::add_users_to_worklist` in [JDK-8298094](https://bugs.openjdk.org/browse/JDK-8298094).

-------------

Commit messages:
 - 8257197: Add additional verification code to PhaseCCP

Changes: https://git.openjdk.org/jdk/pull/11529/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11529&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8257197
  Stats: 54 lines in 2 files changed: 54 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/11529.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/11529/head:pull/11529

PR: https://git.openjdk.org/jdk/pull/11529

From haosun at openjdk.org Tue Dec 6 08:26:56 2022
From: haosun at openjdk.org (Hao Sun)
Date: Tue, 6 Dec 2022 08:26:56 GMT
Subject: RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v13]
In-Reply-To: <955ScdreoJQ7PG5cXUmly_giKjOJx8ouU8oy1DX_GEA=.7c59dbbb-4a3b-4f35-a951-4cf0aaa6a047@github.com>
References: <955ScdreoJQ7PG5cXUmly_giKjOJx8ouU8oy1DX_GEA=.7c59dbbb-4a3b-4f35-a951-4cf0aaa6a047@github.com>
Message-ID:

On Tue, 29 Nov 2022 14:38:57 GMT, Quan Anh Mai wrote:

>> This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and do a `CmpI` to get back the flags. As a result, `VectorMask::alltrue` is compiled into machine code:
>>
>>     vptest xmm0, xmm1
>>     jb if_true
>>     if_false:
>>
>> instead of:
>>
>>     vptest xmm0, xmm1
>>     setb r10
>>     movzbl r10
>>     testl r10
>>     jne if_true
>>     if_false:
>>
>> The results of `jdk.incubator.vector.ArrayMismatchBenchmark` show noticeable improvements:
>>
>> Benchmark | Prefix | Size | Mode | Cnt | Score (Before) | Error (Before) | Score (After) | Error (After) | Units | Change
>> ArrayMismatchBenchmark.mismatchVectorByte | 0.5 | 9 | thrpt | 10 | 217345.383 | ± 8316.444 | 222279.381 | ± 2660.983 | ops/ms | +2.3%
>> ArrayMismatchBenchmark.mismatchVectorByte | 0.5 | 257 | thrpt | 10 | 113918.406 | ± 1618.836 | 116268.691 | ± 1291.899 | ops/ms | +2.1%
>> ArrayMismatchBenchmark.mismatchVectorByte | 0.5 | 100000 | thrpt | 10 | 702.066 | ± 72.862 | 797.806 | ± 16.429 | ops/ms | +13.6%
>> ArrayMismatchBenchmark.mismatchVectorByte | 1.0 | 9 | thrpt | 10 | 146096.564 | ± 2401.258 | 145338.910 | ± 687.453 | ops/ms | -0.5%
>> ArrayMismatchBenchmark.mismatchVectorByte | 1.0 | 257 | thrpt | 10 | 60598.181 | ± 1259.397 | 69041.519 | ± 1073.156 | ops/ms | +13.9%
>> ArrayMismatchBenchmark.mismatchVectorByte | 1.0 | 100000 | thrpt | 10 | 316.814 | ± 10.975 | 408.770 | ± 5.281 | ops/ms | +29.0%
>> ArrayMismatchBenchmark.mismatchVectorDouble | 0.5 | 9 | thrpt | 10 | 195674.549 | ± 1200.166 | 188482.433 | ± 1872.076 | ops/ms | -3.7%
>> ArrayMismatchBenchmark.mismatchVectorDouble | 0.5 | 257 | thrpt | 10 | 44357.169 | ± 473.013 | 42293.411 | ± 2838.255 | ops/ms | -4.7%
>> ArrayMismatchBenchmark.mismatchVectorDouble | 0.5 | 100000 | thrpt | 10 | 68.199 | ± 5.410 | 67.628 | ± 3.241 | ops/ms | -0.8%
>> ArrayMismatchBenchmark.mismatchVectorDouble | 1.0 | 9 | thrpt | 10 | 107722.450 | ± 1677.607 | 111060.400 | ± 982.230 | ops/ms | +3.1%
>> ArrayMismatchBenchmark.mismatchVectorDouble | 1.0 | 257 | thrpt | 10 | 16692.645 | ± 1002.599 | 21440.506 | ± 1618.266 | ops/ms | +28.4%
>> ArrayMismatchBenchmark.mismatchVectorDouble | 1.0 | 100000 | thrpt | 10 | 32.984 | ± 0.548 | 33.202 | ± 2.365 | ops/ms | +0.7%
>> ArrayMismatchBenchmark.mismatchVectorInt | 0.5 | 9 | thrpt | 10 | 335458.217 | ± 3154.842 | 379944.254 | ± 5703.134 | ops/ms | +13.3%
>> ArrayMismatchBenchmark.mismatchVectorInt | 0.5 | 257 | thrpt | 10 | 58505.302 | ± 786.312 | 56721.368 | ± 2497.052 | ops/ms | -3.0%
>> ArrayMismatchBenchmark.mismatchVectorInt | 0.5 | 100000 | thrpt | 10 | 133.037 | ± 11.415 | 139.537 | ± 4.667 | ops/ms | +4.9%
>> ArrayMismatchBenchmark.mismatchVectorInt | 1.0 | 9 | thrpt | 10 | 117943.802 | ± 2281.349 | 112409.365 | ± 2110.055 | ops/ms | -4.7%
>> ArrayMismatchBenchmark.mismatchVectorInt | 1.0 | 257 | thrpt | 10 | 27060.015 | ± 795.619 | 33756.613 | ± 826.533 | ops/ms | +24.7%
>> ArrayMismatchBenchmark.mismatchVectorInt | 1.0 | 100000 | thrpt | 10 | 57.558 | ± 8.927 | 66.951 | ± 4.381 | ops/ms | +16.3%
>> ArrayMismatchBenchmark.mismatchVectorLong | 0.5 | 9 | thrpt | 10 | 182963.715 | ± 1042.497 | 182438.405 | ± 2120.832 | ops/ms | -0.3%
>> ArrayMismatchBenchmark.mismatchVectorLong | 0.5 | 257 | thrpt | 10 | 36672.215 | ± 614.821 | 35397.398 | ± 1609.235 | ops/ms | -3.5%
>> ArrayMismatchBenchmark.mismatchVectorLong | 0.5 | 100000 | thrpt | 10 | 66.438 | ± 2.142 | 65.427 | ± 2.270 | ops/ms | -1.5%
>> ArrayMismatchBenchmark.mismatchVectorLong | 1.0 | 9 | thrpt | 10 | 110393.047 | ± 497.853 | 115165.845 | ± 5381.674 | ops/ms | +4.3%
>> ArrayMismatchBenchmark.mismatchVectorLong | 1.0 | 257 | thrpt | 10 | 14720.765 | ± 661.350 | 19871.096 | ± 201.464 | ops/ms | +35.0%
>> ArrayMismatchBenchmark.mismatchVectorLong | 1.0 | 100000 | thrpt | 10 | 30.760 | ± 0.821 | 31.933 | ± 1.352 | ops/ms | +3.8%
>>
>> I have not been able to conduct thorough testing on AVX512 and AArch64 so any help would be invaluable. Thank you very much.
>
> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 30 commits:
>
>  - Merge branch 'master' into improveVTest
>  - Merge branch 'master' into improveVTest
>  - redundant casts
>  - remove untaken code paths
>  - Merge branch 'master' into improveVTest
>  - Merge branch 'master' into improveVTest
>  - Merge branch 'master' into improveVTest
>  - fix merge problems
>  - Merge branch 'master' into improveVTest
>  - refactor x86
>  - ... and 20 more: https://git.openjdk.org/jdk/compare/2f83b5c4...1fec3d30

test/hotspot/jtreg/compiler/vectorapi/TestVectorTest.java line 49:

> 47:
> 48:     @Test
> 49:     @IR(failOn = {IRNode.CMP_I, IRNode.CMOVEI})

Should be `CMOVE_I`.
Suggestion: @IR(failOn = {IRNode.CMP_I, IRNode.CMOVE_I}) test/hotspot/jtreg/compiler/vectorapi/TestVectorTest.java line 58: > 56: @Test > 57: @IR(failOn = {IRNode.CMP_I}) > 58: @IR(counts = {IRNode.VECTOR_TEST, "1", IRNode.CMOVEI, "1"}) ditto ------------- PR: https://git.openjdk.org/jdk/pull/9855 From thartmann at openjdk.org Tue Dec 6 08:46:03 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 6 Dec 2022 08:46:03 GMT Subject: RFR: 8297172: Fix some issues of auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` In-Reply-To: <3dmE8J0CDjMIZSJyayrid5vRkD48AD9g6zaXr0M4mWo=.9c9a050c-7406-4a7f-a3c0-98aeb80d7590@github.com> References: <3dmE8J0CDjMIZSJyayrid5vRkD48AD9g6zaXr0M4mWo=.9c9a050c-7406-4a7f-a3c0-98aeb80d7590@github.com> Message-ID: On Tue, 29 Nov 2022 02:22:35 GMT, Fei Gao wrote: > Background: > > Java API[1] for `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` returns int type, while Vector API[2] for them returns long type. Currently, to support auto-vectorization of Java API and Vector API at the same time, some vector platforms, namely aarch64 and x86, provides two types of vector nodes taking long type: One produces long vector type for vector API, and the other one produces int vector type by casting long-type result from the first one. > > We can move the casting work for auto-vectorization of Java API to the mid-end so that we can unify the vector implementation in the backend, reducing extra code. The patch does the refactoring and also fixes several issues below. > > 1. Refine the auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` > > In the patch, during the stage of generating vector node for the candidate pack, to implement the complete behavior of these Java APIs, superword will make two consecutive vector nodes: the first one, the same as Vector API, does the real execution to produce long-type result, and the second one casts the result to int vector type. 
> > For those platforms, which have supported correctly vectorizing these java APIs before, the patch has no real impact on final generated assembly code and, consequently, has no performance regression. > > 2. Fix the IR check failure of `compiler/vectorization/TestPopCountVectorLong.java` on 128-bit sve platform > > These Java APIs take a long type and produce an int type, like conversion nodes between different data sizes do. In superword, the alignment of their input nodes is different from their own. It results in that these APIs can't be vectorized when > `-XX:MaxVectorSize=16`. So, the IR check for vector nodes in `compiler/vectorization/TestPopCountVectorLong.java` would fail. To fix the issue of alignment, the patch corrects their related alignment, just like it did for conversion nodes between different data sizes. After the patch, these Java APIs can be vectorized on 128-bit platforms, as long as the auto-vectorization is profitable. > > 3. Fix the incorrect vectorization of `numberOfTrailingZeros/numberOfLeadingZeros()` in aarch64 platforms with more than 128 bits > > Although `Long.NumberOfLeadingZeros/NumberOfTrailingZeros()` can be vectorized on sve platforms when > `-XX:MaxVectorSize=32` or `-XX:MaxVectorSize=64` even before the patch, aarch64 backend didn't provide special vector implementation for Java API and thus the generated code is not correct, like: > > LOOP: > sxtw x13, w12 > add x14, x15, x13, uxtx #3 > add x17, x14, #0x10 > ld1d {z16.d}, p7/z, [x17] > // Incorrectly use integer rbit/clz insn for long type vector > *rbit z16.s, p7/m, z16.s > *clz z16.s, p7/m, z16.s > add x13, x16, x13, uxtx #2 > str q16, [x13, #16] > ... > add w12, w12, #0x20 > cmp w12, w3 > b.lt LOOP > > > It causes a runtime failure of the testcase `compiler/vectorization/TestNumberOfContinuousZeros.java` added in the patch. 
After the refactoring, the testcase can pass and the code is corrected: > > LOOP: > sxtw x13, w12 > add x14, x15, x13, uxtx #3 > add x17, x14, #0x10 > ld1d {z16.d}, p7/z, [x17] > // Compute with long vector type and convert to int vector type > *rbit z16.d, p7/m, z16.d > *clz z16.d, p7/m, z16.d > *mov z24.d, #0 > *uzp1 z25.s, z16.s, z24.s > add x13, x16, x13, uxtx #2 > str q25, [x13, #16] > ... > add w12, w12, #0x20 > cmp w12, w3 > b.lt LOOP > > > 4. Fix an assertion failure on x86 avx2 platform > > Before, on x86 avx2 platform, there is an assertion failure when C2 tries to vectorize the loops like: > > // long[] ia; > // int[] ic; > for (int i = 0; i < LENGTH; ++i) { > ic[i] = Long.numberOfLeadingZeros(ia[i]); > } > > > X86 backend supports vectorizing `numberOfLeadingZeros()` on avx2 platform, but it uses `evpmovqd()` to do casting for `CountLeadingZerosV`[3], which can only be used when `UseAVX > 2`[4]. After the refactoring, the failure can be fixed naturally. > > Tier 1~3 passed with no new failures on Linux AArch64/X86 platform. > > [1] https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#bitCount(long) > https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfTrailingZeros(long) > https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfLeadingZeros(long) > [2] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/LongVector.java#L687 > [3] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/hotspot/cpu/x86/x86.ad#L9418 > [4] https://github.com/openjdk/jdk/blob/fc616588c1bf731150a9d9b80033bb589bcb231f/src/hotspot/cpu/x86/assembler_x86.cpp#L2239 Looks good to me too. All tests passed. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11405

From thartmann at openjdk.org Tue Dec 6 09:09:48 2022
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Tue, 6 Dec 2022 09:09:48 GMT
Subject: RFR: 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph
In-Reply-To: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com>
References: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com>
Message-ID: <0GuLRAthVakXT6Pb5SVaGKGVdLndP-FKlYhKcUE59RU=.36d08204-b71b-4160-9e19-b4741de55df9@github.com>

On Thu, 1 Dec 2022 12:26:56 GMT, Christian Hagedorn wrote:

> The test cases of this bug reveal the same problem with `PhaseIdealLoop::create_new_if_for_predicate()` as in [JDK-8271954](https://bugs.openjdk.org/browse/JDK-8271954) (reusing pinned data nodes for different UCT paths into the UCT phi - see PR description of https://github.com/openjdk/jdk/pull/5185). While JDK-8271954 only fixed the usage of `PhaseIdealLoop::create_new_if_for_predicate()` for one specific case in loop unswitching, we now need this fix for other usages of `PhaseIdealLoop::create_new_if_for_predicate()` as well. I've found failing cases for most of the usages but I think we should always do a proper cloning as originally added with JDK-8271954. This is what I propose with this patch.
>
> To always do this cloning, I've replaced the `UnswitchingAction` by a `rewire_uncommon_proj_phi_inputs` bool that is false by default but can be set if we should only do a rewiring (we can still do the rewiring for the slow loop as previously done with `UnswitchingAction::SlowLoopRewiring`, also see https://github.com/openjdk/jdk/pull/5185 for more details). However, the current fix does not update ctrl for the slow loop nodes - I've fixed that.
>
> I've reused the implementation of `clone_data_nodes_for_fast_loop()` but refactored and split that method into multiple methods to reuse some of them for the ctrl update of the slow loop nodes.
>
> Thanks,
> Christian

Great job finding tests for (most of) these cases! The fix looks reasonable to me. Since this is a regression from [JDK-8252372](https://bugs.openjdk.org/browse/JDK-8252372), the bug should have affects version 17, right?

src/hotspot/share/opto/loopPredicate.cpp line 244:

> 242: // Recursively find all input nodes with the same ctrl.
> 243: Unique_Node_List PhaseIdealLoop::find_nodes_with_same_ctrl(Node* node, const ProjNode* ctrl) {
> 244:   Unique_Node_List nodes_with_same_ctrl;

Did you check if there is a `ResourceMark` close by?

-------------

Marked as reviewed by thartmann (Reviewer).

PR: https://git.openjdk.org/jdk/pull/11452

From pli at openjdk.org Tue Dec 6 09:16:05 2022
From: pli at openjdk.org (Pengfei Li)
Date: Tue, 6 Dec 2022 09:16:05 GMT
Subject: RFR: 8297689: Fix incorrect result of Short.reverseBytes() call in loops
In-Reply-To: References: Message-ID:

On Mon, 5 Dec 2022 14:46:12 GMT, Jatin Bhateja wrote:

>> Marked as reviewed by jbhateja (Reviewer).
>
>> @jatin-bhateja Do you have any comments on this change?
>
> LGTM. Thanks.

Thanks @jatin-bhateja for your review. I will integrate this.

-------------

PR: https://git.openjdk.org/jdk/pull/11427

From fgao at openjdk.org Tue Dec 6 09:16:10 2022
From: fgao at openjdk.org (Fei Gao)
Date: Tue, 6 Dec 2022 09:16:10 GMT
Subject: RFR: 8297172: Fix some issues of auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()`
In-Reply-To: References: <3dmE8J0CDjMIZSJyayrid5vRkD48AD9g6zaXr0M4mWo=.9c9a050c-7406-4a7f-a3c0-98aeb80d7590@github.com>
Message-ID:

On Mon, 5 Dec 2022 19:22:33 GMT, Vladimir Kozlov wrote:

>> Background:
>>
>> Java API[1] for `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` returns int type, while Vector API[2] for them returns long type.
Currently, to support auto-vectorization of Java API and Vector API at the same time, some vector platforms, namely aarch64 and x86, provides two types of vector nodes taking long type: One produces long vector type for vector API, and the other one produces int vector type by casting long-type result from the first one. >> >> We can move the casting work for auto-vectorization of Java API to the mid-end so that we can unify the vector implementation in the backend, reducing extra code. The patch does the refactoring and also fixes several issues below. >> >> 1. Refine the auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` >> >> In the patch, during the stage of generating vector node for the candidate pack, to implement the complete behavior of these Java APIs, superword will make two consecutive vector nodes: the first one, the same as Vector API, does the real execution to produce long-type result, and the second one casts the result to int vector type. >> >> For those platforms, which have supported correctly vectorizing these java APIs before, the patch has no real impact on final generated assembly code and, consequently, has no performance regression. >> >> 2. Fix the IR check failure of `compiler/vectorization/TestPopCountVectorLong.java` on 128-bit sve platform >> >> These Java APIs take a long type and produce an int type, like conversion nodes between different data sizes do. In superword, the alignment of their input nodes is different from their own. It results in that these APIs can't be vectorized when >> `-XX:MaxVectorSize=16`. So, the IR check for vector nodes in `compiler/vectorization/TestPopCountVectorLong.java` would fail. To fix the issue of alignment, the patch corrects their related alignment, just like it did for conversion nodes between different data sizes. After the patch, these Java APIs can be vectorized on 128-bit platforms, as long as the auto-vectorization is profitable. >> >> 3. 
Fix the incorrect vectorization of `numberOfTrailingZeros/numberOfLeadingZeros()` in aarch64 platforms with more than 128 bits >> >> Although `Long.NumberOfLeadingZeros/NumberOfTrailingZeros()` can be vectorized on sve platforms when >> `-XX:MaxVectorSize=32` or `-XX:MaxVectorSize=64` even before the patch, aarch64 backend didn't provide special vector implementation for Java API and thus the generated code is not correct, like: >> >> LOOP: >> sxtw x13, w12 >> add x14, x15, x13, uxtx #3 >> add x17, x14, #0x10 >> ld1d {z16.d}, p7/z, [x17] >> // Incorrectly use integer rbit/clz insn for long type vector >> *rbit z16.s, p7/m, z16.s >> *clz z16.s, p7/m, z16.s >> add x13, x16, x13, uxtx #2 >> str q16, [x13, #16] >> ... >> add w12, w12, #0x20 >> cmp w12, w3 >> b.lt LOOP >> >> >> It causes a runtime failure of the testcase `compiler/vectorization/TestNumberOfContinuousZeros.java` added in the patch. After the refactoring, the testcase can pass and the code is corrected: >> >> LOOP: >> sxtw x13, w12 >> add x14, x15, x13, uxtx #3 >> add x17, x14, #0x10 >> ld1d {z16.d}, p7/z, [x17] >> // Compute with long vector type and convert to int vector type >> *rbit z16.d, p7/m, z16.d >> *clz z16.d, p7/m, z16.d >> *mov z24.d, #0 >> *uzp1 z25.s, z16.s, z24.s >> add x13, x16, x13, uxtx #2 >> str q25, [x13, #16] >> ... >> add w12, w12, #0x20 >> cmp w12, w3 >> b.lt LOOP >> >> >> 4. Fix an assertion failure on x86 avx2 platform >> >> Before, on x86 avx2 platform, there is an assertion failure when C2 tries to vectorize the loops like: >> >> // long[] ia; >> // int[] ic; >> for (int i = 0; i < LENGTH; ++i) { >> ic[i] = Long.numberOfLeadingZeros(ia[i]); >> } >> >> >> X86 backend supports vectorizing `numberOfLeadingZeros()` on avx2 platform, but it uses `evpmovqd()` to do casting for `CountLeadingZerosV`[3], which can only be used when `UseAVX > 2`[4]. After the refactoring, the failure can be fixed naturally. >> >> Tier 1~3 passed with no new failures on Linux AArch64/X86 platform. 
>> >> [1] https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#bitCount(long) >> https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfTrailingZeros(long) >> https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfLeadingZeros(long) >> [2] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/LongVector.java#L687 >> [3] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/hotspot/cpu/x86/x86.ad#L9418 >> [4] https://github.com/openjdk/jdk/blob/fc616588c1bf731150a9d9b80033bb589bcb231f/src/hotspot/cpu/x86/assembler_x86.cpp#L2239 > > I suggest to wait approval from @TobiHartmann after his testing is finished. Thanks for your review and test work, @vnkozlov @TobiHartmann. I'll integrate it. ------------- PR: https://git.openjdk.org/jdk/pull/11405 From pli at openjdk.org Tue Dec 6 09:19:39 2022 From: pli at openjdk.org (Pengfei Li) Date: Tue, 6 Dec 2022 09:19:39 GMT Subject: Integrated: 8297689: Fix incorrect result of Short.reverseBytes() call in loops In-Reply-To: References: Message-ID: On Wed, 30 Nov 2022 07:20:11 GMT, Pengfei Li wrote: > Recently, we find calling `Short.reverseBytes()` in loops may generate incorrect result if the code is compiled by C2. Below is a simple case to reproduce. > > > class Foo { > static final int SIZE = 50; > static int a[] = new int[SIZE]; > > static void test() { > for (int i = 0; i < SIZE; i++) { > a[i] = Short.reverseBytes((short) a[i]); > } > } > > public static void main(String[] args) throws Exception { > Class.forName("java.lang.Short"); > a[25] = 16; > test(); > System.out.println(a[25]); > } > } > > // $ java -Xint Foo > // 4096 > // $ java -Xcomp -XX:-TieredCompilation -XX:CompileOnly=Foo.test Foo > // 268435456 > > > In this case, the `reverseBytes()` call is intrinsified and transformed into a `ReverseBytesS` node. 
But then the C2 compiler incorrectly vectorizes it into `ReverseBytesV` with int type. C2 `Op_ReverseBytes*` has short, char, int and long versions, and they behave differently for different data sizes. In superword, a subword operation itself doesn't carry precise data size info; instead, the data size info comes from the memory operations in its use-def chain. Hence, vectorization of `reverseBytes()` is valid only if the data size is consistent with the type size of the caller's class. But the current C2 compiler code lacks fine-grained type checks for `ReverseBytes*` in vector transformation. As a result, a `reverseBytes()` call from the Short or Character class with int load/store gets vectorized incorrectly in the above case. > > To fix the issue, this patch adds more checks in `VectorNode::opcode()`. T_BYTE is a special case for `Op_ReverseBytes*`: the Java Byte class doesn't have a `reverseBytes()` method, so there's no `Op_ReverseBytesB`. But T_BYTE may still appear in VectorAPI calls. In this patch we still use `Op_ReverseBytesI` for T_BYTE to ensure vector intrinsification succeeds. > > Tested with hotspot::hotspot_all_no_apps, jdk tier1~3 and langtools tier1 on x86 and AArch64; no issues were found. This pull request has now been integrated.
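For readers following the thread, here is a minimal scalar sketch of why the element width matters for 8297689. It is not part of the patch; the class and method names are made up for illustration. It only compares the 16-bit byte swap that the Java source asks for with the 32-bit byte swap that an int-typed vector lane would perform on the same value.

```java
public class ReverseBytesWidthDemo {
    // Semantics of the loop body in the reproducer: narrow the int to short,
    // swap its two bytes, then widen the result back to int.
    static int viaShort(int value) {
        return Short.reverseBytes((short) value);
    }

    // What a byte reversal over a full 32-bit lane computes instead.
    static int viaInt(int value) {
        return Integer.reverseBytes(value);
    }

    public static void main(String[] args) {
        int value = 16;                      // same input as a[25] in the reproducer
        System.out.println(viaShort(value)); // 4096      (0x0010 -> 0x1000)
        System.out.println(viaInt(value));   // 268435456 (0x00000010 -> 0x10000000)
    }
}
```

The two printed values match the `-Xint` and `-Xcomp` outputs quoted in the description, which is consistent with the explanation that the short operation was vectorized with int-sized lanes.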
Changeset: a6139985 Author: Pengfei Li URL: https://git.openjdk.org/jdk/commit/a61399854a9db8e3c0cb3f391fa557cb37e02571 Stats: 166 lines in 6 files changed: 160 ins; 2 del; 4 mod 8297689: Fix incorrect result of Short.reverseBytes() call in loops Reviewed-by: thartmann, jbhateja ------------- PR: https://git.openjdk.org/jdk/pull/11427 From fgao at openjdk.org Tue Dec 6 09:40:23 2022 From: fgao at openjdk.org (Fei Gao) Date: Tue, 6 Dec 2022 09:40:23 GMT Subject: Integrated: 8297172: Fix some issues of auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` In-Reply-To: <3dmE8J0CDjMIZSJyayrid5vRkD48AD9g6zaXr0M4mWo=.9c9a050c-7406-4a7f-a3c0-98aeb80d7590@github.com> References: <3dmE8J0CDjMIZSJyayrid5vRkD48AD9g6zaXr0M4mWo=.9c9a050c-7406-4a7f-a3c0-98aeb80d7590@github.com> Message-ID: <5DvgFNr9yCuVoVBC2D3gLhBYzzZexlQ__x3CmTYo9uw=.39fdd5ea-e464-4b7d-84a9-1c801cc59587@github.com> On Tue, 29 Nov 2022 02:22:35 GMT, Fei Gao wrote: > Background: > > Java API[1] for `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` returns int type, while Vector API[2] for them returns long type. Currently, to support auto-vectorization of Java API and Vector API at the same time, some vector platforms, namely aarch64 and x86, provides two types of vector nodes taking long type: One produces long vector type for vector API, and the other one produces int vector type by casting long-type result from the first one. > > We can move the casting work for auto-vectorization of Java API to the mid-end so that we can unify the vector implementation in the backend, reducing extra code. The patch does the refactoring and also fixes several issues below. > > 1. 
Refine the auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` > > In the patch, during the stage of generating vector node for the candidate pack, to implement the complete behavior of these Java APIs, superword will make two consecutive vector nodes: the first one, the same as Vector API, does the real execution to produce long-type result, and the second one casts the result to int vector type. > > For those platforms, which have supported correctly vectorizing these java APIs before, the patch has no real impact on final generated assembly code and, consequently, has no performance regression. > > 2. Fix the IR check failure of `compiler/vectorization/TestPopCountVectorLong.java` on 128-bit sve platform > > These Java APIs take a long type and produce an int type, like conversion nodes between different data sizes do. In superword, the alignment of their input nodes is different from their own. It results in that these APIs can't be vectorized when > `-XX:MaxVectorSize=16`. So, the IR check for vector nodes in `compiler/vectorization/TestPopCountVectorLong.java` would fail. To fix the issue of alignment, the patch corrects their related alignment, just like it did for conversion nodes between different data sizes. After the patch, these Java APIs can be vectorized on 128-bit platforms, as long as the auto-vectorization is profitable. > > 3. 
Fix the incorrect vectorization of `numberOfTrailingZeros/numberOfLeadingZeros()` in aarch64 platforms with more than 128 bits > > Although `Long.NumberOfLeadingZeros/NumberOfTrailingZeros()` can be vectorized on sve platforms when > `-XX:MaxVectorSize=32` or `-XX:MaxVectorSize=64` even before the patch, aarch64 backend didn't provide special vector implementation for Java API and thus the generated code is not correct, like: > > LOOP: > sxtw x13, w12 > add x14, x15, x13, uxtx #3 > add x17, x14, #0x10 > ld1d {z16.d}, p7/z, [x17] > // Incorrectly use integer rbit/clz insn for long type vector > *rbit z16.s, p7/m, z16.s > *clz z16.s, p7/m, z16.s > add x13, x16, x13, uxtx #2 > str q16, [x13, #16] > ... > add w12, w12, #0x20 > cmp w12, w3 > b.lt LOOP > > > It causes a runtime failure of the testcase `compiler/vectorization/TestNumberOfContinuousZeros.java` added in the patch. After the refactoring, the testcase can pass and the code is corrected: > > LOOP: > sxtw x13, w12 > add x14, x15, x13, uxtx #3 > add x17, x14, #0x10 > ld1d {z16.d}, p7/z, [x17] > // Compute with long vector type and convert to int vector type > *rbit z16.d, p7/m, z16.d > *clz z16.d, p7/m, z16.d > *mov z24.d, #0 > *uzp1 z25.s, z16.s, z24.s > add x13, x16, x13, uxtx #2 > str q25, [x13, #16] > ... > add w12, w12, #0x20 > cmp w12, w3 > b.lt LOOP > > > 4. Fix an assertion failure on x86 avx2 platform > > Before, on x86 avx2 platform, there is an assertion failure when C2 tries to vectorize the loops like: > > // long[] ia; > // int[] ic; > for (int i = 0; i < LENGTH; ++i) { > ic[i] = Long.numberOfLeadingZeros(ia[i]); > } > > > X86 backend supports vectorizing `numberOfLeadingZeros()` on avx2 platform, but it uses `evpmovqd()` to do casting for `CountLeadingZerosV`[3], which can only be used when `UseAVX > 2`[4]. After the refactoring, the failure can be fixed naturally. > > Tier 1~3 passed with no new failures on Linux AArch64/X86 platform. 
> > [1] https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#bitCount(long) > https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfTrailingZeros(long) > https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfLeadingZeros(long) > [2] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/LongVector.java#L687 > [3] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/hotspot/cpu/x86/x86.ad#L9418 > [4] https://github.com/openjdk/jdk/blob/fc616588c1bf731150a9d9b80033bb589bcb231f/src/hotspot/cpu/x86/assembler_x86.cpp#L2239 This pull request has now been integrated. Changeset: 4458de95 Author: Fei Gao Committer: Pengfei Li URL: https://git.openjdk.org/jdk/commit/4458de95f845c036c1c8e28df7043e989beaee98 Stats: 303 lines in 11 files changed: 161 ins; 131 del; 11 mod 8297172: Fix some issues of auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/11405 From rkennke at openjdk.org Tue Dec 6 09:46:19 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 6 Dec 2022 09:46:19 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v7] In-Reply-To: References: Message-ID: > Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. 
> > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) > - [x] tier3 (x86_64, x86_32, aarch64) Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: Relax size-check in C2CodeStubList::emit() ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11188/files - new: https://git.openjdk.org/jdk/pull/11188/files/cdedf273..0a681612 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=05-06 Stats: 22 lines in 3 files changed: 5 ins; 14 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/11188.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11188/head:pull/11188 PR: https://git.openjdk.org/jdk/pull/11188 From rkennke at openjdk.org Tue Dec 6 09:46:19 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 6 Dec 2022 09:46:19 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v6] In-Reply-To: <-zstMaSHxifINdQRN7JaO5pUFyxfKnbWLKzKgq_myiw=.dd331269-17d1-4436-afde-0fbefcec5f24@github.com> References: <-zstMaSHxifINdQRN7JaO5pUFyxfKnbWLKzKgq_myiw=.dd331269-17d1-4436-afde-0fbefcec5f24@github.com> Message-ID: On Tue, 6 Dec 2022 06:53:59 GMT, Xiaolin Zheng wrote: > riscv-11188-2.txt That is a very reasonable change, I applied and pushed it. Thank you! @vnkozlov That check already exists, and the recent change improved it to be the maximum stub size instead of fixed size. ------------- PR: https://git.openjdk.org/jdk/pull/11188 From rkennke at openjdk.org Tue Dec 6 09:51:20 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 6 Dec 2022 09:51:20 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v8] In-Reply-To: References: Message-ID: > Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). 
I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. > > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) > - [x] tier3 (x86_64, x86_32, aarch64) Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: Rename C2CodeStub::size() -> max_size() ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11188/files - new: https://git.openjdk.org/jdk/pull/11188/files/0a681612..bfe63000 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=06-07 Stats: 15 lines in 5 files changed: 0 ins; 3 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/11188.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11188/head:pull/11188 PR: https://git.openjdk.org/jdk/pull/11188 From rkennke at openjdk.org Tue Dec 6 09:51:22 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 6 Dec 2022 09:51:22 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v7] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 09:46:19 GMT, Roman Kennke wrote: >> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. 
>> >> Testing: >> - [x] tier1 (x86_64, x86_32, aarch64) >> - [x] tier2 (x86_64, x86_32, aarch64) >> - [x] tier3 (x86_64, x86_32, aarch64) > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > Relax size-check in C2CodeStubList::emit() I need somebody to run this on PPC to figure out the actual stub size for C2SafepointPollStub. @tstuefe could you try this? Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/11188 From aboldtch at openjdk.org Tue Dec 6 09:57:39 2022 From: aboldtch at openjdk.org (Axel Boldt-Christmas) Date: Tue, 6 Dec 2022 09:57:39 GMT Subject: RFR: 8297235: ZGC: assert(regs[i] != regs[j]) failed: Multiple uses of register: rax [v2] In-Reply-To: References: Message-ID: > Tests java/util/stream/test/org/openjdk/tests/java/util/* with -XX:+UseZGC -Xcomp -XX:-TieredCompilation crashes with `assert(regs[i] != regs[j]) failed: Multiple uses of register: rax`. More specifically compilation of java.util.concurrent.ForkJoinTask::awaitDone. > > The reason seems to be that the compare value and the memory input ends up sharing a register. (Uses Unsafe CAS which CAS an object reference into a field of that object, `oldval: rax` and `mem: [rax+offset]`). The Z load barrier stub dispatch implementation require that the reference and reference address occupy distinct registers. In the loadP nodes this is established by marking all but the memory TEMP which results in no sharing. > > This is not possible for the CompareAndSwapP / CompareAndExchangeP nodes as the compare value is an input node. > > The solution proposed here is less than ideal as it makes the CAS nodes require one extra TEMP register, which in the common case is unused. This puts unnecessary extra strain on the register allocation. The problem is that there is no way currently (that I can find) to express in .ad that a memory input must not share registers with a specific other input. 
> > There is an alternative solution for this specific crash which does not use a second TEMP register (see commit: cfd5ced4e97e986fc10c5a8721b543cd3101c58a). It accomplishes this by using the same trick that the aarch64 Z CAS node uses, which is to specify the memory as indirect, so that the address is loaded into a register with LEA. However, from what I can see, this does not guarantee that the address and the reference do not share a register (`oldval: rax` and `mem: [rax]`). So it is theoretically broken (and so is the aarch64 implementation). > > It is unclear to me if there is ever a way for C2 to generate a CAS which compares the address of the field with its content. > > I call on anyone with more knowledge about `adlc` and `C2` for feedback. Specifically, I want to open up a discussion on these points: > * Is there some other way of expressing in the .ad file that a memory input should not share some register? > * If not, is this a worthwhile RFE? It seems to be a pattern used at least in other places in Z. > * Will the indirect input ever share a register with oldval and/or are the aarch64/riscv implementations broken because of this? How about ppc?
> > Testing: linux-x64 zgc tagged tests tier 1-7 and some specific crashing tests with `-XX:+UseZGC -Xcomp -XX:-TieredCompilation` (in: java/util/stream/, java/util/concurrent/) Axel Boldt-Christmas has updated the pull request incrementally with two additional commits since the last revision: - indirect zXChgP as well - indirect alternative ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11410/files - new: https://git.openjdk.org/jdk/pull/11410/files/74f7567b..42a72c1e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11410&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11410&range=00-01 Stats: 20 lines in 1 file changed: 1 ins; 6 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/11410.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11410/head:pull/11410 PR: https://git.openjdk.org/jdk/pull/11410 From chagedorn at openjdk.org Tue Dec 6 10:05:23 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 6 Dec 2022 10:05:23 GMT Subject: RFR: 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph In-Reply-To: <0GuLRAthVakXT6Pb5SVaGKGVdLndP-FKlYhKcUE59RU=.36d08204-b71b-4160-9e19-b4741de55df9@github.com> References: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com> <0GuLRAthVakXT6Pb5SVaGKGVdLndP-FKlYhKcUE59RU=.36d08204-b71b-4160-9e19-b4741de55df9@github.com> Message-ID: On Tue, 6 Dec 2022 09:05:33 GMT, Tobias Hartmann wrote: >> The test cases of this bug reveal the same problem with `PhaseIdealLoop::create_new_if_for_predicate()` as in [JDK-8271954](https://bugs.openjdk.org/browse/JDK-8271954) (reusing pinned data nodes for different UCT paths into the UCT phi - see PR description of https://github.com/openjdk/jdk/pull/5185). 
While JDK-8271954 only fixed the usage of `PhaseIdealLoop::create_new_if_for_predicate()` for one specific case in loop unswitching, we now need this fix for other usages of `PhaseIdealLoop::create_new_if_for_predicate()` as well. I've found failing cases for most of the usages but I think we should always do a proper cloning as originally added with JDK-8271954. This is what I propose with this patch. >> >> To always do this cloning, I've replaced the `UnswitchingAction` with a `rewire_uncommon_proj_phi_inputs` bool that is false by default but can be set if we should only do a rewiring (we can still do the rewiring for the slow loop as previously done with `UnswitchingAction::SlowLoopRewiring`, also see https://github.com/openjdk/jdk/pull/5185 for more details). However, the current fix does not update ctrl for the slow loop nodes - I've fixed that. >> >> I've reused the implementation of `clone_data_nodes_for_fast_loop()` but refactored and split that method into multiple methods to reuse some of them for the ctrl update of the slow loop nodes. >> >> Thanks, >> Christian > > src/hotspot/share/opto/loopPredicate.cpp line 244: > >> 242: // Recursively find all input nodes with the same ctrl. >> 243: Unique_Node_List PhaseIdealLoop::find_nodes_with_same_ctrl(Node* node, const ProjNode* ctrl) { >> 244: Unique_Node_List nodes_with_same_ctrl; > > Did you check if there is a `ResourceMark` close by? Good point, I have not checked that. I think we could add one when calling this method in `clone_nodes_with_same_ctrl()` and in `set_ctrl_of_nodes_with_same_ctrl()`. I'll push an update.
------------- PR: https://git.openjdk.org/jdk/pull/11452 From rkennke at openjdk.org Tue Dec 6 10:07:05 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 6 Dec 2022 10:07:05 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v9] In-Reply-To: References: Message-ID: > Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. > > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) > - [x] tier3 (x86_64, x86_32, aarch64) Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: More renames. 
Duh ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11188/files - new: https://git.openjdk.org/jdk/pull/11188/files/bfe63000..644936d3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=07-08 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/11188.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11188/head:pull/11188 PR: https://git.openjdk.org/jdk/pull/11188 From aboldtch at openjdk.org Tue Dec 6 10:08:32 2022 From: aboldtch at openjdk.org (Axel Boldt-Christmas) Date: Tue, 6 Dec 2022 10:08:32 GMT Subject: RFR: 8297235: ZGC: assert(regs[i] != regs[j]) failed: Multiple uses of register: rax [v2] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 09:57:39 GMT, Axel Boldt-Christmas wrote: >> Tests java/util/stream/test/org/openjdk/tests/java/util/* with -XX:+UseZGC -Xcomp -XX:-TieredCompilation crashes with `assert(regs[i] != regs[j]) failed: Multiple uses of register: rax`. More specifically compilation of java.util.concurrent.ForkJoinTask::awaitDone. >> >> The reason seems to be that the compare value and the memory input ends up sharing a register. (Uses Unsafe CAS which CAS an object reference into a field of that object, `oldval: rax` and `mem: [rax+offset]`). The Z load barrier stub dispatch implementation require that the reference and reference address occupy distinct registers. In the loadP nodes this is established by marking all but the memory TEMP which results in no sharing. >> >> This is not possible for the CompareAndSwapP / CompareAndExchangeP nodes as the compare value is an input node. >> >> The solution proposed here is less than ideal as it makes the CAS nodes require one extra TEMP register, which in the common case is unused. This puts unnecessary extra strain on the register allocation. 
The problem is that there is currently no way (that I can find) to express in .ad that a memory input must not share registers with a specific other input. >> >> There is an alternative solution for this specific crash which does not use a second TEMP register (see commit: cfd5ced4e97e986fc10c5a8721b543cd3101c58a). It accomplishes this by using the same trick that the aarch64 Z CAS node uses, which is to specify the memory as indirect, so that the address is loaded into a register with LEA. However, from what I can see, this does not guarantee that the address and the reference do not share a register (`oldval: rax` and `mem: [rax]`). So it is theoretically broken (and so is the aarch64 implementation). >> >> It is unclear to me if there is ever a way for C2 to generate a CAS which compares the address of the field with its content. >> >> I call on anyone with more knowledge about `adlc` and `C2` for feedback. Specifically, I want to open up a discussion on these points: >> * Is there some other way of expressing in the .ad file that a memory input should not share some register? >> * If not, is this a worthwhile RFE? It seems to be a pattern used at least in other places in Z. >> * Will the indirect input ever share a register with oldval and/or are the aarch64/riscv implementations broken because of this? How about ppc? >> >> Testing: linux-x64 zgc tagged tests tier 1-7 and some specific crashing tests with `-XX:+UseZGC -Xcomp -XX:-TieredCompilation` (in: java/util/stream/, java/util/concurrent/) > > Axel Boldt-Christmas has updated the pull request incrementally with two additional commits since the last revision: > > - indirect zXChgP as well > - indirect alternative Changed all nodes to `indirect` memory inputs to ensure disjoint registers.
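As an illustration of the source-level shape discussed above, a CAS on a reference field whose expected value is the very object that holds the field (`oldval: rax`, `mem: [rax+offset]`), here is a small hypothetical Java sketch. The class, field and method names are invented for this example, and it uses a `VarHandle` rather than `Unsafe`; it is not taken from `ForkJoinTask` and is not claimed to reproduce the assert.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class SelfLinkedCas {
    // Hypothetical node type; a node whose 'next' points to itself is "self-linked".
    static final class Node {
        volatile Node next;
    }

    private static final VarHandle NEXT;
    static {
        try {
            NEXT = MethodHandles.lookup().findVarHandle(Node.class, "next", Node.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // CAS on node.next where the expected value is 'node' itself, so the base
    // object of the memory operand and the compare value are the same reference.
    static boolean unlinkSelf(Node node, Node replacement) {
        return NEXT.compareAndSet(node, node, replacement);
    }

    public static void main(String[] args) {
        Node node = new Node();
        node.next = node;
        System.out.println(unlinkSelf(node, null)); // true: next was node
        System.out.println(node.next == null);      // true
        System.out.println(unlinkSelf(node, null)); // false: next is null now
    }
}
```

In a case like this the register allocator has every reason to keep `node` in a single register for both roles, which is the sharing the barrier stub cannot tolerate.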
------------- PR: https://git.openjdk.org/jdk/pull/11410 From chagedorn at openjdk.org Tue Dec 6 10:08:47 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 6 Dec 2022 10:08:47 GMT Subject: RFR: 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph In-Reply-To: <0GuLRAthVakXT6Pb5SVaGKGVdLndP-FKlYhKcUE59RU=.36d08204-b71b-4160-9e19-b4741de55df9@github.com> References: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com> <0GuLRAthVakXT6Pb5SVaGKGVdLndP-FKlYhKcUE59RU=.36d08204-b71b-4160-9e19-b4741de55df9@github.com> Message-ID: On Tue, 6 Dec 2022 09:06:46 GMT, Tobias Hartmann wrote: > Great job in finding tests for (most of) theses cases! The fix looks reasonable to me. Thanks Tobias for your review! > Since this is a regression from [JDK-8252372](https://bugs.openjdk.org/browse/JDK-8252372), the bug should have affects version 17, right? That's right. I've updated the bug (the attached test also fails with JDK 17+35). ------------- PR: https://git.openjdk.org/jdk/pull/11452 From fyang at openjdk.org Tue Dec 6 10:19:14 2022 From: fyang at openjdk.org (Fei Yang) Date: Tue, 6 Dec 2022 10:19:14 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v9] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 10:07:05 GMT, Roman Kennke wrote: >> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. 
>> >> Testing: >> - [x] tier1 (x86_64, x86_32, aarch64) >> - [x] tier2 (x86_64, x86_32, aarch64) >> - [x] tier3 (x86_64, x86_32, aarch64) > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > More renames. Duh Updated change LGTM. Thanks. ------------- Marked as reviewed by fyang (Reviewer). PR: https://git.openjdk.org/jdk/pull/11188 From chagedorn at openjdk.org Tue Dec 6 10:25:35 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 6 Dec 2022 10:25:35 GMT Subject: RFR: 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph [v2] In-Reply-To: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com> References: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com> Message-ID: <2pH3MQY6NAKPVlJBBwErxpjLM_ihnrKu7Z89fC_vd3E=.9868d268-55dd-4994-9103-3704e62a7755@github.com> > The test cases of this bug reveal the same problem with `PhaseIdealLoop::create_new_if_for_predicate()` as in [JDK-8271954](https://bugs.openjdk.org/browse/JDK-8271954) (reusing pinned data nodes for different UCT paths into the UCT phi - see PR description of https://github.com/openjdk/jdk/pull/5185). While JDK-8271954 only fixed the usage of `PhaseIdealLoop::create_new_if_for_predicate()` for one specific case in loop unswitching, we now need this fix for other usages of `PhaseIdealLoop::create_new_if_for_predicate()` as well. I've found failing cases for most of the usages but I think we should always do a proper cloning as originally added with JDK-8271954. This is what a propose with this patch. 
> > To always do this cloning, I've replaced the `UnswitchingAction` by a `rewire_uncommon_proj_phi_inputs` bool that is false by default but can be set if we should only do a rewiring (we can still do the rewiring for the slow loop as previously done with `UnswitchingAction::SlowLoopRewiring`, also see https://github.com/openjdk/jdk/pull/5185 for more details). However, the current fix does not update ctrl for the slow loop nodes - I've fixed that. > > I've reused the implementation of `clone_data_nodes_for_fast_loop()` but refactored and split that method into multiple methods to reuse some of them for the ctrl update of the slow loop nodes. > > Thanks, > Christian Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: - Add ResourceMark - Merge branch 'master' into JDK-8290850 - Fix whitespaces - 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11452/files - new: https://git.openjdk.org/jdk/pull/11452/files/42f98db5..dc1074d0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11452&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11452&range=00-01 Stats: 103803 lines in 1610 files changed: 47481 ins; 38577 del; 17745 mod Patch: https://git.openjdk.org/jdk/pull/11452.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11452/head:pull/11452 PR: https://git.openjdk.org/jdk/pull/11452 From thartmann at openjdk.org Tue Dec 6 10:34:02 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 6 Dec 2022 10:34:02 GMT Subject: RFR: 8297642: PhaseIdealLoop::only_has_infinite_loops must detect all loops that never lead to termination In-Reply-To: References: Message-ID: On Fri, 2 Dec 2022 08:11:06 GMT, Emanuel Peter wrote: > 
**Will hold this back until JDK21**, unless we decide it is a regression-fix for [JDK-8294217](https://bugs.openjdk.org/browse/JDK-8294217). The problem is only a not-quite-correct assert. But the problem is not limited to infinite loops, as the example below shows it can happen with reducible loops. > > **Background:** > We have an assert that checks that `has_loops` is true when it should be. If we have `has_loops == false` even though there are loops, we will not perform loop-opts in `Compile::Optimize`. > > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/loopnode.cpp#L4285-L4293 > > Generally, we want to verify that if we just found loops (`_ltree_root->_child != NULL`), then `has_loops == true`. > There are a few cases where we do not care if we miss loop-opts: > - We only have infinite loops (`only_has_infinite_loops()`). Infinite loops never terminate anyway, so why make them faster? Plus, a loop is only infinite if it has no loop-exit other than a `NeverBranch` exit; even uncommon traps, loop-limit checks etc. are exits. Thus, if a loop does anything interesting, it probably is not such a "true infinite loop". They can be more easily forced to occur by setting `-XX:PerMethodTrapLimit=0`. > - We have only exception edges. > > Note that once we check the assert, we update `has_loops`. So if all loops disappeared, we avoid doing loop-opts henceforth. > > **Current implementation of PhaseIdealLoop::only_has_infinite_loops** > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/loopnode.cpp#L4183-L4185 > > We check for loop exits; if there is one, the loop should not be infinite. > > **The Problem** > > An infinite loop can have an inner loop that subsequently loses its exit. It becomes its own infinite loop, and floats out of the outer loop. Where the outer loop enters into the former inner loop, we now have a loop-exit for the outer loop.
The next time we run `build_loop_tree` and check the assert, it can fail, as `PhaseIdealLoop::only_has_infinite_loops` finds that new loop-exit from outer to inner loop. > > Example: `TestOnlyInfiniteLoops::test_simple` > > Nested infinite loop before loop-opts: > > > After `build_loop_tree`, the outer loop is detected as infinite, and `NeverBranch` is inserted. No loop is attached to loop-tree, as we do not attach newly discovered infinite loops. We will set `has_loops == false` after first loop-opts round. > > > During IGVN of first loop-opts round, some edges die. `88 IfTrue` is dominated by `52 IfTrue` (dominator info only becomes present during loop-opts). The outer loop now exits into the inner loop. > > > The second loop-opts round detects the former inner loop as an infinite loop and inserts a NeverBranch. Once we run the assert, we see that we have `has_loops == false`, but `PhaseIdealLoop::only_has_infinite_loops` finds the former outer loop's exit. > > > **Solution** > If we ever only have infinite loops, then there will never be a way to get from any of those loops down to Root, except through a NeverBranch exit. So even if such an (outer) infinite loop ever has an exit, that exit cannot ever lead to Root, other than through a NeverBranch exit. Thus, it is ok to still consider that loop as "infinite", even though it itself has an exit - that exit will never lead to termination. > Thus, I changed the `PhaseIdealLoop::only_has_infinite_loops` to check if any of the loops ever connect down to Root, except through NeverBranch nodes. > > **Alternative Fix** > An alternative idea to my fix here: just replace the infinite loop with an uncommon trap, and if the infinite loop is ever hit revert back to the interpreter. If we do not care to optimize infinite loops, then why even compile them? > Advantages of that idea: No need for `NeverBranch`, no need for special-casing infinite loops.
> > I have another bug where assumptions are not true, because of infinite loops, and especially infinite loops not being attached to the loop-tree [JDK-8296318](https://bugs.openjdk.org/browse/JDK-8296318) > > I'm looking forward to your feedback, > Emanuel Nice summary. Looks good to me otherwise. @rwestrel should also have a look. test/hotspot/jtreg/compiler/loopopts/TestOnlyInfiniteLoopsMain.java line 38: > 36: * @compile TestOnlyInfiniteLoops.jasm > 37: * @summary Nested irreducible loops, where the inner loop floats out of the outer > 38: * @run main/othervm -XX:+UnlockExperimentalVMOptions `-XX:+UnlockExperimentalVMOptions` is not required in both `@run`, right? test/hotspot/jtreg/compiler/loopopts/TestOnlyInfiniteLoopsMain.java line 40: > 38: * @run main/othervm -XX:+UnlockExperimentalVMOptions > 39: * -XX:CompileCommand=compileonly,TestOnlyInfiniteLoops::test* > 40: * -XX:-TieredCompilation -Xbatch -Xcomp `-Xcomp` implies `-Xbatch`: https://github.com/openjdk/jdk/blob/4458de95f845c036c1c8e28df7043e989beaee98/src/hotspot/share/runtime/arguments.cpp#L1445-L1447 ------------- Marked as reviewed by thartmann (Reviewer). 
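The solution quoted above (a loop still counts as "infinite" unless it can reach Root by some path that avoids NeverBranch exits) amounts to a plain reachability check. A minimal sketch, assuming a toy control-flow graph rather than C2's node graph; `Edge`, `Graph` and `reaches_root` are hypothetical names for illustration and are not taken from the patch:

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Toy control-flow edge. The flag marks the artificial exit that a
// NeverBranch provides out of an infinite loop (an assumption of this sketch).
struct Edge {
  std::size_t to;
  bool never_branch_exit;
};

using Graph = std::vector<std::vector<Edge>>;

// Returns true if `root` can be reached from `start` without ever taking a
// NeverBranch exit, i.e. if the loop at `start` can really terminate.
bool reaches_root(const Graph& g, std::size_t start, std::size_t root) {
  std::vector<bool> seen(g.size(), false);
  std::queue<std::size_t> work;
  work.push(start);
  seen[start] = true;
  while (!work.empty()) {
    std::size_t n = work.front();
    work.pop();
    if (n == root) {
      return true;
    }
    for (const Edge& e : g[n]) {
      if (e.never_branch_exit || seen[e.to]) {
        continue;  // skip artificial exits and already visited nodes
      }
      seen[e.to] = true;
      work.push(e.to);
    }
  }
  return false;
}
```

With node 0 as the outer loop, node 1 as the former inner loop and node 2 as Root, the outer loop has an exit (into node 1), yet the check still reports that nothing terminates as long as the only edge into Root is a NeverBranch exit.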
PR: https://git.openjdk.org/jdk/pull/11473 From thartmann at openjdk.org Tue Dec 6 11:25:32 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 6 Dec 2022 11:25:32 GMT Subject: RFR: 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph [v2] In-Reply-To: <2pH3MQY6NAKPVlJBBwErxpjLM_ihnrKu7Z89fC_vd3E=.9868d268-55dd-4994-9103-3704e62a7755@github.com> References: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com> <2pH3MQY6NAKPVlJBBwErxpjLM_ihnrKu7Z89fC_vd3E=.9868d268-55dd-4994-9103-3704e62a7755@github.com> Message-ID: On Tue, 6 Dec 2022 10:25:35 GMT, Christian Hagedorn wrote: >> The test cases of this bug reveal the same problem with `PhaseIdealLoop::create_new_if_for_predicate()` as in [JDK-8271954](https://bugs.openjdk.org/browse/JDK-8271954) (reusing pinned data nodes for different UCT paths into the UCT phi - see PR description of https://github.com/openjdk/jdk/pull/5185). While JDK-8271954 only fixed the usage of `PhaseIdealLoop::create_new_if_for_predicate()` for one specific case in loop unswitching, we now need this fix for other usages of `PhaseIdealLoop::create_new_if_for_predicate()` as well. I've found failing cases for most of the usages but I think we should always do a proper cloning as originally added with JDK-8271954. This is what a propose with this patch. >> >> To always do this cloning, I've replaced the `UnswitchingAction` by a `rewire_uncommon_proj_phi_inputs` bool that is false by default but can be set if we should only do a rewiring (we can still do the rewiring for the slow loop as previously done with `UnswitchingAction::SlowLoopRewiring`, also see https://github.com/openjdk/jdk/pull/5185 for more details). However, the current fix does not update ctrl for the slow loop nodes - I've fixed that. 
>> >> I've reused the implementation of `clone_data_nodes_for_fast_loop()` but refactored and split that method into multiple methods to reuse some of them for the ctrl update of the slow loop nodes. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Add ResourceMark > - Merge branch 'master' into JDK-8290850 > - Fix whitespaces > - 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph Thanks for updating. Looks good. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11452 From thartmann at openjdk.org Tue Dec 6 11:39:16 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 6 Dec 2022 11:39:16 GMT Subject: RFR: 8296389: C2: PhaseCFG::convert_NeverBranch_to_Goto must handle both orders of successors In-Reply-To: References: Message-ID: On Fri, 2 Dec 2022 12:48:31 GMT, Emanuel Peter wrote: > **Targeted for JDK21**, since this is not a new regression, but rather an old bug. P3 because it creates a `SIGSEGV` in product builds. > > The code in `PhaseCFG::convert_NeverBranch_to_Goto` looks like it is ready to have `idx == 1`, but it is not. > > We would read `succ` from `_succs[1]`. > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L626 > > Then overwrite `_succs[0]` with `succ`, and shorten the array. > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L635-L636 > > And finally attempt to read `dead` from `_succs[0]`, where the dead block used to be, but was just overwritten.
> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L645 > > **Solution** > Read `dead` before overwriting it. I also made it more robust by going via the projections, and not assuming that the projections and successors are ordered equally (though that is probably guaranteed by the matching traversal). > > **Why did we never hit this bug before?** > Normal case: during matching, "succ" projection is added as output of NeverBranch before the "dead" projection leading to Halt. Thus, the outputs of NeverBranch are normally [[ "succ", "dead" ]], hence `idx == 0`. > Details: During DFS, usually we go from Halt to NeverBranch. Then via Region/Loop, take backedge, and find the "succ" edge. We already have its inputs (NeverBranch), thus we can now post-visit the live edge, and attach it to the NeverBranch first. Later, once we have processed the whole infinite loop, we post-visit out of NeverBranch to the "dead" projection edge, which we attach second. > > Rare case: "dead" projection is first attached to NeverBranch, and "succ" projection is added second. We have [[ "dead", "succ" ]], hence `idx == 1`. > We have a peeled infinite loop. The NeverBranch of the peeled iteration is first visited via the "dead" projection from HaltNode. Since the peeled iteration has no backedge, we do not visit the "succ" projection yet, but instead attach "dead" projection to HaltNode already once we are done visiting everything above. Later, we come from the peeled loop's NeverBranch exit, to the "succ" projection of the peeled iteration's NeverBranch, and attach the "succ" projection. > > ![image](https://user-images.githubusercontent.com/32593061/205299027-0e8e1d46-a49c-48c6-81b4-dfe83d8236ec.png) Changes requested by thartmann (Reviewer). 
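The read-after-overwrite ordering described in the quoted report can be reproduced outside HotSpot. A minimal sketch, assuming a two-slot array of strings stands in for `_succs` and that a shortened array keeps its old second slot readable; `ToyBlock`, `make_block`, `convert_buggy` and `convert_fixed` are hypothetical names and not the actual `PhaseCFG` code:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Toy block with two successors; strings replace Block* for readability.
struct ToyBlock {
  std::array<std::string, 2> succs;
  std::size_t num_succs;
};

struct Result {
  std::string kept;  // successor that remains after the conversion
  std::string dead;  // successor the conversion believes is dead
};

ToyBlock make_block(const std::string& first, const std::string& second) {
  ToyBlock b;
  b.succs[0] = first;
  b.succs[1] = second;
  b.num_succs = 2;
  return b;
}

// Ordering from the report: overwrite slot 0, shorten, then read "dead".
Result convert_buggy(ToyBlock b, std::size_t idx) {
  std::string succ = b.succs[idx];
  b.succs[0] = succ;                    // clobbers the dead block when idx == 1
  b.num_succs = 1;
  std::string dead = b.succs[1 - idx];  // idx == 1: reads what was just written
  return {b.succs[0], dead};
}

// Fix from the report: read "dead" before anything is overwritten.
Result convert_fixed(ToyBlock b, std::size_t idx) {
  std::string succ = b.succs[idx];
  std::string dead = b.succs[1 - idx];
  b.succs[0] = succ;
  b.num_succs = 1;
  return {b.succs[0], dead};
}
```

With successors ordered `["halt", "live"]` and `idx == 1`, `convert_buggy` reports the live block as dead, while `convert_fixed` reports `"halt"`; with the usual order and `idx == 0`, both agree.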
src/hotspot/share/opto/block.cpp line 626: > 624: int end_idx = b->end_idx(); > 625: int taken_idx = b->get_node(end_idx+1)->as_Proj()->_con; > 626: ProjNode* alwaysTaken = b->get_node(end_idx + 1 + taken_idx)->as_Proj(); I find this code rather confusing. Since it's guaranteed that `alwaysTaken->_con == 0`, can't we simply do something like this? ProjNode* alwaysTaken = b->get_node(end_idx)->as_MultiBranch()->proj_out(0); Block* succ = get_block_for_node(alwaysTaken->unique_ctrl_out_or_null()); test/hotspot/jtreg/compiler/loopopts/TestPhaseCFGNeverBranchToGotoMain.java line 28: > 26: * @bug 8296389 > 27: * @summary Peeling of Irreducible loop can lead to NeverBranch being visited from either side > 28: * @run main/othervm -Xcomp -Xbatch -XX:-TieredCompilation -XX:PerMethodTrapLimit=0 Suggestion: * @run main/othervm -Xcomp -XX:-TieredCompilation -XX:PerMethodTrapLimit=0 `-Xcomp` implies `-Xbatch` test/hotspot/jtreg/compiler/loopopts/TestPhaseCFGNeverBranchToGotoMain.java line 38: > 36: * @compile TestPhaseCFGNeverBranchToGoto.jasm > 37: * @summary Peeling of Irreducible loop can lead to NeverBranch being visited from either side > 38: * @run main/othervm -Xcomp -Xbatch -XX:-TieredCompilation -XX:PerMethodTrapLimit=0 Suggestion: * @run main/othervm -Xcomp -XX:-TieredCompilation -XX:PerMethodTrapLimit=0 test/hotspot/jtreg/compiler/loopopts/TestPhaseCFGNeverBranchToGotoMain.java line 48: > 46: test(false, false); > 47: } > 48: public static void test(boolean flag1, boolean flag2) { Suggestion: } public static void test(boolean flag1, boolean flag2) { ------------- PR: https://git.openjdk.org/jdk/pull/11481 From rkennke at openjdk.org Tue Dec 6 11:41:12 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 6 Dec 2022 11:41:12 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v10] In-Reply-To: References: Message-ID: > Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp)
and another for nmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. > > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) > - [x] tier3 (x86_64, x86_32, aarch64) Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: Update copyright notices ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11188/files - new: https://git.openjdk.org/jdk/pull/11188/files/644936d3..fed41e92 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=08-09 Stats: 3 lines in 3 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/11188.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11188/head:pull/11188 PR: https://git.openjdk.org/jdk/pull/11188 From chagedorn at openjdk.org Tue Dec 6 12:04:11 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 6 Dec 2022 12:04:11 GMT Subject: RFR: 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph [v2] In-Reply-To: <2pH3MQY6NAKPVlJBBwErxpjLM_ihnrKu7Z89fC_vd3E=.9868d268-55dd-4994-9103-3704e62a7755@github.com> References: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com> <2pH3MQY6NAKPVlJBBwErxpjLM_ihnrKu7Z89fC_vd3E=.9868d268-55dd-4994-9103-3704e62a7755@github.com> Message-ID: On Tue, 6 Dec 2022 10:25:35 GMT, Christian Hagedorn wrote: >> The test cases of this bug reveal the same problem with `PhaseIdealLoop::create_new_if_for_predicate()` as in [JDK-8271954](https://bugs.openjdk.org/browse/JDK-8271954) (reusing pinned data nodes
for different UCT paths into the UCT phi - see PR description of https://github.com/openjdk/jdk/pull/5185). While JDK-8271954 only fixed the usage of `PhaseIdealLoop::create_new_if_for_predicate()` for one specific case in loop unswitching, we now need this fix for other usages of `PhaseIdealLoop::create_new_if_for_predicate()` as well. I've found failing cases for most of the usages but I think we should always do a proper cloning as originally added with JDK-8271954. This is what a propose with this patch. >> >> To always do this cloning, I've replaced the `UnswitchingAction` by a `rewire_uncommon_proj_phi_inputs` bool that is false by default but can be set if we should only do a rewiring (we can still do the rewiring for the slow loop as previously done with `UnswitchingAction::SlowLoopRewiring`, also see https://github.com/openjdk/jdk/pull/5185 for more details). However, the current fix does not update ctrl for the slow loop nodes - I've fixed that. >> >> I've reused the implementation of `clone_data_nodes_for_fast_loop()` but refactored and split that method into multiple methods to reuse some of them for the ctrl update of the slow loop nodes. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Add ResourceMark > - Merge branch 'master' into JDK-8290850 > - Fix whitespaces > - 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph Thanks Tobias! 
------------- PR: https://git.openjdk.org/jdk/pull/11452 From stuefe at openjdk.org Tue Dec 6 12:11:09 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Tue, 6 Dec 2022 12:11:09 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v7] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 09:46:32 GMT, Roman Kennke wrote: > I need somebody to run this on PPC to figure out the actual stub size for C2SafepointPollStub. @tstuefe could you try this? Thanks! Sorry, I am snowed in, just before vacation. @TheRealMDoerr ? ------------- PR: https://git.openjdk.org/jdk/pull/11188 From eosterlund at openjdk.org Tue Dec 6 12:55:08 2022 From: eosterlund at openjdk.org (Erik Österlund) Date: Tue, 6 Dec 2022 12:55:08 GMT Subject: RFR: 8297235: ZGC: assert(regs[i] != regs[j]) failed: Multiple uses of register: rax [v2] In-Reply-To: References: Message-ID: <-arq1zqCxRuduy61u2dK3Xw_5pqAtK4mqON9DUYOexY=.62173499-5730-4f75-a293-f7c8c1060cab@github.com> On Tue, 6 Dec 2022 09:57:39 GMT, Axel Boldt-Christmas wrote: >> Tests java/util/stream/test/org/openjdk/tests/java/util/* with -XX:+UseZGC -Xcomp -XX:-TieredCompilation crashes with `assert(regs[i] != regs[j]) failed: Multiple uses of register: rax`. More specifically compilation of java.util.concurrent.ForkJoinTask::awaitDone. >> >> The reason seems to be that the compare value and the memory input ends up sharing a register. (Uses Unsafe CAS which CAS an object reference into a field of that object, `oldval: rax` and `mem: [rax+offset]`). The Z load barrier stub dispatch implementation require that the reference and reference address occupy distinct registers. In the loadP nodes this is established by marking all but the memory TEMP which results in no sharing. >> >> This is not possible for the CompareAndSwapP / CompareAndExchangeP nodes as the compare value is an input node.
>> >> The solution proposed here is less than ideal as it makes the CAS nodes require one extra TEMP register, which in the common case is unused. This puts unnecessary extra strain on the register allocation. The problem is that there is no way currently (that I can find) to express in .ad that a memory input must not share registers with a specific other input. >> >> There is an alternative solution for this specific crash which does not use a second TEMP register (see commit: cfd5ced4e97e986fc10c5a8721b543cd3101c58a). It accomplishes this by using the same trick that the aarch64 Z CAS node uses, which is to specify the memory as indirect, which results in the address being LEA'd into a register. However, from what I can see this does not guarantee that the address and the reference do not share a register (`oldval: rax` and `mem: [rax]`). So it is theoretically broken (and so is the aarch64 implementation). >> >> It is unclear to me if there is ever a way for C2 to generate a CAS which compares the address of the field with its content. >> >> I call on anyone with more knowledge about `adlc` and `C2` for feedback. And specifically I want to open up a discussion with these points: >> * Is there some other way of expressing in the .ad file that a memory input should not share some register? >> * If not, is this a worthwhile RFE? As it seems to be a pattern used at least in other places in Z. >> * Will the indirect input ever share a register with oldval and/or are the aarch64/riscv implementations broken because of this? How about ppc? >> >> Testing: linux-x64 zgc tagged tests tier 1-7 and some specific crashing tests with `-XX:+UseZGC -Xcomp -XX:-TieredCompilation` (in: java/util/stream/, java/util/concurrent/) > > Axel Boldt-Christmas has updated the pull request incrementally with two additional commits since the last revision: > > - indirect zXChgP as well > - indirect alternative Looks good! ------------- Marked as reviewed by eosterlund (Reviewer).
PR: https://git.openjdk.org/jdk/pull/11410 From chagedorn at openjdk.org Tue Dec 6 14:11:23 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 6 Dec 2022 14:11:23 GMT Subject: RFR: 8257197: Add additional verification code to PhaseCCP In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 08:02:12 GMT, Emanuel Peter wrote: > **Targeted for JDK-21.** > > We have had many bugs that could be traced back to missing optimizations during PhaseCCP. > Often the problem is that a node `x` has an input (of input of input) `y` which is modified, but `y` does not notify `x` (does not push it to the worklist). For one this is a missed opportunity to optimize, but it can also lead to failed assumptions later: often we assume that all that can be optimized is already optimized. > This verification helps us debug faster, and can also help when `Value` optimizations suddenly do further-reaching traversals, which would require further-reaching notifications. > > Sadly, the verification is not total: we have some exceptions. Especially for `LoadNode`, which performs a walk up the memory inputs, which can go arbitrarily far, to look for related `StoreNode`s. Notification would thus have to put very many nodes on the worklist. The question is if the potential additional optimization is worth the compile-time. If we decide yes, then we might want to implement a listener-style notification: when a node visits inputs during `Value`, it could subscribe to all (or the relevant) visited input-nodes for future updates. Currently, we just mostly do fixed 1-hop or 2-hop notification of output nodes. > > FYI: I plan to do a similar verification, and a refactoring of `PhaseCCP::push_child_nodes_to_worklist` and `PhaseIterGVN::add_users_to_worklist` in [JDK-8298094](https://bugs.openjdk.org/browse/JDK-8298094). This is a nice additional verification!
I've tried your patch out and it would have indeed found https://github.com/openjdk/jdk19/pull/65 and https://github.com/openjdk/jdk/pull/11448 (in both bugs we failed to re-add nodes to the CCP worklist). It's a simple but powerful verification pass, so I think it is okay to add some exceptions in order to make this as stable as possible to avoid false positives/okay-to-miss optimizations. src/hotspot/share/opto/phaseX.cpp line 1829: > 1827: continue; // ignore long widen > 1828: } > 1829: } Could these two cases be merged by using the common super type `TypeInteger`, `lo/hi_as_long()` and `isa_integer(T_INT/T_LONG)`? src/hotspot/share/opto/phaseX.cpp line 1839: > 1837: n->dump_bfs(1, 0, ""); > 1838: tty->print_cr("Current type:"); > 1839: told->dump_on(tty); Just an idea: We could think about adding a small padding like 2 whitespaces before the type dump to better emphasize it. But you can also leave it like that. src/hotspot/share/opto/phaseX.cpp line 1840: > 1838: tty->print_cr("Current type:"); > 1839: told->dump_on(tty); > 1840: tty->print_cr(""); You can directly use: Suggestion: tty->cr(); src/hotspot/share/opto/phaseX.cpp line 1843: > 1841: tty->print_cr("Optimized type:"); > 1842: tnew->dump_on(tty); > 1843: tty->print_cr(""); I suggest adding another new line here to better separate in case we report multiple nodes. src/hotspot/share/opto/phaseX.cpp line 1848: > 1846: } > 1847: // If you get this assert, check if the node was notified of changes in > 1848: // the inputs. See PhaseCCP::push_child_nodes_to_worklist I suggest rephrasing this to also mention the possibility that we might have found another exception that we've missed before and cannot reliably handle like `Load` nodes. This could be something like: // If we get this assert, check why the reported nodes were not processed again in CCP. // We should either make sure that these nodes are properly added back to the CCP worklist
// We should either make sure that these nodes are properly added back to the CCP worklist // in PhaseCCP::push_child_nodes_to_worklist() to update their type or add an exception // in the verification code above if that is not possible for some reason (like Load nodes). ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11529 From qamai at openjdk.org Tue Dec 6 14:24:39 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 6 Dec 2022 14:24:39 GMT Subject: RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v14] In-Reply-To: References: Message-ID: > This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and do a `CmpI` to get back the flags. As a result, `VectorMask::alltrue` is compiled into machine codes: > > vptest xmm0, xmm1 > jb if_true > if_false: > > instead of: > > vptest xmm0, xmm1 > setb r10 > movzbl r10 > testl r10 > jne if_true > if_false: > > The results of `jdk.incubator.vector.ArrayMismatchBenchmark` shows noticeable improvements: > > Before After > Benchmark Prefix Size Mode Cnt Score Error Score Error Units Change > ArrayMismatchBenchmark.mismatchVectorByte 0.5 9 thrpt 10 217345.383 ? 8316.444 222279.381 ? 2660.983 ops/ms +2.3% > ArrayMismatchBenchmark.mismatchVectorByte 0.5 257 thrpt 10 113918.406 ? 1618.836 116268.691 ? 1291.899 ops/ms +2.1% > ArrayMismatchBenchmark.mismatchVectorByte 0.5 100000 thrpt 10 702.066 ? 72.862 797.806 ? 16.429 ops/ms +13.6% > ArrayMismatchBenchmark.mismatchVectorByte 1.0 9 thrpt 10 146096.564 ? 2401.258 145338.910 ? 687.453 ops/ms -0.5% > ArrayMismatchBenchmark.mismatchVectorByte 1.0 257 thrpt 10 60598.181 ? 1259.397 69041.519 ? 
1073.156 ops/ms +13.9% > ArrayMismatchBenchmark.mismatchVectorByte 1.0 100000 thrpt 10 316.814 ? 10.975 408.770 ? 5.281 ops/ms +29.0% > ArrayMismatchBenchmark.mismatchVectorDouble 0.5 9 thrpt 10 195674.549 ? 1200.166 188482.433 ? 1872.076 ops/ms -3.7% > ArrayMismatchBenchmark.mismatchVectorDouble 0.5 257 thrpt 10 44357.169 ? 473.013 42293.411 ? 2838.255 ops/ms -4.7% > ArrayMismatchBenchmark.mismatchVectorDouble 0.5 100000 thrpt 10 68.199 ? 5.410 67.628 ? 3.241 ops/ms -0.8% > ArrayMismatchBenchmark.mismatchVectorDouble 1.0 9 thrpt 10 107722.450 ? 1677.607 111060.400 ? 982.230 ops/ms +3.1% > ArrayMismatchBenchmark.mismatchVectorDouble 1.0 257 thrpt 10 16692.645 ? 1002.599 21440.506 ? 1618.266 ops/ms +28.4% > ArrayMismatchBenchmark.mismatchVectorDouble 1.0 100000 thrpt 10 32.984 ? 0.548 33.202 ? 2.365 ops/ms +0.7% > ArrayMismatchBenchmark.mismatchVectorInt 0.5 9 thrpt 10 335458.217 ? 3154.842 379944.254 ? 5703.134 ops/ms +13.3% > ArrayMismatchBenchmark.mismatchVectorInt 0.5 257 thrpt 10 58505.302 ? 786.312 56721.368 ? 2497.052 ops/ms -3.0% > ArrayMismatchBenchmark.mismatchVectorInt 0.5 100000 thrpt 10 133.037 ? 11.415 139.537 ? 4.667 ops/ms +4.9% > ArrayMismatchBenchmark.mismatchVectorInt 1.0 9 thrpt 10 117943.802 ? 2281.349 112409.365 ? 2110.055 ops/ms -4.7% > ArrayMismatchBenchmark.mismatchVectorInt 1.0 257 thrpt 10 27060.015 ? 795.619 33756.613 ? 826.533 ops/ms +24.7% > ArrayMismatchBenchmark.mismatchVectorInt 1.0 100000 thrpt 10 57.558 ? 8.927 66.951 ? 4.381 ops/ms +16.3% > ArrayMismatchBenchmark.mismatchVectorLong 0.5 9 thrpt 10 182963.715 ? 1042.497 182438.405 ? 2120.832 ops/ms -0.3% > ArrayMismatchBenchmark.mismatchVectorLong 0.5 257 thrpt 10 36672.215 ? 614.821 35397.398 ? 1609.235 ops/ms -3.5% > ArrayMismatchBenchmark.mismatchVectorLong 0.5 100000 thrpt 10 66.438 ? 2.142 65.427 ? 2.270 ops/ms -1.5% > ArrayMismatchBenchmark.mismatchVectorLong 1.0 9 thrpt 10 110393.047 ? 497.853 115165.845 ? 
5381.674 ops/ms +4.3% > ArrayMismatchBenchmark.mismatchVectorLong 1.0 257 thrpt 10 14720.765 ? 661.350 19871.096 ? 201.464 ops/ms +35.0% > ArrayMismatchBenchmark.mismatchVectorLong 1.0 100000 thrpt 10 30.760 ? 0.821 31.933 ? 1.352 ops/ms +3.8% > > I have not been able to conduct throughout testing on AVX512 and Aarch64 so any help would be invaluable. Thank you very much. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: fix test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9855/files - new: https://git.openjdk.org/jdk/pull/9855/files/1fec3d30..8d9ebed9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9855&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9855&range=12-13 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/9855.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9855/head:pull/9855 PR: https://git.openjdk.org/jdk/pull/9855 From qamai at openjdk.org Tue Dec 6 14:35:17 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 6 Dec 2022 14:35:17 GMT Subject: RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v13] In-Reply-To: References: <955ScdreoJQ7PG5cXUmly_giKjOJx8ouU8oy1DX_GEA=.7c59dbbb-4a3b-4f35-a951-4cf0aaa6a047@github.com> Message-ID: On Tue, 6 Dec 2022 04:25:58 GMT, Hao Sun wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 30 commits: >> >> - Merge branch 'master' into improveVTest >> - Merge branch 'master' into improveVTest >> - redundant casts >> - remove untaken code paths >> - Merge branch 'master' into improveVTest >> - Merge branch 'master' into improveVTest >> - Merge branch 'master' into improveVTest >> - fix merge problems >> - Merge branch 'master' into improveVTest >> - refactor x86 >> - ... 
>>    and 20 more: https://git.openjdk.org/jdk/compare/2f83b5c4...1fec3d30
>
> I'm running some tests on AArch64 platform (both Neon and SVE).

@shqking Thanks a lot for your reviews, I have addressed those in the commits

-------------

PR: https://git.openjdk.org/jdk/pull/9855

From haosun at openjdk.org Tue Dec 6 14:39:05 2022
From: haosun at openjdk.org (Hao Sun)
Date: Tue, 6 Dec 2022 14:39:05 GMT
Subject: RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v14]
In-Reply-To:
References:
Message-ID:

On Tue, 6 Dec 2022 14:24:39 GMT, Quan Anh Mai wrote:

>> This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and doing a `CmpI` to get back the flags. As a result, `VectorMask::alltrue` is compiled into machine code:
>>
>>     vptest xmm0, xmm1
>>     jb if_true
>>     if_false:
>>
>> instead of:
>>
>>     vptest xmm0, xmm1
>>     setb r10
>>     movzbl r10
>>     testl r10
>>     jne if_true
>>     if_false:
>>
>> The results of `jdk.incubator.vector.ArrayMismatchBenchmark` show noticeable improvements:
>>
>>                                                                                 Before                  After
>> Benchmark                                    Prefix    Size  Mode   Cnt       Score       Error       Score       Error   Units  Change
>> ArrayMismatchBenchmark.mismatchVectorByte       0.5       9  thrpt   10  217345.383 ±  8316.444  222279.381 ±  2660.983  ops/ms   +2.3%
>> ArrayMismatchBenchmark.mismatchVectorByte       0.5     257  thrpt   10  113918.406 ±  1618.836  116268.691 ±  1291.899  ops/ms   +2.1%
>> ArrayMismatchBenchmark.mismatchVectorByte       0.5  100000  thrpt   10     702.066 ±    72.862     797.806 ±    16.429  ops/ms  +13.6%
>> ArrayMismatchBenchmark.mismatchVectorByte       1.0       9  thrpt   10  146096.564 ±  2401.258  145338.910 ±   687.453  ops/ms   -0.5%
>> ArrayMismatchBenchmark.mismatchVectorByte       1.0     257  thrpt   10   60598.181 ±  1259.397   69041.519 ±  1073.156  ops/ms  +13.9%
>> ArrayMismatchBenchmark.mismatchVectorByte       1.0  100000  thrpt   10     316.814 ±    10.975     408.770 ±     5.281  ops/ms  +29.0%
>> ArrayMismatchBenchmark.mismatchVectorDouble     0.5       9  thrpt   10  195674.549 ±  1200.166  188482.433 ±  1872.076  ops/ms   -3.7%
>> ArrayMismatchBenchmark.mismatchVectorDouble     0.5     257  thrpt   10   44357.169 ±   473.013   42293.411 ±  2838.255  ops/ms   -4.7%
>> ArrayMismatchBenchmark.mismatchVectorDouble     0.5  100000  thrpt   10      68.199 ±     5.410      67.628 ±     3.241  ops/ms   -0.8%
>> ArrayMismatchBenchmark.mismatchVectorDouble     1.0       9  thrpt   10  107722.450 ±  1677.607  111060.400 ±   982.230  ops/ms   +3.1%
>> ArrayMismatchBenchmark.mismatchVectorDouble     1.0     257  thrpt   10   16692.645 ±  1002.599   21440.506 ±  1618.266  ops/ms  +28.4%
>> ArrayMismatchBenchmark.mismatchVectorDouble     1.0  100000  thrpt   10      32.984 ±     0.548      33.202 ±     2.365  ops/ms   +0.7%
>> ArrayMismatchBenchmark.mismatchVectorInt        0.5       9  thrpt   10  335458.217 ±  3154.842  379944.254 ±  5703.134  ops/ms  +13.3%
>> ArrayMismatchBenchmark.mismatchVectorInt        0.5     257  thrpt   10   58505.302 ±   786.312   56721.368 ±  2497.052  ops/ms   -3.0%
>> ArrayMismatchBenchmark.mismatchVectorInt        0.5  100000  thrpt   10     133.037 ±    11.415     139.537 ±     4.667  ops/ms   +4.9%
>> ArrayMismatchBenchmark.mismatchVectorInt        1.0       9  thrpt   10  117943.802 ±  2281.349  112409.365 ±  2110.055  ops/ms   -4.7%
>> ArrayMismatchBenchmark.mismatchVectorInt        1.0     257  thrpt   10   27060.015 ±   795.619   33756.613 ±   826.533  ops/ms  +24.7%
>> ArrayMismatchBenchmark.mismatchVectorInt        1.0  100000  thrpt   10      57.558 ±     8.927      66.951 ±     4.381  ops/ms  +16.3%
>> ArrayMismatchBenchmark.mismatchVectorLong       0.5       9  thrpt   10  182963.715 ±  1042.497  182438.405 ±  2120.832  ops/ms   -0.3%
>> ArrayMismatchBenchmark.mismatchVectorLong       0.5     257  thrpt   10   36672.215 ±   614.821   35397.398 ±  1609.235  ops/ms   -3.5%
>> ArrayMismatchBenchmark.mismatchVectorLong       0.5  100000  thrpt   10      66.438 ±     2.142      65.427 ±     2.270  ops/ms   -1.5%
>> ArrayMismatchBenchmark.mismatchVectorLong       1.0       9  thrpt   10  110393.047 ±   497.853  115165.845 ±  5381.674  ops/ms   +4.3%
>> ArrayMismatchBenchmark.mismatchVectorLong       1.0     257  thrpt   10   14720.765 ±   661.350   19871.096 ±   201.464  ops/ms  +35.0%
>> ArrayMismatchBenchmark.mismatchVectorLong       1.0  100000  thrpt   10      30.760 ±     0.821      31.933 ±     1.352  ops/ms   +3.8%
>>
>> I have not been able to conduct thorough testing on AVX512 and AArch64, so any help would be invaluable. Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>
>   fix test

My test passed.

### Performance testing.

Here is the data on AArch64 Neon.

                                                                                  Before                  After
Benchmark                                    (prefix)  (size)  Mode   Cnt       Score       Error       Score       Error   Units  Change
ArrayMismatchBenchmark.mismatchVectorByte         0.5       9  thrpt    5  188228.938 ±  1304.492  188036.510 ±  1566.468  ops/ms   -0.1%
ArrayMismatchBenchmark.mismatchVectorByte         0.5     257  thrpt    5   50714.740 ±   251.973   52319.498 ±    73.201  ops/ms    3.2%
ArrayMismatchBenchmark.mismatchVectorByte         0.5  100000  thrpt    5     188.442 ±     0.829     226.975 ±     1.196  ops/ms   20.4%
ArrayMismatchBenchmark.mismatchVectorByte         1.0       9  thrpt    5  107464.833 ±     7.967  107461.853 ±    17.613  ops/ms    0.0%
ArrayMismatchBenchmark.mismatchVectorByte         1.0     257  thrpt    5   27873.228 ±   298.765   28854.655 ±   108.520  ops/ms    3.5%
ArrayMismatchBenchmark.mismatchVectorByte         1.0  100000  thrpt    5      91.318 ±     0.032      90.234 ±     0.049  ops/ms   -1.2%
ArrayMismatchBenchmark.mismatchVectorDouble       0.5       9  thrpt    5  104424.609 ±    35.394  111375.725 ±   336.651  ops/ms    6.7%
ArrayMismatchBenchmark.mismatchVectorDouble       0.5     257  thrpt    5    9466.861 ±    46.815    9523.362 ±    12.216  ops/ms    0.6%
ArrayMismatchBenchmark.mismatchVectorDouble       0.5  100000  thrpt    5      21.572 ±     0.054      22.462 ±     0.273  ops/ms    4.1%
ArrayMismatchBenchmark.mismatchVectorDouble       1.0       9  thrpt    5   65201.598 ±  1202.724   70891.579 ±   576.866  ops/ms    8.7%
ArrayMismatchBenchmark.mismatchVectorDouble       1.0     257  thrpt    5    3931.683 ±     0.432    4241.834 ±     0.531  ops/ms    7.9%
ArrayMismatchBenchmark.mismatchVectorDouble       1.0  100000  thrpt    5       9.641 ±     0.005      10.209 ±     0.007  ops/ms    5.9%
ArrayMismatchBenchmark.mismatchVectorInt          0.5       9  thrpt    5  112517.132 ±  1266.658  117607.730 ±   658.935  ops/ms    4.5%
ArrayMismatchBenchmark.mismatchVectorInt          0.5     257  thrpt    5   14627.711 ±   135.210   19549.735 ±   169.208  ops/ms   33.6%
ArrayMismatchBenchmark.mismatchVectorInt          0.5  100000  thrpt    5      40.599 ±     0.116      45.500 ±     0.105  ops/ms   12.1%
ArrayMismatchBenchmark.mismatchVectorInt          1.0       9  thrpt    5   86951.685 ±   770.519   88705.394 ±   112.681  ops/ms    2.0%
ArrayMismatchBenchmark.mismatchVectorInt          1.0     257  thrpt    5    8229.636 ±     6.450    9400.670 ±     0.555  ops/ms   14.2%
ArrayMismatchBenchmark.mismatchVectorInt          1.0  100000  thrpt    5      20.437 ±     0.032      25.996 ±     0.142  ops/ms   27.2%
ArrayMismatchBenchmark.mismatchVectorLong         0.5       9  thrpt    5   98752.429 ±     8.477  106053.731 ±   168.779  ops/ms    7.4%
ArrayMismatchBenchmark.mismatchVectorLong         0.5     257  thrpt    5    9486.680 ±   113.035    9888.039 ±    10.357  ops/ms    4.2%
ArrayMismatchBenchmark.mismatchVectorLong         0.5  100000  thrpt    5      22.884 ±     0.118      22.469 ±     0.096  ops/ms   -1.8%
ArrayMismatchBenchmark.mismatchVectorLong         1.0       9  thrpt    5   72150.373 ±   441.019   71092.863 ±   746.174  ops/ms   -1.5%
ArrayMismatchBenchmark.mismatchVectorLong         1.0     257  thrpt    5    4604.599 ±    42.457    4690.037 ±     7.356  ops/ms    1.9%
ArrayMismatchBenchmark.mismatchVectorLong         1.0  100000  thrpt    5      11.147 ±     0.013      11.423 ±     0.012  ops/ms    2.5%

Here is the data on AArch64 256-bit SVE.

                                                                                  Before                  After
Benchmark                                    (prefix)  (size)  Mode   Cnt       Score       Error       Score       Error   Units  Change
ArrayMismatchBenchmark.mismatchVectorByte         0.5       9  thrpt    5  332188.434 ±  1867.441  326994.114 ±  9458.795  ops/ms   -1.6%
ArrayMismatchBenchmark.mismatchVectorByte         0.5     257  thrpt    5  107444.966 ±  5050.526  100516.133 ±  1436.484  ops/ms   -6.4%
ArrayMismatchBenchmark.mismatchVectorByte         0.5  100000  thrpt    5     440.107 ±     0.135     460.557 ±     0.276  ops/ms    4.6%
ArrayMismatchBenchmark.mismatchVectorByte         1.0       9  thrpt    5  194751.414 ±  1218.965  196489.976 ±    70.422  ops/ms    0.9%
ArrayMismatchBenchmark.mismatchVectorByte         1.0     257  thrpt    5   68305.755 ±   102.463   71301.912 ±   214.791  ops/ms    4.4%
ArrayMismatchBenchmark.mismatchVectorByte         1.0  100000  thrpt    5     213.639 ±     0.310     212.501 ±     0.200  ops/ms   -0.5%
ArrayMismatchBenchmark.mismatchVectorDouble       0.5       9  thrpt    5  184926.046 ±  1429.361  197673.463 ±  2065.066  ops/ms    6.9%
ArrayMismatchBenchmark.mismatchVectorDouble       0.5     257  thrpt    5   27664.974 ±   211.233   30272.798 ±   122.976  ops/ms    9.4%
ArrayMismatchBenchmark.mismatchVectorDouble       0.5  100000  thrpt    5      82.780 ±     0.078      72.316 ±     0.121  ops/ms  -12.6%
ArrayMismatchBenchmark.mismatchVectorDouble       1.0       9  thrpt    5  133433.039 ±    23.047  138097.066 ±   321.764  ops/ms    3.5%
ArrayMismatchBenchmark.mismatchVectorDouble       1.0     257  thrpt    5    9332.847 ±    47.940    9679.395 ±    15.648  ops/ms    3.7%
ArrayMismatchBenchmark.mismatchVectorDouble       1.0  100000  thrpt    5      25.563 ±     0.010      29.525 ±     1.410  ops/ms   15.5%
ArrayMismatchBenchmark.mismatchVectorInt          0.5       9  thrpt    5  409670.146 ± 15888.302  385940.625 ±  6430.431  ops/ms   -5.8%
ArrayMismatchBenchmark.mismatchVectorInt          0.5     257  thrpt    5   36565.150 ±  1295.056   39837.700 ±    82.828  ops/ms    8.9%
ArrayMismatchBenchmark.mismatchVectorInt          0.5  100000  thrpt    5     115.997 ±     0.986     112.612 ±     0.280  ops/ms   -2.9%
ArrayMismatchBenchmark.mismatchVectorInt          1.0       9  thrpt    5  153095.509 ±   760.043  159605.937 ±   114.691  ops/ms    4.3%
ArrayMismatchBenchmark.mismatchVectorInt          1.0     257  thrpt    5   20747.445 ±    28.624   21301.590 ±    64.918  ops/ms    2.7%
ArrayMismatchBenchmark.mismatchVectorInt          1.0  100000  thrpt    5      52.865 ±     0.033      53.757 ±     0.134  ops/ms    1.7%
ArrayMismatchBenchmark.mismatchVectorLong         0.5       9  thrpt    5  177529.884 ±   145.103  178435.461 ±  2410.473  ops/ms    0.5%
ArrayMismatchBenchmark.mismatchVectorLong         0.5     257  thrpt    5   20538.232 ±     7.532   20563.490 ±    53.205  ops/ms    0.1%
ArrayMismatchBenchmark.mismatchVectorLong         0.5  100000  thrpt    5      50.875 ±     0.736      52.826 ±     0.058  ops/ms    3.8%
ArrayMismatchBenchmark.mismatchVectorLong         1.0       9  thrpt    5  135797.506 ±   333.638  138437.942 ±    97.186  ops/ms    1.9%
ArrayMismatchBenchmark.mismatchVectorLong         1.0     257  thrpt    5   10561.460 ±    74.946   10337.813 ±    39.726  ops/ms   -2.1%
ArrayMismatchBenchmark.mismatchVectorLong         1.0  100000  thrpt    5      26.027 ±     0.020      26.224 ±     0.046  ops/ms    0.8%

I think the performance is acceptable.

### Jtreg testing

1) On AArch64 Neon, I ran tier1~3.
2) On AArch64 SVE, I ran the cases under the following directories:

    "test/hotspot/jtreg/compiler/vectorapi/"
    "test/jdk/jdk/incubator/vector/"
    "test/hotspot/jtreg/compiler/vectorization/"

Besides the **CMOVE_I** issue in `TestVectorTest.java` that I mentioned before, all other test cases passed.

-------------

PR: https://git.openjdk.org/jdk/pull/9855

From mdoerr at openjdk.org Tue Dec 6 15:13:44 2022
From: mdoerr at openjdk.org (Martin Doerr)
Date: Tue, 6 Dec 2022 15:13:44 GMT
Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v13]
In-Reply-To:
References:
Message-ID:

> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one.

Martin Doerr has updated the pull request incrementally with one additional commit since the last revision:

  Test can't run when TieredCompilation is switched off.

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/10933/files
  - new: https://git.openjdk.org/jdk/pull/10933/files/2c5d2839..092f9749

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=12
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=11-12

  Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/10933.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/10933/head:pull/10933

PR: https://git.openjdk.org/jdk/pull/10933

From thartmann at openjdk.org Tue Dec 6 15:34:21 2022
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Tue, 6 Dec 2022 15:34:21 GMT
Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v13]
In-Reply-To:
References:
Message-ID:

On Tue, 6 Dec 2022 15:13:44 GMT, Martin Doerr wrote:

>> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one.
>
> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision:
>
>   Test can't run when TieredCompilation is switched off.

I re-submitted testing and will report back once it passed.

-------------

PR: https://git.openjdk.org/jdk/pull/10933

From bulasevich at openjdk.org Tue Dec 6 16:38:01 2022
From: bulasevich at openjdk.org (Boris Ulasevich)
Date: Tue, 6 Dec 2022 16:38:01 GMT
Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v16]
In-Reply-To:
References:
Message-ID:

> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%.
>
> Testing: jtreg hotspot&jdk, Renaissance benchmarks

Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 17 commits:

 - minor api refactoring: start_scope and roll_back instead of position and set_position
 - buffer() returns const array
 - cleanup, rename
 - warning fix
 - add test for buffer grow
 - adding jtreg test for CompressedSparseDataReadStream impl
 - align java impl to cpp impl
 - rewrite the SparseDataWriteStream not to use _curr_byte
 - introduce and call flush() excplicitly, add the gtest
 - minor renaming. adding encoding examples table
 - ... and 7 more: https://git.openjdk.org/jdk/compare/1e468320...e9269942

-------------

Changes: https://git.openjdk.org/jdk/pull/10025/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=15
  Stats: 547 lines in 12 files changed: 519 ins; 0 del; 28 mod
  Patch: https://git.openjdk.org/jdk/pull/10025.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/10025/head:pull/10025

PR: https://git.openjdk.org/jdk/pull/10025

From mdoerr at openjdk.org Tue Dec 6 16:50:10 2022
From: mdoerr at openjdk.org (Martin Doerr)
Date: Tue, 6 Dec 2022 16:50:10 GMT
Subject: RFR: 8297036: Generalize C2 stub mechanism [v10]
In-Reply-To:
References:
Message-ID: <-CSf8Lzpm0Z2J_1_byi0Ob-fkqrKUu4UUVfxUkPLJOs=.1b84a044-af9d-465d-9f9a-86569871085c@github.com>

On Tue, 6 Dec 2022 11:41:12 GMT, Roman Kennke wrote:

>> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism.
>>
>> Testing:
>>  - [x] tier1 (x86_64, x86_32, aarch64)
>>  - [x] tier2 (x86_64, x86_32, aarch64)
>>  - [x] tier3 (x86_64, x86_32, aarch64)
>
> Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:
>
>   Update copyright notices

Thanks for taking care of all platforms.

src/hotspot/cpu/ppc/c2_CodeStubs_ppc.cpp line 34:

> 32:
> 33: int C2SafepointPollStub::max_size() const {
> 34:   return 0;

Max size is 56.

-------------

PR: https://git.openjdk.org/jdk/pull/11188

From rkennke at openjdk.org Tue Dec 6 17:44:54 2022
From: rkennke at openjdk.org (Roman Kennke)
Date: Tue, 6 Dec 2022 17:44:54 GMT
Subject: RFR: 8297036: Generalize C2 stub mechanism [v11]
In-Reply-To:
References:
Message-ID:

> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism.
>
> Testing:
>  - [x] tier1 (x86_64, x86_32, aarch64)
>  - [x] tier2 (x86_64, x86_32, aarch64)
>  - [x] tier3 (x86_64, x86_32, aarch64)

Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 32 additional commits since the last revision:

 - Merge remote-tracking branch 'upstream/master' into JDK-8297036
 - PPC fixes
 - Update copyright notices
 - More renames. Duh
 - Rename C2CodeStub::size() -> max_size()
 - Relax size-check in C2CodeStubList::emit()
 - More RISCV fixes
 - PPC fix
 - x86_32 fix
 - AArch64 parts
 - ...
   and 22 more: https://git.openjdk.org/jdk/compare/9e242cd2...b28f45d5

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/11188/files
  - new: https://git.openjdk.org/jdk/pull/11188/files/fed41e92..b28f45d5

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=10
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=09-10

  Stats: 127039 lines in 2017 files changed: 55993 ins; 51038 del; 20008 mod
  Patch: https://git.openjdk.org/jdk/pull/11188.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/11188/head:pull/11188

PR: https://git.openjdk.org/jdk/pull/11188

From kvn at openjdk.org Tue Dec 6 18:03:11 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 6 Dec 2022 18:03:11 GMT
Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v6]
In-Reply-To:
References:
Message-ID: <5Ygf-piuvS2hRPgsaQS1E_7uVAkoNLiijaiURdtQLmo=.1006efb4-230e-423e-bfa3-2ce263b98a69@github.com>

On Mon, 21 Nov 2022 02:31:34 GMT, Yi Yang wrote:

>> Hi, can I have a review for this patch? [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) recognized the form of `Phi->CastII->AddI` as additional parallel induction variables.
>> In the following program:
>>
>>     class Test {
>>         static int dontInline() {
>>             return 0;
>>         }
>>
>>         static long test(int val, boolean b) {
>>             long ret = 0;
>>             long dArr[] = new long[100];
>>             for (int i = 15; 293 > i; ++i) {
>>                 ret = val;
>>                 int j = 1;
>>                 while (++j < 6) {
>>                     int k = (val--);
>>                     for (long l = i; 1 > l; ) {
>>                         if (k != 0) {
>>                             ret += dontInline();
>>                         }
>>                     }
>>                     if (b) {
>>                         break;
>>                     }
>>                 }
>>             }
>>             return ret;
>>         }
>>
>>         public static void main(String[] args) {
>>             for (int i = 0; i < 1000; i++) {
>>                 test(0, false);
>>             }
>>         }
>>     }
>>
>> `val` is incorrectly matched with the new parallel IV form:
>> ![image](https://user-images.githubusercontent.com/5010047/182059398-fc5204bc-8d95-4e3e-8c66-15776af457b8.png)
>> And C2 further replaces it with newly added nodes, which finally leads to the crash:
>> ![image](https://user-images.githubusercontent.com/5010047/182059498-13148d46-b10f-4e18-b84a-f6b9f626ac7b.png)
>>
>> I think we can add more constraints to the new form. The form of `Phi->CastXX->AddX` appears when using `Preconditions.checkIndex`, and it would be recognized as an additional IV when 1) Phi != phi2, and 2) CastXX is controlled by a RangeCheck (to reflect changes in the `Preconditions.checkIndex` intrinsic).
>
> Yi Yang has updated the pull request incrementally with one additional commit since the last revision:
>
>   whitespace

Marked as reviewed by kvn (Reviewer).

-------------

PR: https://git.openjdk.org/jdk/pull/9695

From kvn at openjdk.org Tue Dec 6 18:03:15 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 6 Dec 2022 18:03:15 GMT
Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v6]
In-Reply-To:
References:
Message-ID:

On Thu, 1 Dec 2022 12:16:48 GMT, Tobias Hartmann wrote:

> That makes sense, you can re-use/extend [JDK-8297307](https://bugs.openjdk.org/browse/JDK-8297307) for that.
>
> Vladimir should also have a look at this again.

I agree with this suggestion.

-------------

PR: https://git.openjdk.org/jdk/pull/9695

From svkamath at openjdk.org Tue Dec 6 18:09:57 2022
From: svkamath at openjdk.org (Smita Kamath)
Date: Tue, 6 Dec 2022 18:09:57 GMT
Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v3]
In-Reply-To:
References:
Message-ID: <_wSDGUYkGwFXB3aWHQWdDhd84w883WSmDWLVVr7SrKo=.8202ab81-c6d2-4e26-9546-e6218781c850@github.com>

> Hi All,
>
> I have added changes for autovectorizing the Float.float16ToFloat and Float.floatToFloat16 APIs.
> Following are the performance numbers of the JMH micro Fp16ConversionBenchmark:
> Before code changes:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms
>
> After:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms
>
> Kindly review and share your feedback.
>
> Thanks.
> Smita

Smita Kamath has updated the pull request incrementally with one additional commit since the last revision:

  Updated instruction definition

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/11471/files
  - new: https://git.openjdk.org/jdk/pull/11471/files/8e7f884d..52cefb88

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=11471&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11471&range=01-02

  Stats: 7 lines in 3 files changed: 0 ins; 2 del; 5 mod
  Patch: https://git.openjdk.org/jdk/pull/11471.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/11471/head:pull/11471

PR: https://git.openjdk.org/jdk/pull/11471

From kvn at openjdk.org Tue Dec 6 18:16:22 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 6 Dec 2022 18:16:22 GMT
Subject: RFR: 8297036: Generalize C2 stub mechanism [v11]
In-Reply-To:
References:
Message-ID:

On Tue, 6 Dec 2022 17:44:54 GMT, Roman Kennke wrote:

>> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism.
>>
>> Testing:
>>  - [x] tier1 (x86_64, x86_32, aarch64)
>>  - [x] tier2 (x86_64, x86_32, aarch64)
>>  - [x] tier3 (x86_64, x86_32, aarch64)
>
> Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 32 additional commits since the last revision:
>
>  - Merge remote-tracking branch 'upstream/master' into JDK-8297036
>  - PPC fixes
>  - Update copyright notices
>  - More renames. Duh
>  - Rename C2CodeStub::size() -> max_size()
>  - Relax size-check in C2CodeStubList::emit()
>  - More RISCV fixes
>  - PPC fix
>  - x86_32 fix
>  - AArch64 parts
>  - ... and 22 more: https://git.openjdk.org/jdk/compare/18233c9c...b28f45d5

This looks good now. Let me run it through our testing.

-------------

PR: https://git.openjdk.org/jdk/pull/11188

From kvn at openjdk.org Tue Dec 6 18:35:11 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 6 Dec 2022 18:35:11 GMT
Subject: RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v14]
In-Reply-To:
References:
Message-ID:

On Tue, 6 Dec 2022 14:24:39 GMT, Quan Anh Mai wrote:

>> This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and doing a `CmpI` to get back the flags. As a result, `VectorMask::alltrue` is compiled into machine code:
>>
>>     vptest xmm0, xmm1
>>     jb if_true
>>     if_false:
>>
>> instead of:
>>
>>     vptest xmm0, xmm1
>>     setb r10
>>     movzbl r10
>>     testl r10
>>     jne if_true
>>     if_false:
>>
>> The results of `jdk.incubator.vector.ArrayMismatchBenchmark` show noticeable improvements:
>>
>>                                                                                 Before                  After
>> Benchmark                                    Prefix    Size  Mode   Cnt       Score       Error       Score       Error   Units  Change
>> ArrayMismatchBenchmark.mismatchVectorByte       0.5       9  thrpt   10  217345.383 ±  8316.444  222279.381 ±  2660.983  ops/ms   +2.3%
>> ArrayMismatchBenchmark.mismatchVectorByte       0.5     257  thrpt   10  113918.406 ±  1618.836  116268.691 ±  1291.899  ops/ms   +2.1%
>> ArrayMismatchBenchmark.mismatchVectorByte       0.5  100000  thrpt   10     702.066 ±    72.862     797.806 ±    16.429  ops/ms  +13.6%
>> ArrayMismatchBenchmark.mismatchVectorByte       1.0       9  thrpt   10  146096.564 ±  2401.258  145338.910 ±   687.453  ops/ms   -0.5%
>> ArrayMismatchBenchmark.mismatchVectorByte       1.0     257  thrpt   10   60598.181 ±  1259.397   69041.519 ±  1073.156  ops/ms  +13.9%
>> ArrayMismatchBenchmark.mismatchVectorByte       1.0  100000  thrpt   10     316.814 ±    10.975     408.770 ±     5.281  ops/ms  +29.0%
>> ArrayMismatchBenchmark.mismatchVectorDouble     0.5       9  thrpt   10  195674.549 ±  1200.166  188482.433 ±  1872.076  ops/ms   -3.7%
>> ArrayMismatchBenchmark.mismatchVectorDouble     0.5     257  thrpt   10   44357.169 ±   473.013   42293.411 ±  2838.255  ops/ms   -4.7%
>> ArrayMismatchBenchmark.mismatchVectorDouble     0.5  100000  thrpt   10      68.199 ±     5.410      67.628 ±     3.241  ops/ms   -0.8%
>> ArrayMismatchBenchmark.mismatchVectorDouble     1.0       9  thrpt   10  107722.450 ±  1677.607  111060.400 ±   982.230  ops/ms   +3.1%
>> ArrayMismatchBenchmark.mismatchVectorDouble     1.0     257  thrpt   10   16692.645 ±  1002.599   21440.506 ±  1618.266  ops/ms  +28.4%
>> ArrayMismatchBenchmark.mismatchVectorDouble     1.0  100000  thrpt   10      32.984 ±     0.548      33.202 ±     2.365  ops/ms   +0.7%
>> ArrayMismatchBenchmark.mismatchVectorInt        0.5       9  thrpt   10  335458.217 ±  3154.842  379944.254 ±  5703.134  ops/ms  +13.3%
>> ArrayMismatchBenchmark.mismatchVectorInt        0.5     257  thrpt   10   58505.302 ±   786.312   56721.368 ±  2497.052  ops/ms   -3.0%
>> ArrayMismatchBenchmark.mismatchVectorInt        0.5  100000  thrpt   10     133.037 ±    11.415     139.537 ±     4.667  ops/ms   +4.9%
>> ArrayMismatchBenchmark.mismatchVectorInt        1.0       9  thrpt   10  117943.802 ±  2281.349  112409.365 ±  2110.055  ops/ms   -4.7%
>> ArrayMismatchBenchmark.mismatchVectorInt        1.0     257  thrpt   10   27060.015 ±   795.619   33756.613 ±   826.533  ops/ms  +24.7%
>> ArrayMismatchBenchmark.mismatchVectorInt        1.0  100000  thrpt   10      57.558 ±     8.927      66.951 ±     4.381  ops/ms  +16.3%
>> ArrayMismatchBenchmark.mismatchVectorLong       0.5       9  thrpt   10  182963.715 ±  1042.497  182438.405 ±  2120.832  ops/ms   -0.3%
>> ArrayMismatchBenchmark.mismatchVectorLong       0.5     257  thrpt   10   36672.215 ±   614.821   35397.398 ±  1609.235  ops/ms   -3.5%
>> ArrayMismatchBenchmark.mismatchVectorLong       0.5  100000  thrpt   10      66.438 ±     2.142      65.427 ±     2.270  ops/ms   -1.5%
>> ArrayMismatchBenchmark.mismatchVectorLong       1.0       9  thrpt   10  110393.047 ±   497.853  115165.845 ±  5381.674  ops/ms   +4.3%
>> ArrayMismatchBenchmark.mismatchVectorLong       1.0     257  thrpt   10   14720.765 ±   661.350   19871.096 ±   201.464  ops/ms  +35.0%
>> ArrayMismatchBenchmark.mismatchVectorLong       1.0  100000  thrpt   10      30.760 ±     0.821      31.933 ±     1.352  ops/ms   +3.8%
>>
>> I have not been able to conduct thorough testing on AVX512 and AArch64, so any help would be invaluable. Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>
>   fix test

I would need to run testing again and also do our performance testing.

-------------

PR: https://git.openjdk.org/jdk/pull/9855

From sviswanathan at openjdk.org Tue Dec 6 18:41:38 2022
From: sviswanathan at openjdk.org (Sandhya Viswanathan)
Date: Tue, 6 Dec 2022 18:41:38 GMT
Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v3]
In-Reply-To: <_wSDGUYkGwFXB3aWHQWdDhd84w883WSmDWLVVr7SrKo=.8202ab81-c6d2-4e26-9546-e6218781c850@github.com>
References: <_wSDGUYkGwFXB3aWHQWdDhd84w883WSmDWLVVr7SrKo=.8202ab81-c6d2-4e26-9546-e6218781c850@github.com>
Message-ID: <5Vu48lKIDMQMBOC06ZGqCZnnxCYSmlndcsCAspjHl2M=.b21e153e-b609-45c8-ba89-48443360710d@github.com>

On Tue, 6 Dec 2022 18:09:57 GMT, Smita Kamath wrote:

>> Hi All,
>>
>> I have added changes for autovectorizing the Float.float16ToFloat and Float.floatToFloat16 APIs.
>> Following are the performance numbers of the JMH micro Fp16ConversionBenchmark:
>> Before code changes:
>> Benchmark | (size) | Mode | Cnt | Score | Error | Units
>> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms
>> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms
>> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms
>> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms
>>
>> After:
>> Benchmark | (size) | Mode | Cnt | Score | Error | Units
>> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms
>> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms
>> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms
>> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms
>>
>> Kindly review and share your feedback.
>>
>> Thanks.
>> Smita
>
> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision:
>
>   Updated instruction definition

src/hotspot/cpu/x86/assembler_x86.cpp line 1966:

> 1964:
> 1965: void Assembler::vcvtph2ps(XMMRegister dst, XMMRegister src, int vector_len) {
> 1966:   assert(VM_Version::supports_avx512vl() || VM_Version::supports_f16c(), "");

This should be VM_Version::supports_evex(). Also same for vcvtps2ph.

-------------

PR: https://git.openjdk.org/jdk/pull/11471

From kvn at openjdk.org Tue Dec 6 18:42:17 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 6 Dec 2022 18:42:17 GMT
Subject: RFR: 8297036: Generalize C2 stub mechanism [v7]
In-Reply-To:
References:
Message-ID:

On Tue, 6 Dec 2022 09:46:32 GMT, Roman Kennke wrote:

>> Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Relax size-check in C2CodeStubList::emit()
>
> I need somebody to run this on PPC to figure out the actual stub size for C2SafepointPollStub. @tstuefe could you try this? Thanks!

@rkennke I got a bad COPYRIGHT line error. You missed the `,` after the second year in c2_CodeStubs_aarch64.cpp and c2_CodeStubs_x86.cpp:

 * Copyright (c) 2020, 2022 Oracle and/or its affiliates. All rights reserved.

-------------

PR: https://git.openjdk.org/jdk/pull/11188

From rkennke at openjdk.org Tue Dec 6 18:46:12 2022
From: rkennke at openjdk.org (Roman Kennke)
Date: Tue, 6 Dec 2022 18:46:12 GMT
Subject: RFR: 8297036: Generalize C2 stub mechanism [v12]
In-Reply-To:
References:
Message-ID:

> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism.
>
> Testing:
>  - [x] tier1 (x86_64, x86_32, aarch64)
>  - [x] tier2 (x86_64, x86_32, aarch64)
>  - [x] tier3 (x86_64, x86_32, aarch64)

Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:

  Fix copyrights

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/11188/files
  - new: https://git.openjdk.org/jdk/pull/11188/files/b28f45d5..e718ba6f

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=11
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=10-11

  Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/11188.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/11188/head:pull/11188

PR: https://git.openjdk.org/jdk/pull/11188

From kvn at openjdk.org Tue Dec 6 18:51:34 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 6 Dec 2022 18:51:34 GMT
Subject: RFR: 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph [v2]
In-Reply-To: <2pH3MQY6NAKPVlJBBwErxpjLM_ihnrKu7Z89fC_vd3E=.9868d268-55dd-4994-9103-3704e62a7755@github.com>
References: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com> <2pH3MQY6NAKPVlJBBwErxpjLM_ihnrKu7Z89fC_vd3E=.9868d268-55dd-4994-9103-3704e62a7755@github.com>
Message-ID:

On Tue, 6 Dec 2022 10:25:35 GMT, Christian Hagedorn wrote:

>> The test cases of this bug reveal the same problem with `PhaseIdealLoop::create_new_if_for_predicate()` as in [JDK-8271954](https://bugs.openjdk.org/browse/JDK-8271954) (reusing pinned data nodes for different UCT paths into the UCT phi - see PR description of https://github.com/openjdk/jdk/pull/5185). While JDK-8271954 only fixed the usage of `PhaseIdealLoop::create_new_if_for_predicate()` for one specific case in loop unswitching, we now need this fix for other usages of `PhaseIdealLoop::create_new_if_for_predicate()` as well. I've found failing cases for most of the usages but I think we should always do a proper cloning as originally added with JDK-8271954. This is what I propose with this patch.
>>
>> To always do this cloning, I've replaced the `UnswitchingAction` by a `rewire_uncommon_proj_phi_inputs` bool that is false by default but can be set if we should only do a rewiring (we can still do the rewiring for the slow loop as previously done with `UnswitchingAction::SlowLoopRewiring`, also see https://github.com/openjdk/jdk/pull/5185 for more details). However, the current fix does not update ctrl for the slow loop nodes - I've fixed that.
>>
>> I've reused the implementation of `clone_data_nodes_for_fast_loop()` but refactored and split that method into multiple methods to reuse some of them for the ctrl update of the slow loop nodes.
>>
>> Thanks,
>> Christian
>
> Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains four additional commits since the last revision: > > - Add ResourceMark > - Merge branch 'master' into JDK-8290850 > - Fix whitespaces > - 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11452 From kvn at openjdk.org Tue Dec 6 19:03:30 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 6 Dec 2022 19:03:30 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v13] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 15:13:44 GMT, Martin Doerr wrote: >> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Test can't run when TieredCompilation is switched off. Tobias's testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10933 From kvn at openjdk.org Tue Dec 6 19:09:18 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 6 Dec 2022 19:09:18 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v13] In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 14:29:27 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. 
>> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant may overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned division, the situation is somewhat trickier because the condition needs to be twice as strong (conditions (1) and (2) above are mostly the same). As a result, the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow a uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows a uint64 or `(x + 1) * c2` never overflows a uint64. And we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. I have written the details as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and review. Thank you very much. 
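[Archive annotation] Condition (1) quoted above can be illustrated with a small standalone Java sketch. It is not the C2 code from the pull request: the class `MagicDivision` and its methods are hypothetical names, it covers only non-negative int dividends, and it picks `m = 31 + ceil(log2(d))` and `c = ceil(2**m / d)`, which is one standard choice and possibly not the one the patch uses. With that choice `c <= 2**32`, so `x * c` stays below `2**63` and fits in a signed long, matching the remark that the int32 product cannot overflow an int64.

```java
final class MagicDivision {
    /** Returns {c, m} such that floor(x / d) == (x * c) >> m for every 0 <= x < 2^31. */
    static long[] magic(int d) {
        if (d <= 0) {
            throw new IllegalArgumentException("divisor must be positive");
        }
        // l = ceil(log2(d)); numberOfLeadingZeros(0) is 32, so d == 1 gives l == 0.
        int l = 32 - Integer.numberOfLeadingZeros(d - 1);
        int m = 31 + l;
        // c = ceil(2^m / d); m <= 62, so the numerator fits in a signed long.
        long c = ((1L << m) + d - 1) / d;
        return new long[] {c, m};
    }

    /** Divides a non-negative x by a positive d using one multiply and one shift. */
    static int divide(int x, int d) {
        if (x < 0) {
            throw new IllegalArgumentException("this sketch only handles x >= 0");
        }
        long[] cm = magic(d);
        // x is widened to long, so the product is a full 64-bit multiplication.
        return (int) ((x * cm[0]) >> (int) cm[1]);
    }

    public static void main(String[] args) {
        int[] divisors = {1, 3, 7, 10, 641, Integer.MAX_VALUE};
        int[] dividends = {0, 1, 6, 7, 100, 1_000_000_007, Integer.MAX_VALUE};
        for (int d : divisors) {
            for (int x : dividends) {
                if (divide(x, d) != x / d) {
                    throw new AssertionError("mismatch for " + x + " / " + d);
                }
            }
        }
        System.out.println("all quotients match");
    }
}
```

Negative dividends would need the `+ 1` correction from condition (2), and the unsigned case needs the wider constants discussed above; both are left out here on purpose.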
> > Quan Anh Mai has updated the pull request incrementally with seven additional commits since the last revision: > > - change julong to uint64_t > - uint > - various fixes > - add constexpr > - add constexpr > - add message to static_assert > - missing powerOfTwo.hpp I would need more time to review it again and do new testing. I suggest deferring it until after JDK 20 is forked and targeting JDK 21 instead. ------------- PR: https://git.openjdk.org/jdk/pull/9947 From svkamath at openjdk.org Tue Dec 6 19:16:16 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Tue, 6 Dec 2022 19:16:16 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v4] In-Reply-To: References: Message-ID: > Hi All, > > I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 APIs. > Following are the performance numbers of JMH micro Fp16ConversionBenchmark: > Before code changes: > Benchmark | (size) | Mode | Cnt | Score | Error | Units > Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms > Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms > Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms > Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms > > After: > Benchmark | (size) | Mode | Cnt | Score | Error | Units > Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms > Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms > Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms > Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms > > Kindly review and share your feedback. > > Thanks. > Smita Smita Kamath has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains four commits: - Merge master - Updated instruction definition - Updated code as per review comments - Auto vectorize half precision floating point conversion APIs ------------- Changes: https://git.openjdk.org/jdk/pull/11471/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11471&range=03 Stats: 214 lines in 11 files changed: 212 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11471.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11471/head:pull/11471 PR: https://git.openjdk.org/jdk/pull/11471 From svkamath at openjdk.org Tue Dec 6 19:58:20 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Tue, 6 Dec 2022 19:58:20 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v5] In-Reply-To: References: Message-ID: > Hi All, > > I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. > Following are the performance numbers of JMH micro Fp16ConversionBenchmark: > Before code changes: > Benchmark | (size) | Mode | Cnt | Score | Error | Units > Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ????? 0.041 | ops/ms > Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ? 11765.453 | ops/ms > Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ????? 0.653 | ops/ms > Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ??? 361.696 | ops/ms > > After: > Benchmark | (size) | Mode | Cnt | Score | Error | Units > Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 |? 372.327 | ops/ms > Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 |? 9250.899 |ops/ms > Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 |? 483.034 | ops/ms > Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 |? 150.296 | ops/ms > > Kindly review and share your feedback. > > Thanks. 
> Smita Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: Addressed review comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11471/files - new: https://git.openjdk.org/jdk/pull/11471/files/a0b4f969..09816f70 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11471&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11471&range=03-04 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/11471.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11471/head:pull/11471 PR: https://git.openjdk.org/jdk/pull/11471 From sviswanathan at openjdk.org Tue Dec 6 20:04:44 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 6 Dec 2022 20:04:44 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v5] In-Reply-To: References: Message-ID: <8Cz_-VUvqpn7M9Dnl75SFQ6HMAcNWGYk6VkvxQoZSeQ=.037d99b2-0e1c-46ab-85f6-e9a2a24ff55a@github.com> On Tue, 6 Dec 2022 19:58:20 GMT, Smita Kamath wrote: >> Hi All, >> >> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. >> Following are the performance numbers of JMH micro Fp16ConversionBenchmark: >> Before code changes: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ????? 0.041 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ? 11765.453 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ????? 0.653 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ??? 361.696 | ops/ms >> >> After: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 |? 372.327 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 |? 
9250.899 |ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 |? 483.034 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 |? 150.296 | ops/ms >> >> Kindly review and share your feedback. >> >> Thanks. >> Smita > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comment Marked as reviewed by sviswanathan (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11471 From sviswanathan at openjdk.org Tue Dec 6 20:04:47 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 6 Dec 2022 20:04:47 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs In-Reply-To: References: Message-ID: On Fri, 2 Dec 2022 04:26:01 GMT, Smita Kamath wrote: >> Hi All, >> >> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. >> Following are the performance numbers of JMH micro Fp16ConversionBenchmark: >> Before code changes: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ????? 0.041 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ? 11765.453 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ????? 0.653 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ??? 361.696 | ops/ms >> >> After: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 |? 372.327 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 |? 9250.899 |ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 |? 483.034 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 |? 
150.296 | ops/ms >> >> Kindly review and share your feedback. >> >> Thanks. >> Smita > > label /hotspot @smita-kamath The patch looks good to me. You will need another review. @vnkozlov could you please help review this patch? ------------- PR: https://git.openjdk.org/jdk/pull/11471 From mdoerr at openjdk.org Tue Dec 6 20:57:07 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 6 Dec 2022 20:57:07 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v13] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 15:13:44 GMT, Martin Doerr wrote: >> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Test can't run when TieredCompilation is switched off. Thanks for testing! ------------- PR: https://git.openjdk.org/jdk/pull/10933 From mdoerr at openjdk.org Tue Dec 6 20:59:17 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 6 Dec 2022 20:59:17 GMT Subject: Integrated: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 13:13:46 GMT, Martin Doerr wrote: > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. This pull request has now been integrated. 
Changeset: cd2182a9 Author: Martin Doerr URL: https://git.openjdk.org/jdk/commit/cd2182a9967917e733e486d918e9aeba3bd35ee8 Stats: 147 lines in 5 files changed: 144 ins; 0 del; 3 mod 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic Reviewed-by: kvn, dlong ------------- PR: https://git.openjdk.org/jdk/pull/10933 From sspitsyn at openjdk.org Tue Dec 6 23:04:06 2022 From: sspitsyn at openjdk.org (Serguei Spitsyn) Date: Tue, 6 Dec 2022 23:04:06 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: <6tiK8h3MQgoNTHVnRtLGFJmH2HycabKnQvpE3PL413Q=.298830ac-77b9-4916-a568-10aba857b348@github.com> Message-ID: On Tue, 29 Nov 2022 22:30:10 GMT, Daniel D. Daugherty wrote: >> Sorry, I was not clear. >> The Fuzz.java has this order: >> >> +import jdk.test.lib.Platform; >> +import jtreg.SkippedException; >> >> I thought, you ordered imports by names. Then it is better to keep this order unified. >> It is really minor though. > > Sorry I'm still confused. As far as I can see, I've added the imports the > same way in both Fuzz.java and TestRedirectLinks.java. > > And the imports are in sort order: > 'jdk' comes before 'jtreg' and 'Platform' comes before 'SkippedException'. Sorry, copied fragment from a wrong file. This file has imports out of order: test/langtools/jdk/javadoc/doclet/testLinkOption/TestRedirectLinks.java + * @build jtreg.SkippedException + * @build jdk.test.lib.Platform ------------- PR: https://git.openjdk.org/jdk/pull/11278 From dcubed at openjdk.org Tue Dec 6 23:04:07 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Tue, 6 Dec 2022 23:04:07 GMT Subject: Integrated: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 22:55:40 GMT, Daniel D. 
Daugherty wrote: > Misc stress testing related fixes: > > [JDK-8295424](https://bugs.openjdk.org/browse/JDK-8295424) adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest > [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) disable TestRedirectLinks.java in slowdebug mode > [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) disable Fuzz.java in slowdebug mode This pull request has now been integrated. Changeset: 6e547052 Author: Daniel D. Daugherty URL: https://git.openjdk.org/jdk/commit/6e5470525d5236901c219146f363d4860e6b8008 Stats: 17 lines in 3 files changed: 16 ins; 0 del; 1 mod 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest 8297367: disable TestRedirectLinks.java in slowdebug mode 8297369: disable Fuzz.java in slowdebug mode Reviewed-by: sspitsyn, jjg, cjplummer, lmesnik ------------- PR: https://git.openjdk.org/jdk/pull/11278 From kvn at openjdk.org Tue Dec 6 23:30:52 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 6 Dec 2022 23:30:52 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v5] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 19:58:20 GMT, Smita Kamath wrote: >> Hi All, >> >> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. >> Following are the performance numbers of JMH micro Fp16ConversionBenchmark: >> Before code changes: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ????? 0.041 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ? 11765.453 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ????? 0.653 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ??? 
361.696 | ops/ms >> >> After: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms >> >> Kindly review and share your feedback. >> >> Thanks. >> Smita > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comment Changes are straightforward but I have a few comments. And we need to test it again. src/hotspot/cpu/x86/assembler_x86.cpp line 1958: > 1956: InstructionMark im(this); > 1957: InstructionAttr attributes(vector_len, /* rex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /*uses_vl */ true); > 1958: attributes.set_address_attributes(/* tuple_type */ EVEX_HVM, /* input_size_in_bits */ EVEX_NObit); Is it correct to set `EVEX_*` attributes in case EVEX is switched off (by `UseAVX` flag)? src/hotspot/cpu/x86/vm_version_x86.cpp line 959: > 957: _features &= ~CPU_AVX; > 958: _features &= ~CPU_VZEROUPPER; > 959: _features &= ~CPU_F16C; Does `is_knights_family()` support `f16c`? We switch off some avx512 features for it. But it looks like `f16c` is not connected to `avx512`. 
------------- PR: https://git.openjdk.org/jdk/pull/11471 From kvn at openjdk.org Tue Dec 6 23:37:15 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 6 Dec 2022 23:37:15 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v5] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 23:26:46 GMT, Vladimir Kozlov wrote: >> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comment > > src/hotspot/cpu/x86/assembler_x86.cpp line 1958: > >> 1956: InstructionMark im(this); >> 1957: InstructionAttr attributes(vector_len, /* rex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /*uses_vl */ true); >> 1958: attributes.set_address_attributes(/* tuple_type */ EVEX_HVM, /* input_size_in_bits */ EVEX_NObit); > > Is it correct to set `EVEX_*` attributes in case EVEX is switched off (by `UseAVX` flag)? Or a CPU supports F16C but does not EVEX (avx512f). ------------- PR: https://git.openjdk.org/jdk/pull/11471 From svkamath at openjdk.org Tue Dec 6 23:48:25 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Tue, 6 Dec 2022 23:48:25 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v6] In-Reply-To: References: Message-ID: > Hi All, > > I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. > Following are the performance numbers of JMH micro Fp16ConversionBenchmark: > Before code changes: > Benchmark | (size) | Mode | Cnt | Score | Error | Units > Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ????? 0.041 | ops/ms > Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ? 11765.453 | ops/ms > Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ????? 0.653 | ops/ms > Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ??? 
361.696 | ops/ms > > After: > Benchmark | (size) | Mode | Cnt | Score | Error | Units > Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 |? 372.327 | ops/ms > Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 |? 9250.899 |ops/ms > Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 |? 483.034 | ops/ms > Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 |? 150.296 | ops/ms > > Kindly review and share your feedback. > > Thanks. > Smita Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: Update test case ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11471/files - new: https://git.openjdk.org/jdk/pull/11471/files/09816f70..981ea9f4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11471&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11471&range=04-05 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11471.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11471/head:pull/11471 PR: https://git.openjdk.org/jdk/pull/11471 From svkamath at openjdk.org Tue Dec 6 23:50:17 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Tue, 6 Dec 2022 23:50:17 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v5] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 23:23:09 GMT, Vladimir Kozlov wrote: >> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comment > > src/hotspot/cpu/x86/vm_version_x86.cpp line 959: > >> 957: _features &= ~CPU_AVX; >> 958: _features &= ~CPU_VZEROUPPER; >> 959: _features &= ~CPU_F16C; > > Is `is_knights_family()` supports `f16c`? We switch off some avx512 features for it. But it looks like `f16c` is not connected to `avx512`. Hi Vladimir, you're correct that f16c is not connected to avx512. 
------------- PR: https://git.openjdk.org/jdk/pull/11471 From svkamath at openjdk.org Tue Dec 6 23:56:09 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Tue, 6 Dec 2022 23:56:09 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v5] In-Reply-To: References: Message-ID: <0tJZraVN1TpTIeH1ZfeKYZYERzLW5ZUl14ZYHIH2cmk=.4a10d3d1-6968-4f05-9846-ad250e8f9eaa@github.com> On Tue, 6 Dec 2022 23:34:40 GMT, Vladimir Kozlov wrote: >> src/hotspot/cpu/x86/assembler_x86.cpp line 1958: >> >>> 1956: InstructionMark im(this); >>> 1957: InstructionAttr attributes(vector_len, /* rex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /*uses_vl */ true); >>> 1958: attributes.set_address_attributes(/* tuple_type */ EVEX_HVM, /* input_size_in_bits */ EVEX_NObit); >> >> Is it correct to set `EVEX_*` attributes in case EVEX is switched off (by `UseAVX` flag)? > > Or a CPU supports F16C but does not EVEX (avx512f). Hi Vladimir, we have a prior example of vpaddb instruction where these attributes are set. The assembler will ignore these attributes if UseAVX < 3. ------------- PR: https://git.openjdk.org/jdk/pull/11471 From tsteele at openjdk.org Wed Dec 7 00:14:11 2022 From: tsteele at openjdk.org (Tyler Steele) Date: Wed, 7 Dec 2022 00:14:11 GMT Subject: RFR: 8298225: [AIX] Disable PPC64LE continuations on AIX Message-ID: This small change adds an import to the generated ad_ppc.cpp file to allow it to find Continuations::enabled, and sets VMContinuations to false on AIX. 
------------- Depends on: https://git.openjdk.org/jdk/pull/11546 Commit messages: - Add continuation.hpp to adlc/main.cpp - Set VMContinuations to false on AIX Changes: https://git.openjdk.org/jdk/pull/11550/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11550&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298225 Stats: 2 lines in 2 files changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11550.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11550/head:pull/11550 PR: https://git.openjdk.org/jdk/pull/11550 From jbhateja at openjdk.org Wed Dec 7 00:17:10 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 7 Dec 2022 00:17:10 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v6] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 23:48:25 GMT, Smita Kamath wrote: >> Hi All, >> >> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. >> Following are the performance numbers of JMH micro Fp16ConversionBenchmark: >> Before code changes: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ????? 0.041 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ? 11765.453 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ????? 0.653 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ??? 361.696 | ops/ms >> >> After: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 |? 372.327 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 |? 9250.899 |ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 |? 483.034 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 |? 
150.296 | ops/ms >> >> Kindly review and share your feedback. >> >> Thanks. >> Smita > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Update test case test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java line 54: > 52: > 53: @Test > 54: @IR(counts = {IRNode.VECTOR_CAST_F2H, "> 0"}, applyIfCPUFeature = {"avx512f", "true"}) You can also add "F16C" to the feature list and use applyIfCPUFeatureOr. ------------- PR: https://git.openjdk.org/jdk/pull/11471 From kvn at openjdk.org Wed Dec 7 00:21:00 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 7 Dec 2022 00:21:00 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v5] In-Reply-To: <0tJZraVN1TpTIeH1ZfeKYZYERzLW5ZUl14ZYHIH2cmk=.4a10d3d1-6968-4f05-9846-ad250e8f9eaa@github.com> References: <0tJZraVN1TpTIeH1ZfeKYZYERzLW5ZUl14ZYHIH2cmk=.4a10d3d1-6968-4f05-9846-ad250e8f9eaa@github.com> Message-ID: On Tue, 6 Dec 2022 23:53:45 GMT, Smita Kamath wrote: >> Or a CPU supports F16C but does not EVEX (avx512f). > > Hi Vladimir, we have a prior example of vpaddb instruction where these attributes are set. The assembler will ignore these attributes if UseAVX < 3. Good. Thank you for answering my questions. ------------- PR: https://git.openjdk.org/jdk/pull/11471 From jbhateja at openjdk.org Wed Dec 7 00:21:04 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 7 Dec 2022 00:21:04 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v6] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 23:48:25 GMT, Smita Kamath wrote: >> Hi All, >> >> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. 
>> Following are the performance numbers of JMH micro Fp16ConversionBenchmark: >> Before code changes: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms >> >> After: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms >> >> Kindly review and share your feedback. >> >> Thanks. >> Smita > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Update test case test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java line 29: > 27: * @summary Auto-vectorize Float.floatToFloat16, Float.float16ToFloat API's > 28: * @requires vm.compiler2.enabled > 29: * @requires vm.cpu.features ~= ".*avx.*" The test may also execute on a target that has F16C; you can remove this CPU feature requirement, since the IR annotations already have a feature check. 
------------- PR: https://git.openjdk.org/jdk/pull/11471 From jbhateja at openjdk.org Wed Dec 7 00:24:44 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 7 Dec 2022 00:24:44 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v6] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 23:48:25 GMT, Smita Kamath wrote: >> Hi All, >> >> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. >> Following are the performance numbers of JMH micro Fp16ConversionBenchmark: >> Before code changes: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms >> >> After: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms >> >> Kindly review and share your feedback. >> >> Thanks. >> Smita > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Update test case Verified that my comments are addressed. The IR test is enabled for AVX, but can also be enabled for F16C, since some VM features can be selectively enabled in instances. ------------- Marked as reviewed by jbhateja (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11471 From kvn at openjdk.org Wed Dec 7 00:24:46 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 7 Dec 2022 00:24:46 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v6] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 00:15:02 GMT, Jatin Bhateja wrote: >> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test case > > test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java line 54: > >> 52: >> 53: @Test >> 54: @IR(counts = {IRNode.VECTOR_CAST_F2H, "> 0"}, applyIfCPUFeature = {"avx512f", "true"}) > > You can add "FC16" also in the feature list and use applyIfCPUFeaturesOr This is good suggestion. ------------- PR: https://git.openjdk.org/jdk/pull/11471 From svkamath at openjdk.org Wed Dec 7 00:46:44 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Wed, 7 Dec 2022 00:46:44 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v7] In-Reply-To: References: Message-ID: > Hi All, > > I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. > Following are the performance numbers of JMH micro Fp16ConversionBenchmark: > Before code changes: > Benchmark | (size) | Mode | Cnt | Score | Error | Units > Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ????? 0.041 | ops/ms > Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ? 11765.453 | ops/ms > Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ????? 0.653 | ops/ms > Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ??? 361.696 | ops/ms > > After: > Benchmark | (size) | Mode | Cnt | Score | Error | Units > Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 |? 
372.327 | ops/ms > Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms > Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms > Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms > > Kindly review and share your feedback. > > Thanks. > Smita Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: Updated test case ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11471/files - new: https://git.openjdk.org/jdk/pull/11471/files/981ea9f4..4b1e1270 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11471&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11471&range=05-06 Stats: 3 lines in 1 file changed: 0 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11471.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11471/head:pull/11471 PR: https://git.openjdk.org/jdk/pull/11471 From svkamath at openjdk.org Wed Dec 7 01:12:08 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Wed, 7 Dec 2022 01:12:08 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v5] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 23:27:55 GMT, Vladimir Kozlov wrote: >> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comment > > Changes are straightforward but I have a few comments. > > And we need to test it again. @vnkozlov I have made the requested changes. Could you please run it through your testing if the code looks good to you? Thanks a lot.
------------- PR: https://git.openjdk.org/jdk/pull/11471 From kvn at openjdk.org Wed Dec 7 01:27:05 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 7 Dec 2022 01:27:05 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v7] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 00:46:44 GMT, Smita Kamath wrote: >> Hi All, >> >> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 APIs. >> Following are the performance numbers of JMH micro Fp16ConversionBenchmark: >> Before code changes: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms >> >> After: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms >> >> Kindly review and share your feedback. >> >> Thanks. >> Smita > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Updated test case I started testing for version 06.
------------- PR: https://git.openjdk.org/jdk/pull/11471 From kvn at openjdk.org Wed Dec 7 01:49:08 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 7 Dec 2022 01:49:08 GMT Subject: RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v14] In-Reply-To: References: Message-ID: <6ZglmG06MHiOYd8prubF1p0U6epmNbaCO8d9f__10Nc=.c2abf2f9-4155-46a8-85b2-2a1e79deeb67@github.com> On Tue, 6 Dec 2022 14:24:39 GMT, Quan Anh Mai wrote: >> This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and doing a `CmpI` to get back the flags. As a result, `VectorMask::alltrue` is compiled into the machine code: >> >> vptest xmm0, xmm1 >> jb if_true >> if_false: >> >> instead of: >> >> vptest xmm0, xmm1 >> setb r10 >> movzbl r10 >> testl r10 >> jne if_true >> if_false: >> >> The results of `jdk.incubator.vector.ArrayMismatchBenchmark` show noticeable improvements: >> >> Before After >> Benchmark Prefix Size Mode Cnt Score Error Score Error Units Change >> ArrayMismatchBenchmark.mismatchVectorByte 0.5 9 thrpt 10 217345.383 ± 8316.444 222279.381 ± 2660.983 ops/ms +2.3% >> ArrayMismatchBenchmark.mismatchVectorByte 0.5 257 thrpt 10 113918.406 ± 1618.836 116268.691 ± 1291.899 ops/ms +2.1% >> ArrayMismatchBenchmark.mismatchVectorByte 0.5 100000 thrpt 10 702.066 ± 72.862 797.806 ± 16.429 ops/ms +13.6% >> ArrayMismatchBenchmark.mismatchVectorByte 1.0 9 thrpt 10 146096.564 ± 2401.258 145338.910 ± 687.453 ops/ms -0.5% >> ArrayMismatchBenchmark.mismatchVectorByte 1.0 257 thrpt 10 60598.181 ± 1259.397 69041.519 ± 1073.156 ops/ms +13.9% >> ArrayMismatchBenchmark.mismatchVectorByte 1.0 100000 thrpt 10 316.814 ± 10.975 408.770 ± 5.281 ops/ms +29.0% >> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 9 thrpt 10 195674.549 ± 1200.166 188482.433 ±
1872.076 ops/ms -3.7% >> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 257 thrpt 10 44357.169 ± 473.013 42293.411 ± 2838.255 ops/ms -4.7% >> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 100000 thrpt 10 68.199 ± 5.410 67.628 ± 3.241 ops/ms -0.8% >> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 9 thrpt 10 107722.450 ± 1677.607 111060.400 ± 982.230 ops/ms +3.1% >> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 257 thrpt 10 16692.645 ± 1002.599 21440.506 ± 1618.266 ops/ms +28.4% >> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 100000 thrpt 10 32.984 ± 0.548 33.202 ± 2.365 ops/ms +0.7% >> ArrayMismatchBenchmark.mismatchVectorInt 0.5 9 thrpt 10 335458.217 ± 3154.842 379944.254 ± 5703.134 ops/ms +13.3% >> ArrayMismatchBenchmark.mismatchVectorInt 0.5 257 thrpt 10 58505.302 ± 786.312 56721.368 ± 2497.052 ops/ms -3.0% >> ArrayMismatchBenchmark.mismatchVectorInt 0.5 100000 thrpt 10 133.037 ± 11.415 139.537 ± 4.667 ops/ms +4.9% >> ArrayMismatchBenchmark.mismatchVectorInt 1.0 9 thrpt 10 117943.802 ± 2281.349 112409.365 ± 2110.055 ops/ms -4.7% >> ArrayMismatchBenchmark.mismatchVectorInt 1.0 257 thrpt 10 27060.015 ± 795.619 33756.613 ± 826.533 ops/ms +24.7% >> ArrayMismatchBenchmark.mismatchVectorInt 1.0 100000 thrpt 10 57.558 ± 8.927 66.951 ± 4.381 ops/ms +16.3% >> ArrayMismatchBenchmark.mismatchVectorLong 0.5 9 thrpt 10 182963.715 ± 1042.497 182438.405 ± 2120.832 ops/ms -0.3% >> ArrayMismatchBenchmark.mismatchVectorLong 0.5 257 thrpt 10 36672.215 ± 614.821 35397.398 ± 1609.235 ops/ms -3.5% >> ArrayMismatchBenchmark.mismatchVectorLong 0.5 100000 thrpt 10 66.438 ± 2.142 65.427 ± 2.270 ops/ms -1.5% >> ArrayMismatchBenchmark.mismatchVectorLong 1.0 9 thrpt 10 110393.047 ± 497.853 115165.845 ± 5381.674 ops/ms +4.3% >> ArrayMismatchBenchmark.mismatchVectorLong 1.0 257 thrpt 10 14720.765 ± 661.350 19871.096 ± 201.464 ops/ms +35.0% >> ArrayMismatchBenchmark.mismatchVectorLong 1.0 100000 thrpt 10 30.760 ± 0.821 31.933 ±
1.352 ops/ms +3.8% >> >> I have not been able to conduct thorough testing on AVX512 and AArch64, so any help would be invaluable. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > fix test Regular testing passed. I am waiting for performance results. ------------- PR: https://git.openjdk.org/jdk/pull/9855 From xgong at openjdk.org Wed Dec 7 01:59:54 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 7 Dec 2022 01:59:54 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v2] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 02:09:59 GMT, Smita Kamath wrote: >> Hi All, >> >> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 APIs. >> Following are the performance numbers of JMH micro Fp16ConversionBenchmark: >> Before code changes: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms >> >> After: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms >> >> Kindly review and share your feedback. >> >> Thanks.
>> Smita > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Updated code as per review comments src/hotspot/share/opto/vectornode.cpp line 263: > 261: return Op_SignumVD; > 262: case Op_ConvHF2F: > 263: return Op_VectorCastH2F; Could we use the same naming style as the scalar op, i.e. `"Op_VectorCastHF2F"`? src/hotspot/share/opto/vectornode.cpp line 265: > 263: return Op_VectorCastH2F; > 264: case Op_ConvF2HF: > 265: return Op_VectorCastF2H; Same as for "Op_VectorCastHF2F". ------------- PR: https://git.openjdk.org/jdk/pull/11471 From xgong at openjdk.org Wed Dec 7 02:09:09 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 7 Dec 2022 02:09:09 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v7] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 00:46:44 GMT, Smita Kamath wrote: >> Hi All, >> >> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 APIs. >> Following are the performance numbers of JMH micro Fp16ConversionBenchmark: >> Before code changes: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms >> >> After: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ±
483.034 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms >> >> Kindly review and share your feedback. >> >> Thanks. >> Smita > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Updated test case test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java line 27: > 25: * @test > 26: * @bug 8294588 > 27: * @summary Auto-vectorize Float.floatToFloat16, Float.float16ToFloat API's Change `API's` to `APIs`? test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java line 55: > 53: @Test > 54: public void test_float_float16(short[] sout, float[] finp) { > 55: for (int i = 0; i < finp.length; i+=1) { `i+=1` => `i++`? ------------- PR: https://git.openjdk.org/jdk/pull/11471 From kvn at openjdk.org Wed Dec 7 02:36:04 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 7 Dec 2022 02:36:04 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v12] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 18:46:12 GMT, Roman Kennke wrote: >> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: one for checking lock-stack size in the method prologue, one for handling lock failures (both for fast-locking), and another one for the load-klass slow path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. >> >> Testing: >> - [x] tier1 (x86_64, x86_32, aarch64) >> - [x] tier2 (x86_64, x86_32, aarch64) >> - [x] tier3 (x86_64, x86_32, aarch64) > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > Fix copyrights My testing passed. ------------- Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.org/jdk/pull/11188 From kvn at openjdk.org Wed Dec 7 02:43:15 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 7 Dec 2022 02:43:15 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v2] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 01:55:23 GMT, Xiaohong Gong wrote: >> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: >> >> Updated code as per review comments > > src/hotspot/share/opto/vectornode.cpp line 263: > >> 261: return Op_SignumVD; >> 262: case Op_ConvHF2F: >> 263: return Op_VectorCastH2F; > > Could we use the same naming style as the scalar op, i.e. `"Op_VectorCastHF2F"`? I support this suggestion. Let's be consistent. ------------- PR: https://git.openjdk.org/jdk/pull/11471 From kvn at openjdk.org Wed Dec 7 02:45:04 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 7 Dec 2022 02:45:04 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v7] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 00:46:44 GMT, Smita Kamath wrote: >> Hi All, >> >> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 APIs. >> Following are the performance numbers of JMH micro Fp16ConversionBenchmark: >> Before code changes: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms >> >> After: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ±
372.327 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms >> >> Kindly review and share your feedback. >> >> Thanks. >> Smita > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Updated test case @XiaohongGong's suggestions should not affect the results of my testing, so I will not restart it. ------------- PR: https://git.openjdk.org/jdk/pull/11471 From yyang at openjdk.org Wed Dec 7 03:11:12 2022 From: yyang at openjdk.org (Yi Yang) Date: Wed, 7 Dec 2022 03:11:12 GMT Subject: Integrated: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced In-Reply-To: References: Message-ID: On Sun, 31 Jul 2022 09:28:59 GMT, Yi Yang wrote: > Hi, can I have a review for this patch? [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) recognized the form of `Phi->CastII->AddI` as additional parallel induction variables.
In the following program: > > class Test { > static int dontInline() { > return 0; > } > > static long test(int val, boolean b) { > long ret = 0; > long dArr[] = new long[100]; > for (int i = 15; 293 > i; ++i) { > ret = val; > int j = 1; > while (++j < 6) { > int k = (val--); > for (long l = i; 1 > l; ) { > if (k != 0) { > ret += dontInline(); > } > } > if (b) { > break; > } > } > } > return ret; > } > > public static void main(String[] args) { > for (int i = 0; i < 1000; i++) { > test(0, false); > } > } > } > > `val` is incorrectly matched with the new parallel IV form: > ![image](https://user-images.githubusercontent.com/5010047/182059398-fc5204bc-8d95-4e3e-8c66-15776af457b8.png) > And C2 further replaces it with newly added nodes, which finally leads to the crash: > ![image](https://user-images.githubusercontent.com/5010047/182059498-13148d46-b10f-4e18-b84a-f6b9f626ac7b.png) > > I think we can add more constraints to the new form. The form of `Phi->CastXX->AddX` appears when using Preconditions.checkIndex, and it would be recognized as an additional IV when 1) Phi != phi2, and 2) CastXX is controlled by a RangeCheck (to reflect changes in the Preconditions.checkIndex intrinsic). This pull request has now been integrated.
Changeset: acf96c64 Author: Yi Yang URL: https://git.openjdk.org/jdk/commit/acf96c64b750b1a7badbb2cd1c7021dad36aae1e Stats: 111 lines in 2 files changed: 105 ins; 0 del; 6 mod 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced Reviewed-by: kvn, thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/9695 From fgao at openjdk.org Wed Dec 7 03:56:00 2022 From: fgao at openjdk.org (Fei Gao) Date: Wed, 7 Dec 2022 03:56:00 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v7] In-Reply-To: References: Message-ID: <_fHUW-Q9BTgus_wCaVJ1_lVjB_rrf7jrUDdgPT-LYP8=.b68b8213-3c34-4e59-b7d9-0ab07f5000e3@github.com> On Wed, 7 Dec 2022 00:46:44 GMT, Smita Kamath wrote: >> Hi All, >> >> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. >> Following are the performance numbers of JMH micro Fp16ConversionBenchmark: >> Before code changes: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms >> >> After: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms >> >> Kindly review and share your feedback. >> >> Thanks.
>> Smita > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Updated test case src/hotspot/share/opto/vectornode.cpp line 276: > 274: > 275: default: > 276: assert(!VectorNode::is_convert_opcode(sopc), Hi @smita-kamath, you may need to pay attention to the default case here, because the patch also adds the new opcodes to the function `is_convert_opcode()` below. BTW, superword has another specialized function for cast nodes, namely `VectorCastNode::opcode()`. The patch adds the new opcodes to `is_convert_opcode()`, which directs the code path to `VectorCastNode::implemented()` and then `VectorCastNode::opcode()` when it determines whether the vector opcode is implemented, see https://github.com/openjdk/jdk/blob/ce896731d38866c2bf99cd49525062e150d94160/src/hotspot/share/opto/superword.cpp#L2072. I suppose there is no problem here. But the patch doesn't specially handle the new opcodes in these two functions. So, in fact, it determines whether the platform supports `Op_VectorCastS2X` or `Op_VectorCastF2X` rather than `Op_VectorCastH2F` or `Op_VectorCastF2H`. Coincidentally, in the vector node generation stage, namely `SuperWord::output()`, the patch handles the new opcodes before the branch for `is_convert_opcode()` and calls `VectorNode::opcode()`, thus generating the right vector nodes. The vectorization then succeeds. So I'm afraid that there may be failures on platforms that support `Op_VectorCastS2X` or `Op_VectorCastF2X` but do not support `Op_VectorCastH2F` or `Op_VectorCastF2H`. How about keeping these two stages, namely `SuperWord::implemented()` and `SuperWord::output()`, consistent? My suggestion is to keep this uniform with the other cast nodes. Thanks.
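The concern above can be illustrated with a small stand-alone Java toy model. This is only a sketch and not HotSpot code: the class `CastSupportSketch`, its enums, and its methods are hypothetical names invented for this illustration, and the mapping from a conversion to a "generic cast" opcode keyed on the input element type is an assumption about the shape of the problem, not a description of `VectorCastNode::opcode()`. It shows how a support check that consults one opcode while node generation emits another can accept a conversion the platform cannot actually emit.

```java
import java.util.EnumSet;
import java.util.Set;

// Toy model only, not HotSpot code. All names below are hypothetical.
final class CastSupportSketch {
    enum ScalarOp { CONV_HF2F, CONV_F2HF }
    enum VectorOp { CAST_S2X, CAST_F2X, CAST_HF2F, CAST_F2HF }

    // Generic cast lookup keyed on the input element type (assumption:
    // a float16 value is held in a short, so HF2F looks like a short cast).
    static VectorOp genericCastOpcode(ScalarOp op) {
        switch (op) {
            case CONV_HF2F: return VectorOp.CAST_S2X;
            case CONV_F2HF: return VectorOp.CAST_F2X;
            default: throw new IllegalArgumentException("unexpected op: " + op);
        }
    }

    // Dedicated lookup: the opcode that node generation really emits.
    static VectorOp dedicatedOpcode(ScalarOp op) {
        switch (op) {
            case CONV_HF2F: return VectorOp.CAST_HF2F;
            case CONV_F2HF: return VectorOp.CAST_F2HF;
            default: throw new IllegalArgumentException("unexpected op: " + op);
        }
    }

    // Inconsistent check: asks about the generic cast opcode.
    static boolean implementedViaGenericCast(ScalarOp op, Set<VectorOp> supported) {
        return supported.contains(genericCastOpcode(op));
    }

    // Consistent check: asks about the opcode that output() will emit.
    static boolean implementedViaDedicatedOpcode(ScalarOp op, Set<VectorOp> supported) {
        return supported.contains(dedicatedOpcode(op));
    }

    // Node generation stage in this toy model.
    static VectorOp output(ScalarOp op) {
        return dedicatedOpcode(op);
    }

    public static void main(String[] args) {
        // A platform with the generic casts but without the float16 casts.
        Set<VectorOp> platform = EnumSet.of(VectorOp.CAST_S2X, VectorOp.CAST_F2X);
        for (ScalarOp op : ScalarOp.values()) {
            System.out.println(op
                + ": generic check=" + implementedViaGenericCast(op, platform)
                + ", dedicated check=" + implementedViaDedicatedOpcode(op, platform)
                + ", emitted node supported=" + platform.contains(output(op)));
        }
    }
}
```

In this model the generic check accepts both conversions even though the emitted node is not in the supported set, while the dedicated check rejects them. Using the same lookup in both stages is what keeps the support query and the generated node in agreement.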
------------- PR: https://git.openjdk.org/jdk/pull/11471 From kvn at openjdk.org Wed Dec 7 05:22:44 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 7 Dec 2022 05:22:44 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v7] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 00:46:44 GMT, Smita Kamath wrote: >> Hi All, >> >> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 APIs. >> Following are the performance numbers of JMH micro Fp16ConversionBenchmark: >> Before code changes: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms >> >> After: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms >> >> Kindly review and share your feedback. >> >> Thanks. >> Smita > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Updated test case The new test failed when run with `-XX:UseAVX=1`. I added the output to the RFE in a comment. - counts: Graph contains wrong number of nodes: * Constraint 1: "(\\d+(\\s){2}(VectorCastH2F.*)+(\\s){2}===.*)" - Failed comparison: [found] 0 > 0 [given] - No nodes matched!
------------- PR: https://git.openjdk.org/jdk/pull/11471 From roland at openjdk.org Wed Dec 7 10:16:33 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 7 Dec 2022 10:16:33 GMT Subject: RFR: 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node [v3] In-Reply-To: References: Message-ID: <_0gqhGLkwDMgxN1GhRorU6QiqAEROlAQNlNpG2UYv74=.77780dff-2852-4d8e-8467-f57af4d55544@github.com> > A main loop loses its pre loop. The Opaque1 node for the zero trip > guard of the main loop is assigned control at a Region through which > an If is split. As a result, the Opaque1 is cloned and the zero trip > guard takes a Phi that merges Opaque1 nodes. One of the branches dies > next and, as a result, the zero trip guard has an Opaque1 as input but > at the wrong CmpI input. The assert fires next. > > The fix I propose is that if an Opaque1 node that is part of a zero > trip guard is encountered during split if, rather than split if up or > down, instead, assign it the control of the zero trip guard's > control. This way the pattern of the zero trip guard is unaffected and > split if can proceed. I believe it's safe to assign it a later > control: > > - an Opaque1 can't be shared > > - the zero trip guard can't be the If that's being split > > As Vladimir noted, this bug used to not reproduce with loop strip > mining disabled but now always reproduces because the loop > strip mining nest is always constructed. The reason is that the > main loop in this test is kept alive by the LSM safepoint. If the > LSM loop nest is not constructed, the loop is optimized out. I > filed: > > https://bugs.openjdk.org/browse/JDK-8297724 > > for this issue.
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/subnode.cpp Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11391/files - new: https://git.openjdk.org/jdk/pull/11391/files/26a002f5..953958ad Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11391&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11391&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11391.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11391/head:pull/11391 PR: https://git.openjdk.org/jdk/pull/11391 From roland at openjdk.org Wed Dec 7 10:28:37 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 7 Dec 2022 10:28:37 GMT Subject: RFR: 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node [v4] In-Reply-To: References: Message-ID: > A main loop loses its pre loop. The Opaque1 node for the zero trip > guard of the main loop is assigned control at a Region through which > an If is split. As a result, the Opaque1 is cloned and the zero trip > guard takes a Phi that merges Opaque1 nodes. One of the branches dies > next and, as a result, the zero trip guard has an Opaque1 as input but > at the wrong CmpI input. The assert fires next. > > The fix I propose is that if an Opaque1 node that is part of a zero > trip guard is encountered during split if, rather than split if up or > down, instead, assign it the control of the zero trip guard's > control. This way the pattern of the zero trip guard is unaffected and > split if can proceed. I believe it's safe to assign it a later > control: > > - an Opaque1 can't be shared > > - the zero trip guard can't be the If that's being split > > As Vladimir noted, this bug used to not reproduce with loop strip > mining disabled but now always reproduces because the loop > strip mining nest is always constructed.
The reason is that the > main loop in this test is kept alive by the LSM safepoint. If the > LSM loop nest is not constructed, the loop is optimized out. I > filed: > > https://bugs.openjdk.org/browse/JDK-8297724 > > for this issue. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits: - Merge branch 'master' into JDK-8269820 - Update src/hotspot/share/opto/subnode.cpp Co-authored-by: Tobias Hartmann - more - more - review - more - test - more - fix ------------- Changes: https://git.openjdk.org/jdk/pull/11391/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11391&range=03 Stats: 94 lines in 8 files changed: 87 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/11391.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11391/head:pull/11391 PR: https://git.openjdk.org/jdk/pull/11391 From roland at openjdk.org Wed Dec 7 10:28:39 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 7 Dec 2022 10:28:39 GMT Subject: RFR: 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node [v2] In-Reply-To: References: Message-ID: On Thu, 1 Dec 2022 16:31:55 GMT, Vladimir Kozlov wrote: >> Roland Westrelin has updated the pull request incrementally with three additional commits since the last revision: >> >> - more >> - more >> - review > > Looks nice! And Tobias's testing results also looks good so far (only known failures). @vnkozlov @TobiHartmann @chhagedorn thanks for the reviews. 
------------- PR: https://git.openjdk.org/jdk/pull/11391 From roland at openjdk.org Wed Dec 7 10:31:24 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 7 Dec 2022 10:31:24 GMT Subject: RFR: 8297642: PhaseIdealLoop::only_has_infinite_loops must detect all loops that never lead to termination In-Reply-To: References: Message-ID: On Fri, 2 Dec 2022 08:11:06 GMT, Emanuel Peter wrote: > **Will hold this back until JDK21**, unless we decide it is a regression-fix for [JDK-8294217](https://bugs.openjdk.org/browse/JDK-8294217). The problem is only a not-quite-correct assert. But the problem is not limited to infinite loops, as the example below shows it can happen with reducible loops. > > **Background:** > We have an assert that checks that `has_loops` is true when it should be. If we have `has_loops == false` even though there are loops, we will not perform loop-opts in `Compile::Optimize`. > > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/loopnode.cpp#L4285-L4293 > > Generally, we want to verify, that if we just found loops (`_ltree_root->_child != NULL`) that `has_loops == true`. > There are a few cases where we do not care if we miss loop-opts: > - We only have infinite loops (`only_has_infinite_loops()`). Infinite loops never terminate anyway, so why make them faster? Plus, a loop is only infinite if it has no loop-exit other than a `NeverBranch` exit, even uncommon traps, loop-limit checks etc are exits. Thus, if a loop does anything interesting, it probably is not such a "true infinite loop". They can be more easily forced to occur by setting `-XX:PerMethodTrapLimit=0`. > - We have only exception edges. > > Note that once we check the assert, we update `has_loops`. So if all loops disappeared, we avoid doing loop-opts henceforth. 
> > **Current implementation of PhaseIdealLoop::only_has_infinite_loops** > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/loopnode.cpp#L4183-L4185 > > We check for loop exits; if there is one, the loop should not be infinite. > > **The Problem** > > An infinite loop can have an inner loop that subsequently loses its exit. It becomes its own infinite loop, and floats out of the outer loop. Where the outer loop enters into the former inner loop, we now have a loop-exit for the outer loop. The next time we run `build_loop_tree` and check the assert, it can fail, as `PhaseIdealLoop::only_has_infinite_loops` finds that new loop-exit from outer to inner loop. > > Example: `TestOnlyInfiniteLoops::test_simple` > > Nested infinite loop before loop-opts: > > > After `build_loop_tree`, the outer loop is detected as infinite, and `NeverBranch` is inserted. No loop is attached to loop-tree, as we do not attach newly discovered infinite loops. We will set `has_loops == false` after first loop-opts round. > > > During IGVN of first loop-opts round, some edges die. `88 IfTrue` is dominated by `52 IfTrue` (dominator info only becomes present during loop-opts). The outer loop now exits into the inner loop. > > > The second loop-opts round detects the former inner loop as an infinite loop, inserts NeverBranch. Once we run the assert, we see that we have `has_loops == false`, but `PhaseIdealLoop::only_has_infinite_loops` finds the former outer loop's exit. > > > **Solution** > If we ever only have infinite loops, then there will never be a way to get from any of those loops down to Root, except through a NeverBranch exit. So even if such an (outer) infinite loop ever has an exit, that exit cannot ever lead to Root, other than a NeverBranch exit. Thus, it is ok to still consider that loop as "infinite", even though it itself has an exit - that exit will never lead to termination.
> Thus, I changed the `PhaseIdealLoop::only_has_infinite_loops` to check if any of the loops ever connect down to Root, except through NeverBranch nodes. > > **Alternative Fix** > An alternative idea to my fix here: just replace the infinite loop with a uncommon trap, and if the infinite loop is ever hit revert back to the interpreter. If we do not care to optimize infinite loops, then why even compile them? > Advantages of that idea: No need for `NeverBranch`, no need for special-casing infinite loops. > > I have another bug where assumptions are not true, because of infinite loops, and especially infinite loops not being attached to the loop-tree [JDK-8296318](https://bugs.openjdk.org/browse/JDK-8296318) > > I'm looking forward to your feedback, > Emanuel Looks reasonable to me. ------------- Marked as reviewed by roland (Reviewer). PR: https://git.openjdk.org/jdk/pull/11473 From roland at openjdk.org Wed Dec 7 10:37:20 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 7 Dec 2022 10:37:20 GMT Subject: RFR: 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node [v5] In-Reply-To: References: Message-ID: > A main loop loses its pre loop. The Opaque1 node for the zero trip > guard of the main loop is assigned control at a Region through which > an If is split. As a result, the Opaque1 is cloned and the zero trip > guard takes a Phi that merges Opaque1 nodes. One of the branch dies > next and as, a result, the zero trip guard has an Opaque1 as input but > at the wrong CmpI input. The assert fires next. > > The fix I propose is that if an Opaque1 node that is part of a zero > trip guard is encountered during split if, rather than split if up or > down, instead, assign it the control of the zero trip guard's > control. This way the pattern of the zero trip guard is unaffected and > split if can proceed. 
I believe it's safe to assign it a later > control: > > - an Opaque1 can't be shared > > - the zero trip guard can't be the If that's being split > > As Vladimir noted, this bug used to not reproduce with loop strip > mining disabled but now always reproduces because the loop > strip mining nest is always constructed. The reason is that the > main loop in this test is kept alive by the LSM safepoint. If the > LSM loop nest is not constructed, the loop is optimized out. I > filed: > > https://bugs.openjdk.org/browse/JDK-8297724 > > for this issue. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: more merge ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11391/files - new: https://git.openjdk.org/jdk/pull/11391/files/43f0b53d..4d712f10 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11391&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11391&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11391.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11391/head:pull/11391 PR: https://git.openjdk.org/jdk/pull/11391 From chagedorn at openjdk.org Wed Dec 7 12:14:12 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 7 Dec 2022 12:14:12 GMT Subject: RFR: 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph [v2] In-Reply-To: <2pH3MQY6NAKPVlJBBwErxpjLM_ihnrKu7Z89fC_vd3E=.9868d268-55dd-4994-9103-3704e62a7755@github.com> References: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com> <2pH3MQY6NAKPVlJBBwErxpjLM_ihnrKu7Z89fC_vd3E=.9868d268-55dd-4994-9103-3704e62a7755@github.com> Message-ID: On Tue, 6 Dec 2022 10:25:35 GMT, Christian Hagedorn wrote: >> The test cases of this bug reveal the same problem with `PhaseIdealLoop::create_new_if_for_predicate()` as in 
[JDK-8271954](https://bugs.openjdk.org/browse/JDK-8271954) (reusing pinned data nodes for different UCT paths into the UCT phi - see PR description of https://github.com/openjdk/jdk/pull/5185). While JDK-8271954 only fixed the usage of `PhaseIdealLoop::create_new_if_for_predicate()` for one specific case in loop unswitching, we now need this fix for other usages of `PhaseIdealLoop::create_new_if_for_predicate()` as well. I've found failing cases for most of the usages but I think we should always do a proper cloning as originally added with JDK-8271954. This is what a propose with this patch. >> >> To always do this cloning, I've replaced the `UnswitchingAction` by a `rewire_uncommon_proj_phi_inputs` bool that is false by default but can be set if we should only do a rewiring (we can still do the rewiring for the slow loop as previously done with `UnswitchingAction::SlowLoopRewiring`, also see https://github.com/openjdk/jdk/pull/5185 for more details). However, the current fix does not update ctrl for the slow loop nodes - I've fixed that. >> >> I've reused the implementation of `clone_data_nodes_for_fast_loop()` but refactored and split that method into multiple methods to reuse some of them for the ctrl update of the slow loop nodes. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Add ResourceMark > - Merge branch 'master' into JDK-8290850 > - Fix whitespaces > - 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph Thanks Vladimir for your review! 
------------- PR: https://git.openjdk.org/jdk/pull/11452 From roland at openjdk.org Wed Dec 7 14:19:22 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 7 Dec 2022 14:19:22 GMT Subject: Integrated: 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node In-Reply-To: References: Message-ID: On Mon, 28 Nov 2022 14:02:50 GMT, Roland Westrelin wrote: > A main loop loses its pre loop. The Opaque1 node for the zero trip > guard of the main loop is assigned control at a Region through which > an If is split. As a result, the Opaque1 is cloned and the zero trip > guard takes a Phi that merges Opaque1 nodes. One of the branches dies > next and, as a result, the zero trip guard has an Opaque1 as input but > at the wrong CmpI input. The assert fires next. > > The fix I propose is that if an Opaque1 node that is part of a zero > trip guard is encountered during split if, rather than split if up or > down, instead, assign it the zero trip guard's > control. This way the pattern of the zero trip guard is unaffected and > split if can proceed. I believe it's safe to assign it a later > control: > > - an Opaque1 can't be shared > > - the zero trip guard can't be the If that's being split > > As Vladimir noted, this bug used to not reproduce with loop strip > mining disabled but now always reproduces because the loop > strip mining nest is always constructed. The reason is that the > main loop in this test is kept alive by the LSM safepoint. If the > LSM loop nest is not constructed, the loop is optimized out. I > filed: > > https://bugs.openjdk.org/browse/JDK-8297724 > > for this issue. This pull request has now been integrated.
Changeset: 86270e30 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/86270e3068d3b2e80710227ae2dc79719df35788 Stats: 93 lines in 8 files changed: 86 ins; 0 del; 7 mod 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node Reviewed-by: kvn, thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/11391 From thartmann at openjdk.org Wed Dec 7 15:15:08 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 7 Dec 2022 15:15:08 GMT Subject: RFR: 8298272: Clean up ProblemList Message-ID: Removed two entries from the problem list that refer to issues that were fixed/closed. Tests are running. Thanks, Tobias ------------- Commit messages: - 8298272: Clean up ProblemList Changes: https://git.openjdk.org/jdk/pull/11561/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11561&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298272 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11561.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11561/head:pull/11561 PR: https://git.openjdk.org/jdk/pull/11561 From epeter at openjdk.org Wed Dec 7 15:45:32 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 7 Dec 2022 15:45:32 GMT Subject: RFR: 8257197: Add additional verification code to PhaseCCP [v2] In-Reply-To: References: Message-ID: <2h0U6kSKBCIV-w0Pt8Y0DO4ZuOLPqoj-LicekKn5cfo=.7973ce14-7ea0-479d-8a8f-6848d972fca4@github.com> > **Targeted for JDK-21.** > > We have had many bugs that could be traced back to missing optimizations during PhaseCCP. > Often the problem is that a node `x` has an input (of input of input) `y` which is modified, but `y` does not notify `x` (does not push it to the worklist). For one, this is a missed opportunity to optimize, but it can also lead to failed assumptions later: often we assume that everything that can be optimized is already optimized.
> This verification helps us debug faster, and can also help when `Value` optimizations suddenly do further-reaching traversals, which would require further-reaching notifications. > > Sadly, the verification is not total: we have some exceptions. In particular, a `LoadNode` performs a walk up its memory inputs, which can go arbitrarily far, to look for related `StoreNode`s. Notification would thus have to put very many nodes on the worklist. The question is whether the potential additional optimization is worth the compile time. If we decide yes, then we might want to implement a listener-style notification: when a node visits inputs during `Value`, it could subscribe to all (or the relevant) visited input nodes for future updates. Currently, we mostly do fixed 1-hop or 2-hop notification of output nodes. > > FYI: I plan to do a similar verification, and a refactoring of `PhaseCCP::push_child_nodes_to_worklist` and `PhaseIterGVN::add_users_to_worklist` in [JDK-8298094](https://bugs.openjdk.org/browse/JDK-8298094).
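Editorial sketch of the missed-notification problem and the verification pass described in the quoted PR text above. This is a toy model, not C2 code: `ToyNode`, `ToyCcp`, `Kind::Fixed`, `Kind::Deep` and the `notify_hops` parameter are hypothetical, plain integers stand in for the type lattice (0 plays the role of "not yet known"), and the worklist order is chosen by hand to expose the problem. A `Deep` node computes its type from the input of its input, so a 1-hop notification can leave it stale.

```cpp
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical node kinds: Fixed ignores its input, Deep reads the type
// of its input's input (a 2-hop dependency).
enum class Kind { Fixed, Deep };

struct ToyNode {
  Kind kind;
  int payload;  // type produced by a Fixed node
  int input;    // index of the single input, or -1 if none
};

struct ToyCcp {
  std::vector<ToyNode> nodes;
  std::vector<int> types;                       // current type per node
  std::vector<std::vector<std::size_t>> users;  // output nodes per node

  explicit ToyCcp(std::vector<ToyNode> ns)
      : nodes(std::move(ns)), types(nodes.size(), 0), users(nodes.size()) {
    for (std::size_t i = 0; i < nodes.size(); ++i) {
      if (nodes[i].input >= 0) {
        users[static_cast<std::size_t>(nodes[i].input)].push_back(i);
      }
    }
  }

  // Stand-in for Value(): recompute the type of node n from current types.
  int value(std::size_t n) const {
    const ToyNode& node = nodes[n];
    if (node.kind == Kind::Fixed) {
      return node.payload;
    }
    const ToyNode& in = nodes[static_cast<std::size_t>(node.input)];
    return types[static_cast<std::size_t>(in.input)];
  }

  // Worklist loop: when a type changes, notify users up to notify_hops away.
  void run(const std::vector<std::size_t>& initial_order, int notify_hops) {
    std::deque<std::size_t> worklist(initial_order.begin(), initial_order.end());
    std::vector<bool> queued(nodes.size(), false);
    for (std::size_t n : initial_order) queued[n] = true;
    auto push = [&](std::size_t n) {
      if (!queued[n]) {
        queued[n] = true;
        worklist.push_back(n);
      }
    };
    while (!worklist.empty()) {
      std::size_t n = worklist.front();
      worklist.pop_front();
      queued[n] = false;
      int t = value(n);
      if (t == types[n]) continue;  // no change, nobody to notify
      types[n] = t;
      for (std::size_t u : users[n]) {
        push(u);
        if (notify_hops >= 2) {
          for (std::size_t uu : users[u]) push(uu);
        }
      }
    }
  }

  // Verification: every node whose recomputed type differs was missed.
  std::vector<std::size_t> verify() const {
    std::vector<std::size_t> stale;
    for (std::size_t n = 0; n < nodes.size(); ++n) {
      if (value(n) != types[n]) stale.push_back(n);
    }
    return stale;
  }
};
```

With the three-node graph 0: Fixed(5), 1: Fixed(1, input 0), 2: Deep(input 1) and the initial order 1, 2, 0, a 1-hop run leaves node 2 stale and `verify()` reports it, while a 2-hop run reaches the fixed point.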
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Review suggestions from Christian ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11529/files - new: https://git.openjdk.org/jdk/pull/11529/files/2e3cf86b..64569f73 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11529&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11529&range=00-01 Stats: 18 lines in 1 file changed: 3 ins; 5 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/11529.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11529/head:pull/11529 PR: https://git.openjdk.org/jdk/pull/11529 From chagedorn at openjdk.org Wed Dec 7 16:23:02 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 7 Dec 2022 16:23:02 GMT Subject: RFR: 8298272: Clean up ProblemList In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 15:03:52 GMT, Tobias Hartmann wrote: > Removed two entries from the problem list that refer to issues that were fixed/closed. Tests are running. > > Thanks, > Tobias Looks good and trivial! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11561 From tsteele at openjdk.org Wed Dec 7 16:35:23 2022 From: tsteele at openjdk.org (Tyler Steele) Date: Wed, 7 Dec 2022 16:35:23 GMT Subject: RFR: 8298225: [AIX] Disable PPC64LE continuations on AIX [v2] In-Reply-To: References: Message-ID: <4y21VRhAMSdzhCsRu0JlbRpVR-5Wyv8pMAtYC09bFkU=.88eac3fd-6de2-4564-b650-12b313af59b0@github.com> > This small change adds an import to the generated ad_ppc.cpp file to allow it to find Continuations::enabled, and sets VMContinuations to false on AIX. Tyler Steele has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
------------- Changes: - all: https://git.openjdk.org/jdk/pull/11550/files - new: https://git.openjdk.org/jdk/pull/11550/files/347a1091..347a1091 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11550&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11550&range=00-01 Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11550.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11550/head:pull/11550 PR: https://git.openjdk.org/jdk/pull/11550 From tsteele at openjdk.org Wed Dec 7 16:56:39 2022 From: tsteele at openjdk.org (Tyler Steele) Date: Wed, 7 Dec 2022 16:56:39 GMT Subject: RFR: 8298225: [AIX] Disable PPC64LE continuations on AIX [v3] In-Reply-To: References: Message-ID: > This small change adds an import to the generated ad_ppc.cpp file to allow it to find Continuations::enabled, and sets VMContinuations to false on AIX. Tyler Steele has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains four additional commits since the last revision: - Merge remote-tracking branch 'upstream/master' into build/aix/continuation-enabled - Add continuation.hpp to adlc/main.cpp - Set VMContinuations to false on AIX - Restore 5 arg constructor for SystemProcess ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11550/files - new: https://git.openjdk.org/jdk/pull/11550/files/347a1091..5fb94181 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11550&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11550&range=01-02 Stats: 9044 lines in 271 files changed: 6068 ins; 2119 del; 857 mod Patch: https://git.openjdk.org/jdk/pull/11550.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11550/head:pull/11550 PR: https://git.openjdk.org/jdk/pull/11550 From epeter at openjdk.org Wed Dec 7 17:08:23 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 7 Dec 2022 17:08:23 GMT Subject: RFR: 8257197: Add additional verification code to PhaseCCP [v3] In-Reply-To: References: Message-ID: > **Targetted for JDK-21.** > > We have had many bugs that could be tracked back to missing optimizations during PhaseCCP. > Often the problem is that a node `x` has an input (of input of input) `y` which is modified, but `y` does not notify `x` (does not push it to the worklist). For one this is a missed opportunity to optimize, but it can also lead to failed assumptions later: often we assume that all that can be optimize is already optimized. > This verification helps us debug faster, and can also help when `Value` optimizations suddenly do further-reaching traversals, which would require further-reaching notifications. > > Sadly, the verification is not total: we have some exceptions. Especially for `LoadNode`, which perform a walk up the memory inputs, which can go arbitrarily far, to look for relates `StoreNode`s. Notification would thus have to put very many nodes on the worklist. 
The question is if the potential additional optimization is worth the compile-time. If we decided yes, then one we might want to implement a listener-style notification: when a node visits inputs during `Value`, it could subscribe to all (or the relevant) visited input-nodes for future updates. Currently, we just mostly do fixed 1-hop or 2-hop notification of output nodes. > > FYI: I plan to do a similar verification, and a refactoring of `PhaseCCP::push_child_nodes_to_worklist` and `PhaseIterGVN::add_users_to_worklist` in [JDK-8298094](https://bugs.openjdk.org/browse/JDK-8298094). Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Merge branch 'master' into JDK-8257197 - Review suggestions from Christian - 8257197: Add additional verification code to PhaseCCP ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11529/files - new: https://git.openjdk.org/jdk/pull/11529/files/64569f73..db08362b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11529&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11529&range=01-02 Stats: 29005 lines in 719 files changed: 15548 ins; 8239 del; 5218 mod Patch: https://git.openjdk.org/jdk/pull/11529.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11529/head:pull/11529 PR: https://git.openjdk.org/jdk/pull/11529 From epeter at openjdk.org Wed Dec 7 17:14:16 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 7 Dec 2022 17:14:16 GMT Subject: RFR: 8257197: Add additional verification code to PhaseCCP [v3] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 12:32:30 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8257197 >> - Review suggestions from Christian >> - 8257197: Add additional verification code to PhaseCCP > > src/hotspot/share/opto/phaseX.cpp line 1829: > >> 1827: continue; // ignore long widen >> 1828: } >> 1829: } > > Could these two cases be merged by using the common super type `TypeInteger`, `lo/hi_as_long()` and `isa_integer(T_INT/T_LONG)`? refactored it > src/hotspot/share/opto/phaseX.cpp line 1839: > >> 1837: n->dump_bfs(1, 0, ""); >> 1838: tty->print_cr("Current type:"); >> 1839: told->dump_on(tty); > > Just an idea: We could think about adding a small padding like 2 whitespaces before the type dump to better emphasize it. But you can also leave it like that. will leave it as is. extra new-line should already help with readability > src/hotspot/share/opto/phaseX.cpp line 1840: > >> 1838: tty->print_cr("Current type:"); >> 1839: told->dump_on(tty); >> 1840: tty->print_cr(""); > > You can directly use: > Suggestion: > > tty->cr(); ? > src/hotspot/share/opto/phaseX.cpp line 1843: > >> 1841: tty->print_cr("Optimized type:"); >> 1842: tnew->dump_on(tty); >> 1843: tty->print_cr(""); > > I suggest to add another new line here to better separate in case we report multiple nodes. ? > src/hotspot/share/opto/phaseX.cpp line 1848: > >> 1846: } >> 1847: // If you get this assert, check if the node was notified of changes in >> 1848: // the inputs. See PhaseCCP::push_child_nodes_to_worklist > > I suggest to rephrase this to also mention the possibility that we might found another exception that we've missed before and cannot reliably handle like `Load` nodes. This could be something like: > > // If we get this assert, check why the reported nodes were not processed again in CCP. 
> // We should either make sure that these nodes are properly added back to the CCP worklist > // in PhaseCCP::push_child_nodes_to_worklist() to update their type or add an exception > // in the verification code above if that is not possible for some reason (like Load nodes). Added your text, thanks ------------- PR: https://git.openjdk.org/jdk/pull/11529 From chagedorn at openjdk.org Wed Dec 7 17:23:08 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 7 Dec 2022 17:23:08 GMT Subject: RFR: 8298301: C2: assert(main_cmp->in(2)->Opcode() == Op_Opaque1) failed: main loop has no opaque node? Message-ID: This starts to show up in our CI in various tests after [JDK-8269820](https://bugs.openjdk.org/browse/JDK-8269820) which added a new `OpaqueZeroTripGuardNode` but forgot to update an assert which still checks for `Opaque1` instead of `OpaqueZeroTripGuard`. I've fixed that with this patch. Currently running tests: - tier1-4 Thanks, Christian ------------- Commit messages: - Update comment - C2: assert(main_cmp->in(2)->Opcode() == Op_Opaque1) failed: main loop has no opaque node? Changes: https://git.openjdk.org/jdk/pull/11567/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11567&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298301 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11567.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11567/head:pull/11567 PR: https://git.openjdk.org/jdk/pull/11567 From thartmann at openjdk.org Wed Dec 7 17:23:09 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 7 Dec 2022 17:23:09 GMT Subject: RFR: 8298301: C2: assert(main_cmp->in(2)->Opcode() == Op_Opaque1) failed: main loop has no opaque node? 
In-Reply-To: References: Message-ID: <5lhh0ZNp1Mg_gSAkOyGEaEUfKJX4BMswvS1tjwFoWgM=.b1959e7a-cc5a-4e3f-905d-7a3844a6c265@github.com> On Wed, 7 Dec 2022 17:03:08 GMT, Christian Hagedorn wrote: > This starts to show up in our CI in various tests after [JDK-8269820](https://bugs.openjdk.org/browse/JDK-8269820) which added a new `OpaqueZeroTripGuardNode` but forgot to update an assert which still checks for `Opaque1` instead of `OpaqueZeroTripGuard`. I've fixed that with this patch. > > Currently running tests: > - tier1-4 > > Thanks, > Christian Marked as reviewed by thartmann (Reviewer). The fix looks good and trivial. FTR, my testing for [JDK-8269820](https://bugs.openjdk.org/browse/JDK-8269820) didn't catch this because I only executed the stress job definition for the latest update (I did run full testing for v00 though). ------------- PR: https://git.openjdk.org/jdk/pull/11567 From chagedorn at openjdk.org Wed Dec 7 17:23:10 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 7 Dec 2022 17:23:10 GMT Subject: RFR: 8298301: C2: assert(main_cmp->in(2)->Opcode() == Op_Opaque1) failed: main loop has no opaque node? In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 17:03:08 GMT, Christian Hagedorn wrote: > This starts to show up in our CI in various tests after [JDK-8269820](https://bugs.openjdk.org/browse/JDK-8269820) which added a new `OpaqueZeroTripGuardNode` but forgot to update an assert which still checks for `Opaque1` instead of `OpaqueZeroTripGuard`. I've fixed that with this patch. > > Currently running tests: > - tier1-4 > > Thanks, > Christian Thanks Tobias for the quick review! I'll try to integrate this as soon as testing is looking (mostly) good to reduce the noise in our CI. 
------------- PR: https://git.openjdk.org/jdk/pull/11567 From chagedorn at openjdk.org Wed Dec 7 17:49:27 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 7 Dec 2022 17:49:27 GMT Subject: RFR: 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph [v3] In-Reply-To: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com> References: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com> Message-ID: > The test cases of this bug reveal the same problem with `PhaseIdealLoop::create_new_if_for_predicate()` as in [JDK-8271954](https://bugs.openjdk.org/browse/JDK-8271954) (reusing pinned data nodes for different UCT paths into the UCT phi - see PR description of https://github.com/openjdk/jdk/pull/5185). While JDK-8271954 only fixed the usage of `PhaseIdealLoop::create_new_if_for_predicate()` for one specific case in loop unswitching, we now need this fix for other usages of `PhaseIdealLoop::create_new_if_for_predicate()` as well. I've found failing cases for most of the usages but I think we should always do a proper cloning as originally added with JDK-8271954. This is what a propose with this patch. > > To always do this cloning, I've replaced the `UnswitchingAction` by a `rewire_uncommon_proj_phi_inputs` bool that is false by default but can be set if we should only do a rewiring (we can still do the rewiring for the slow loop as previously done with `UnswitchingAction::SlowLoopRewiring`, also see https://github.com/openjdk/jdk/pull/5185 for more details). However, the current fix does not update ctrl for the slow loop nodes - I've fixed that. > > I've reused the implementation of `clone_data_nodes_for_fast_loop()` but refactored and split that method into multiple methods to reuse some of them for the ctrl update of the slow loop nodes. 
> > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Revert "Add ResourceMark" This reverts commit dc1074d01b4bd52740e5e0396976232f268380e5. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11452/files - new: https://git.openjdk.org/jdk/pull/11452/files/dc1074d0..f276b3d4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11452&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11452&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11452.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11452/head:pull/11452 PR: https://git.openjdk.org/jdk/pull/11452 From chagedorn at openjdk.org Wed Dec 7 17:49:31 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 7 Dec 2022 17:49:31 GMT Subject: RFR: 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph [v2] In-Reply-To: <2pH3MQY6NAKPVlJBBwErxpjLM_ihnrKu7Z89fC_vd3E=.9868d268-55dd-4994-9103-3704e62a7755@github.com> References: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com> <2pH3MQY6NAKPVlJBBwErxpjLM_ihnrKu7Z89fC_vd3E=.9868d268-55dd-4994-9103-3704e62a7755@github.com> Message-ID: <9A7awa0metjUWkNZocBxI1ymQpEOVavfgpC4_1QCwzA=.0fa346f6-e9b4-4d75-8779-c24b6cb8eb8f@github.com> On Tue, 6 Dec 2022 10:25:35 GMT, Christian Hagedorn wrote: >> The test cases of this bug reveal the same problem with `PhaseIdealLoop::create_new_if_for_predicate()` as in [JDK-8271954](https://bugs.openjdk.org/browse/JDK-8271954) (reusing pinned data nodes for different UCT paths into the UCT phi - see PR description of https://github.com/openjdk/jdk/pull/5185). 
While JDK-8271954 only fixed the usage of `PhaseIdealLoop::create_new_if_for_predicate()` for one specific case in loop unswitching, we now need this fix for other usages of `PhaseIdealLoop::create_new_if_for_predicate()` as well. I've found failing cases for most of the usages but I think we should always do a proper cloning as originally added with JDK-8271954. This is what a propose with this patch. >> >> To always do this cloning, I've replaced the `UnswitchingAction` by a `rewire_uncommon_proj_phi_inputs` bool that is false by default but can be set if we should only do a rewiring (we can still do the rewiring for the slow loop as previously done with `UnswitchingAction::SlowLoopRewiring`, also see https://github.com/openjdk/jdk/pull/5185 for more details). However, the current fix does not update ctrl for the slow loop nodes - I've fixed that. >> >> I've reused the implementation of `clone_data_nodes_for_fast_loop()` but refactored and split that method into multiple methods to reuse some of them for the ctrl update of the slow loop nodes. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Add ResourceMark > - Merge branch 'master' into JDK-8290850 > - Fix whitespaces > - 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph I've run some testing again and noticed that we run into some assertion failures where `has_ctrl()` returns true for a CFG node which is wrong. 
I've had a closer look and noticed that the newly added `ResourceMarks` are a problem: We call `set_ctrl()` for the newly cloned nodes which lets the `PhaseTransform::_nodes` `Node_List` grow: https://github.com/openjdk/jdk/blob/dd7385d1e86afe8af79587e80c5046af5c84b5cd/src/hotspot/share/opto/node.cpp#L2774-L2780 We allocate a new array in the resource area because `PhaseTransform::_nodes` was allocated with `Thread::current()->resource_area()`: https://github.com/openjdk/jdk/blob/389b8f4b788375821a8bb4b017e50f905abdad2d/src/hotspot/share/opto/phaseX.cpp#L573-L576 Once we get out of the scope of the `ResourceMark`, we release the newly allocated array for `_nodes` and we will end up reading this released memory later which results in undefined behavior. I'm not sure if we have similar problems elsewhere as we have quite some places where we use `ResourceMark`. We could think about using a separate `Arena` for `PhaseTransform` for `_nodes` and `_types`. Then we could safely use `ResourceMark` and don't need to worry about whether the `_nodes` and `_types` array get reallocated in between. I'll revert my `ResourceMark` commit and run some testing again. ------------- PR: https://git.openjdk.org/jdk/pull/11452 From kvn at openjdk.org Wed Dec 7 18:00:20 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 7 Dec 2022 18:00:20 GMT Subject: RFR: 8298272: Clean up ProblemList In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 15:03:52 GMT, Tobias Hartmann wrote: > Removed two entries from the problem list that refer to issues that were fixed/closed. Tests are running. > > Thanks, > Tobias Marked as reviewed by kvn (Reviewer). 
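Editorial sketch of the `ResourceMark` hazard that Christian Hagedorn describes for PR 11452 in the message above. It is a toy model only: `ToyArena`, `ToyMark` and `ToyList` are hypothetical stand-ins for a resource area, a scoped mark and a growable node list, and none of this is HotSpot code. The sketch assumes what the message states: a list that lives outside the mark grows while the mark is active, its new backing array is allocated above the mark, and leaving the scope releases that array while the list still points to it. To stay free of undefined behavior, the sketch never reads the released memory and only reports whether the list's storage is still inside the live part of the arena.

```cpp
#include <cstddef>
#include <cstring>
#include <new>
#include <vector>

// Hypothetical bump allocator standing in for a resource area.
class ToyArena {
 public:
  explicit ToyArena(std::size_t capacity) : storage_(capacity), top_(0) {}

  void* allocate(std::size_t bytes) {
    if (top_ + bytes > storage_.size()) throw std::bad_alloc();
    void* p = storage_.data() + top_;
    top_ += bytes;
    return p;
  }

  std::size_t top() const { return top_; }
  void rollback_to(std::size_t mark) { top_ = mark; }

  // True if p points into the part of the arena that is still allocated.
  bool is_live(const void* p) const {
    const unsigned char* c = static_cast<const unsigned char*>(p);
    return c >= storage_.data() && c < storage_.data() + top_;
  }

 private:
  std::vector<unsigned char> storage_;
  std::size_t top_;
};

// Hypothetical scoped mark: releases everything allocated after construction.
class ToyMark {
 public:
  explicit ToyMark(ToyArena& arena) : arena_(arena), saved_top_(arena.top()) {}
  ~ToyMark() { arena_.rollback_to(saved_top_); }

 private:
  ToyArena& arena_;
  std::size_t saved_top_;
};

// Hypothetical growable list whose backing array lives in the arena.
class ToyList {
 public:
  ToyList(ToyArena& arena, std::size_t capacity)
      : arena_(arena),
        data_(static_cast<int*>(arena.allocate(capacity * sizeof(int)))),
        cap_(capacity),
        len_(0) {}

  void push(int v) {
    if (len_ == cap_) grow();
    data_[len_++] = v;
  }

  const int* data() const { return data_; }

 private:
  // Growing allocates a fresh, larger array at the current arena top.
  void grow() {
    int* bigger = static_cast<int*>(arena_.allocate(2 * cap_ * sizeof(int)));
    std::memcpy(bigger, data_, len_ * sizeof(int));
    data_ = bigger;
    cap_ *= 2;
  }

  ToyArena& arena_;
  int* data_;
  std::size_t cap_;
  std::size_t len_;
};

// Returns whether the list's backing array is still live after the mark
// scope ends. Growing inside the scope leaves the list dangling.
bool list_storage_survives_mark(bool grow_inside_mark) {
  ToyArena arena(1024);
  ToyList list(arena, 2);
  list.push(1);
  list.push(2);
  {
    ToyMark mark(arena);
    if (grow_inside_mark) {
      list.push(3);  // triggers grow(): new array lands above the mark
    }
  }  // mark destructor rolls the arena back here
  return arena.is_live(list.data());
}
```

In this model the list stays valid when nothing grows inside the scope and dangles when it does, which is the kind of situation a separate arena for the long-lived lists, as suggested in the message, would avoid.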
------------- PR: https://git.openjdk.org/jdk/pull/11561 From kvn at openjdk.org Wed Dec 7 18:11:05 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 7 Dec 2022 18:11:05 GMT Subject: RFR: 8257197: Add additional verification code to PhaseCCP [v3] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 17:08:23 GMT, Emanuel Peter wrote: >> **Targetted for JDK-21.** >> >> We have had many bugs that could be tracked back to missing optimizations during PhaseCCP. >> Often the problem is that a node `x` has an input (of input of input) `y` which is modified, but `y` does not notify `x` (does not push it to the worklist). For one this is a missed opportunity to optimize, but it can also lead to failed assumptions later: often we assume that all that can be optimize is already optimized. >> This verification helps us debug faster, and can also help when `Value` optimizations suddenly do further-reaching traversals, which would require further-reaching notifications. >> >> Sadly, the verification is not total: we have some exceptions. Especially for `LoadNode`, which perform a walk up the memory inputs, which can go arbitrarily far, to look for relates `StoreNode`s. Notification would thus have to put very many nodes on the worklist. The question is if the potential additional optimization is worth the compile-time. If we decided yes, then one we might want to implement a listener-style notification: when a node visits inputs during `Value`, it could subscribe to all (or the relevant) visited input-nodes for future updates. Currently, we just mostly do fixed 1-hop or 2-hop notification of output nodes. >> >> FYI: I plan to do a similar verification, and a refactoring of `PhaseCCP::push_child_nodes_to_worklist` and `PhaseIterGVN::add_users_to_worklist` in [JDK-8298094](https://bugs.openjdk.org/browse/JDK-8298094). > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into JDK-8257197 > - Review suggestions from Christian > - 8257197: Add additional verification code to PhaseCCP Looks good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11529 From kvn at openjdk.org Wed Dec 7 18:18:00 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 7 Dec 2022 18:18:00 GMT Subject: RFR: 8298301: C2: assert(main_cmp->in(2)->Opcode() == Op_Opaque1) failed: main loop has no opaque node? In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 17:03:08 GMT, Christian Hagedorn wrote: > This starts to show up in our CI in various tests after [JDK-8269820](https://bugs.openjdk.org/browse/JDK-8269820) which added a new `OpaqueZeroTripGuardNode` but forgot to update an assert which still checks for `Opaque1` instead of `OpaqueZeroTripGuard`. I've fixed that with this patch. > > Currently running tests: > - tier1-4 > > Thanks, > Christian Looks good. One more reason to rename all Opaque nodes and find places where they are used. ------------- PR: https://git.openjdk.org/jdk/pull/11567 From chagedorn at openjdk.org Wed Dec 7 18:33:08 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 7 Dec 2022 18:33:08 GMT Subject: RFR: 8298301: C2: assert(main_cmp->in(2)->Opcode() == Op_Opaque1) failed: main loop has no opaque node? In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 17:03:08 GMT, Christian Hagedorn wrote: > This starts to show up in our CI in various tests after [JDK-8269820](https://bugs.openjdk.org/browse/JDK-8269820) which added a new `OpaqueZeroTripGuardNode` but forgot to update an assert which still checks for `Opaque1` instead of `OpaqueZeroTripGuard`. I've fixed that with this patch. > > Currently running tests: > - tier1-4 > > Thanks, > Christian Thanks Vladimir for your review! 
I totally agree with that. Testing is mostly done and looks clean. I'll integrate it. ------------- PR: https://git.openjdk.org/jdk/pull/11567 From kvn at openjdk.org Wed Dec 7 18:33:11 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 7 Dec 2022 18:33:11 GMT Subject: RFR: 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph [v3] In-Reply-To: References: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com> Message-ID: On Wed, 7 Dec 2022 17:49:27 GMT, Christian Hagedorn wrote: >> The test cases of this bug reveal the same problem with `PhaseIdealLoop::create_new_if_for_predicate()` as in [JDK-8271954](https://bugs.openjdk.org/browse/JDK-8271954) (reusing pinned data nodes for different UCT paths into the UCT phi - see PR description of https://github.com/openjdk/jdk/pull/5185). While JDK-8271954 only fixed the usage of `PhaseIdealLoop::create_new_if_for_predicate()` for one specific case in loop unswitching, we now need this fix for other usages of `PhaseIdealLoop::create_new_if_for_predicate()` as well. I've found failing cases for most of the usages but I think we should always do a proper cloning as originally added with JDK-8271954. This is what a propose with this patch. >> >> To always do this cloning, I've replaced the `UnswitchingAction` by a `rewire_uncommon_proj_phi_inputs` bool that is false by default but can be set if we should only do a rewiring (we can still do the rewiring for the slow loop as previously done with `UnswitchingAction::SlowLoopRewiring`, also see https://github.com/openjdk/jdk/pull/5185 for more details). However, the current fix does not update ctrl for the slow loop nodes - I've fixed that. >> >> I've reused the implementation of `clone_data_nodes_for_fast_loop()` but refactored and split that method into multiple methods to reuse some of them for the ctrl update of the slow loop nodes. 
>> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Revert "Add ResourceMark" > > This reverts commit dc1074d01b4bd52740e5e0396976232f268380e5. Update is good. Good thing you caught it. Yes, we have to be careful with ResourceMarks. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11452 From chagedorn at openjdk.org Wed Dec 7 18:36:59 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 7 Dec 2022 18:36:59 GMT Subject: Integrated: 8298301: C2: assert(main_cmp->in(2)->Opcode() == Op_Opaque1) failed: main loop has no opaque node? In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 17:03:08 GMT, Christian Hagedorn wrote: > This starts to show up in our CI in various tests after [JDK-8269820](https://bugs.openjdk.org/browse/JDK-8269820) which added a new `OpaqueZeroTripGuardNode` but forgot to update an assert which still checks for `Opaque1` instead of `OpaqueZeroTripGuard`. I've fixed that with this patch. > > Currently running tests: > - tier1-4 > > Thanks, > Christian This pull request has now been integrated. Changeset: e86f31b5 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/e86f31b5e71af00fea9cd989a86c1e75e3df1821 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8298301: C2: assert(main_cmp->in(2)->Opcode() == Op_Opaque1) failed: main loop has no opaque node? 
Reviewed-by: thartmann ------------- PR: https://git.openjdk.org/jdk/pull/11567 From kvn at openjdk.org Wed Dec 7 18:47:16 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 7 Dec 2022 18:47:16 GMT Subject: RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v14] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 14:24:39 GMT, Quan Anh Mai wrote: >> This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and doing a `CmpI` to get back the flags. As a result, `VectorMask::alltrue` is compiled into the machine code:
>>
>>     vptest xmm0, xmm1
>>     jb if_true
>>     if_false:
>>
>> instead of:
>>
>>     vptest xmm0, xmm1
>>     setb r10
>>     movzbl r10
>>     testl r10
>>     jne if_true
>>     if_false:
>>
>> The results of `jdk.incubator.vector.ArrayMismatchBenchmark` show noticeable improvements:
>>
>>                                                                                 Before                  After
>> Benchmark                                    Prefix    Size   Mode  Cnt       Score       Error       Score      Error   Units  Change
>> ArrayMismatchBenchmark.mismatchVectorByte       0.5       9  thrpt   10  217345.383 ±  8316.444  222279.381 ± 2660.983  ops/ms   +2.3%
>> ArrayMismatchBenchmark.mismatchVectorByte       0.5     257  thrpt   10  113918.406 ±  1618.836  116268.691 ± 1291.899  ops/ms   +2.1%
>> ArrayMismatchBenchmark.mismatchVectorByte       0.5  100000  thrpt   10     702.066 ±    72.862     797.806 ±   16.429  ops/ms  +13.6%
>> ArrayMismatchBenchmark.mismatchVectorByte       1.0       9  thrpt   10  146096.564 ±  2401.258  145338.910 ±  687.453  ops/ms   -0.5%
>> ArrayMismatchBenchmark.mismatchVectorByte       1.0     257  thrpt   10   60598.181 ±  1259.397   69041.519 ± 1073.156  ops/ms  +13.9%
>> ArrayMismatchBenchmark.mismatchVectorByte       1.0  100000  thrpt   10     316.814 ±    10.975     408.770 ±    5.281  ops/ms  +29.0%
>> ArrayMismatchBenchmark.mismatchVectorDouble     0.5       9  thrpt   10  195674.549 ±  1200.166  188482.433 ± 1872.076  ops/ms   -3.7%
>> ArrayMismatchBenchmark.mismatchVectorDouble     0.5     257  thrpt   10   44357.169 ±   473.013   42293.411 ± 2838.255  ops/ms   -4.7%
>> ArrayMismatchBenchmark.mismatchVectorDouble     0.5  100000  thrpt   10      68.199 ±     5.410      67.628 ±    3.241  ops/ms   -0.8%
>> ArrayMismatchBenchmark.mismatchVectorDouble     1.0       9  thrpt   10  107722.450 ±  1677.607  111060.400 ±  982.230  ops/ms   +3.1%
>> ArrayMismatchBenchmark.mismatchVectorDouble     1.0     257  thrpt   10   16692.645 ±  1002.599   21440.506 ± 1618.266  ops/ms  +28.4%
>> ArrayMismatchBenchmark.mismatchVectorDouble     1.0  100000  thrpt   10      32.984 ±     0.548      33.202 ±    2.365  ops/ms   +0.7%
>> ArrayMismatchBenchmark.mismatchVectorInt        0.5       9  thrpt   10  335458.217 ±  3154.842  379944.254 ± 5703.134  ops/ms  +13.3%
>> ArrayMismatchBenchmark.mismatchVectorInt        0.5     257  thrpt   10   58505.302 ±   786.312   56721.368 ± 2497.052  ops/ms   -3.0%
>> ArrayMismatchBenchmark.mismatchVectorInt        0.5  100000  thrpt   10     133.037 ±    11.415     139.537 ±    4.667  ops/ms   +4.9%
>> ArrayMismatchBenchmark.mismatchVectorInt        1.0       9  thrpt   10  117943.802 ±  2281.349  112409.365 ± 2110.055  ops/ms   -4.7%
>> ArrayMismatchBenchmark.mismatchVectorInt        1.0     257  thrpt   10   27060.015 ±   795.619   33756.613 ±  826.533  ops/ms  +24.7%
>> ArrayMismatchBenchmark.mismatchVectorInt        1.0  100000  thrpt   10      57.558 ±     8.927      66.951 ±    4.381  ops/ms  +16.3%
>> ArrayMismatchBenchmark.mismatchVectorLong       0.5       9  thrpt   10  182963.715 ±  1042.497  182438.405 ± 2120.832  ops/ms   -0.3%
>> ArrayMismatchBenchmark.mismatchVectorLong       0.5     257  thrpt   10   36672.215 ±   614.821   35397.398 ± 1609.235  ops/ms   -3.5%
>> ArrayMismatchBenchmark.mismatchVectorLong       0.5  100000  thrpt   10      66.438 ±     2.142      65.427 ±    2.270  ops/ms   -1.5%
>> ArrayMismatchBenchmark.mismatchVectorLong       1.0       9  thrpt   10  110393.047 ±   497.853  115165.845 ± 5381.674  ops/ms   +4.3%
>> ArrayMismatchBenchmark.mismatchVectorLong       1.0     257  thrpt   10   14720.765 ±   661.350   19871.096 ±  201.464  ops/ms  +35.0%
>> ArrayMismatchBenchmark.mismatchVectorLong       1.0  100000  thrpt   10      30.760 ±     0.821      31.933 ±    1.352  ops/ms   +3.8%
>>
>> I have not been able to conduct thorough testing on AVX512 and Aarch64 so any help would be invaluable. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > fix test Rerunning DaCapo and Renaissance which shows some variations. ------------- PR: https://git.openjdk.org/jdk/pull/9855 From dchuyko at openjdk.org Wed Dec 7 18:58:01 2022 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Wed, 7 Dec 2022 18:58:01 GMT Subject: RFR: JDK-8153837: AArch64: Handle special cases for MaxINode & MinINode Message-ID: This is a return to an older optimization suggestion for integer Math.min() and Math.max(). See original thread at https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2016-April/022367.html When one of the arguments is a constant 0, 1 or -1, this constant is currently materialized (for example, zero is moved to a dedicated register). Instead, we can produce these values from the ZR zero register using CSEL, CSINC or CSINV. Thus the register usage and the load instruction are removed. The implementation adds 3 additional matching rules for min and 3 for max in the aarch64.ad file. As constants can currently be in any MinI/MaxI node input, the ideal transformation for those nodes is changed to put a constant into the first input. This allows a single rule for each value instead of two. The first input is not a very natural choice; it is selected because of the https://github.com/openjdk/jdk/pull/3513 optimization that added the right-spline transformation for MinI/MaxI. I think it could be symmetrically changed to left-spline, but that was not a subject of this PR. Each match rule generates 2 'instruct' instructions following the pattern introduced in https://github.com/openjdk/jdk/commit/fde854e03779e6809279dbf85b0645eb49d8736a. They are newly added ones with one peculiarity.
They try to follow the regular naming scheme but lack the operand that corresponds to a constant that is not actually needed. E.g. 'instruct cmovI_reg_immM1_ge(iRegINoSp dst, iRegI src1, rFlagsReg cr)', which does not have an 'immI_M1' input. The new TestMinMaxIntrinsics jtreg test compares results produced by intrinsics for generic and specialized versions with a Java implementation of min and max. Intrinsics are used in lambdas that are compiled with the Whitebox API. The cases include -1, 0, 1 and a couple of regular values. This test can be used to check the generated assembly by adding `-XX:+PrintCompilation -XX:+PrintOptoAssembly`. Nano-benchmarks where a specialized version is called from a non-inlined method also show the changed code with `-prof perfasm`. A typical nano-benchmark with a loop with a Blackhole over an array shows no difference in performance, as the constant is moved out of the loop anyway and usually there are enough registers. However, special nano-benchmarks can be considered, e.g.

    @Benchmark
    @OperationsPerInvocation(TESTSIZE)
    public int max0_use8_i() {
        int sum = 0;
        for (int i = 0; i < TESTSIZE; i++) {
            use8(0, 1, 2, 3, 4, 5, 6, 7);
            sum += Math.max(i, 0);
        }
        return sum;
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void use8(int p0, int p1, int p2, int p3, int p4, int p5, int p6, int p7) {
    }

Saving ~1 L1 icache load makes it ~9% faster, and more than half of the cost is the use8() helper. The new version passes the new TestMinMaxIntrinsics test on x86 and aarch64 and tier1,2 tests on those platforms. The zero case is especially interesting. In general, I wonder how it could be possible to allocate ZR as a register for zero constants so the load there could be a no-op and we might get rid of special rules with immI0.
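
As an illustration of the idea described in this message, the following is a minimal sketch in plain Java of how min/max against the constants 0, 1 and -1 can be expressed with conditional selects whose second source is the zero register. The class name `MinMaxZrModel` and all method names are hypothetical, and the condition used for each case is an assumption chosen to be arithmetically correct; it is not taken from the patch and the patch may emit different conditions.

```java
// Hypothetical model, not HotSpot code: each helper mimics an AArch64
// conditional-select instruction whose second source operand is ZR (zero).
final class MinMaxZrModel {
    // CSEL  rd, rn, zr, cond  ->  cond ? rn : 0
    static int csel(boolean cond, int rn) { return cond ? rn : 0; }

    // CSINC rd, rn, zr, cond  ->  cond ? rn : (0 + 1)
    static int csinc(boolean cond, int rn) { return cond ? rn : (0 + 1); }

    // CSINV rd, rn, zr, cond  ->  cond ? rn : ~0, i.e. -1
    static int csinv(boolean cond, int rn) { return cond ? rn : ~0; }

    // All comparisons are against zero, so no constant has to be
    // materialized in a register. The conditions are assumptions.
    static int maxWithZero(int x)     { return csel(x > 0, x); }
    static int maxWithOne(int x)      { return csinc(x > 0, x); }
    static int maxWithMinusOne(int x) { return csinv(x >= 0, x); }
    static int minWithZero(int x)     { return csel(x < 0, x); }
    static int minWithOne(int x)      { return csinc(x <= 0, x); }
    static int minWithMinusOne(int x) { return csinv(x < 0, x); }

    static void check(int expected, int actual, String what, int x) {
        if (expected != actual) {
            throw new AssertionError(what + " mismatch for x=" + x);
        }
    }

    public static void main(String[] args) {
        int[] samples = {Integer.MIN_VALUE, -2, -1, 0, 1, 2, Integer.MAX_VALUE};
        for (int x : samples) {
            check(Math.max(x, 0),  maxWithZero(x),     "max(x, 0)",  x);
            check(Math.max(x, 1),  maxWithOne(x),      "max(x, 1)",  x);
            check(Math.max(x, -1), maxWithMinusOne(x), "max(x, -1)", x);
            check(Math.min(x, 0),  minWithZero(x),     "min(x, 0)",  x);
            check(Math.min(x, 1),  minWithOne(x),      "min(x, 1)",  x);
            check(Math.min(x, -1), minWithMinusOne(x), "min(x, -1)", x);
        }
        System.out.println("all cases agree with Math.min/Math.max");
    }
}
```

The sketch only checks the arithmetic identity behind the rules; it says nothing about which instructions C2 actually selects, which is what `-XX:+PrintOptoAssembly` is for.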
------------- Commit messages: - JDK-8153837: AArch64: Handle special cases for MaxINode & MinINode Changes: https://git.openjdk.org/jdk/pull/11570/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11570&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8153837 Stats: 300 lines in 3 files changed: 294 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/11570.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11570/head:pull/11570 PR: https://git.openjdk.org/jdk/pull/11570 From aph at openjdk.org Wed Dec 7 18:58:03 2022 From: aph at openjdk.org (Andrew Haley) Date: Wed, 7 Dec 2022 18:58:03 GMT Subject: RFR: JDK-8153837: AArch64: Handle special cases for MaxINode & MinINode In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 18:42:52 GMT, Dmitry Chuyko wrote: > This is a return to an older optimization suggestion for integer Math.min() and Math.max(). See original thread at https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2016-April/022367.html > > When one of the arguments is a constant 0, 1 or -1, this constant is currently materialized (for example, zero is moved to a dedicated register). Instead, we can produce these values from the ZR zero register using CSEL, CSINC or CSINV. Thus the register usage and the load instruction are removed. > > The implementation adds 3 additional matching rules for min and 3 for max in the aarch64.ad file. As constants can currently be in any MinI/MaxI node input, the ideal transformation for those nodes is changed to put a constant into the first input. This allows a single rule for each value instead of two. The first input is not a very natural choice; it is selected because of the https://github.com/openjdk/jdk/pull/3513 optimization that added the right-spline transformation for MinI/MaxI. I think it could be symmetrically changed to left-spline, but that was not a subject of this PR. Each match rule generates 2 'instruct' instructions following the pattern introduced in https://github.com/openjdk/jdk/commit/fde854e03779e6809279dbf85b0645eb49d8736a. They are newly added ones with one peculiarity. They try to follow the regular naming scheme but lack the operand that corresponds to a constant that is not actually needed. E.g. 'instruct cmovI_reg_immM1_ge(iRegINoSp dst, iRegI src1, rFlagsReg cr)', which does not have an 'immI_M1' input. > > The new TestMinMaxIntrinsics jtreg test compares results produced by intrinsics for generic and specialized versions with a Java implementation of min and max. Intrinsics are used in lambdas that are compiled with the Whitebox API. The cases include -1, 0, 1 and a couple of regular values. This test can be used to check the generated assembly by adding `-XX:+PrintCompilation -XX:+PrintOptoAssembly`. Nano-benchmarks where a specialized version is called from a non-inlined method also show the changed code with `-prof perfasm`. > > A typical nano-benchmark with a loop with a Blackhole over an array shows no difference in performance, as the constant is moved out of the loop anyway and usually there are enough registers. However, special nano-benchmarks can be considered, e.g.
>
>     @Benchmark
>     @OperationsPerInvocation(TESTSIZE)
>     public int max0_use8_i() {
>         int sum = 0;
>         for (int i = 0; i < TESTSIZE; i++) {
>             use8(0, 1, 2, 3, 4, 5, 6, 7);
>             sum += Math.max(i, 0);
>         }
>         return sum;
>     }
>
>     @CompilerControl(CompilerControl.Mode.DONT_INLINE)
>     public void use8(int p0, int p1, int p2, int p3, int p4, int p5, int p6, int p7) {
>     }
>
> Saving ~1 L1 icache load makes it ~9% faster, and more than half of the cost is the use8() helper. > > The new version passes the new TestMinMaxIntrinsics test on x86 and aarch64 and tier1,2 tests on those platforms. > > The zero case is especially interesting.
In general, I wonder how it could be possible to allocate ZR as a register for zero constants so the load there could be a no-op and we might get rid of special rules with immI0. src/hotspot/cpu/aarch64/aarch64.ad line 15816: > 15814: %} > 15815: %} > 15816: Please put all this repetitive stuff into aarch64_ad.m4 and we'll review that. ------------- PR: https://git.openjdk.org/jdk/pull/11570 From thartmann at openjdk.org Wed Dec 7 19:22:10 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 7 Dec 2022 19:22:10 GMT Subject: RFR: 8298272: Clean up ProblemList In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 15:03:52 GMT, Tobias Hartmann wrote: > Removed two entries from the problem list that refer to issues that were fixed/closed. Tests are running. > > Thanks, > Tobias Thanks for the reviews, Vladimir and Christian. ------------- PR: https://git.openjdk.org/jdk/pull/11561 From svkamath at openjdk.org Wed Dec 7 20:52:29 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Wed, 7 Dec 2022 20:52:29 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v8] In-Reply-To: References: Message-ID: > Hi All, > > I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. > Following are the performance numbers of JMH micro Fp16ConversionBenchmark:
> Before code changes:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms
>
> After:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms
>
> Kindly review and share your feedback. > > Thanks. > Smita Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: Addressed review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11471/files - new: https://git.openjdk.org/jdk/pull/11471/files/4b1e1270..944f4f3e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11471&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11471&range=06-07 Stats: 57 lines in 9 files changed: 19 ins; 7 del; 31 mod Patch: https://git.openjdk.org/jdk/pull/11471.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11471/head:pull/11471 PR: https://git.openjdk.org/jdk/pull/11471 From svkamath at openjdk.org Wed Dec 7 21:47:08 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Wed, 7 Dec 2022 21:47:08 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v7] In-Reply-To: References: Message-ID: <1LzEo11RP7xfy7DNfAT34UMp9XXPANPQaUZg9rTqsls=.6325ba95-af4f-46fc-8ab5-a223f2aafe83@github.com> On Wed, 7 Dec 2022 05:18:34 GMT, Vladimir Kozlov wrote: >> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: >> >> Updated test case > > New test failed when run with `-XX:UseAVX=1`. I added output to RFE in comment. > > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\\d+(\\s){2}(VectorCastH2F.*)+(\\s){2}===.*)" > - Failed comparison: [found] 0 > 0 [given] > - No nodes matched! @vnkozlov I have addressed comments from Fei Gao and Xiaohong Gong. I have limited vectorization to avx2 and higher.
If the changes look good to you, could you kindly run the tests? Thanks for all your help. ------------- PR: https://git.openjdk.org/jdk/pull/11471 From kvn at openjdk.org Wed Dec 7 22:15:55 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 7 Dec 2022 22:15:55 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v7] In-Reply-To: References: Message-ID: <9iLzRYsQr5T_DJP8agK6WVdTHs1lqf1BLZocDITGu54=.910b8cf3-7329-4299-a9e0-3ac5f94db979@github.com> On Wed, 7 Dec 2022 05:18:34 GMT, Vladimir Kozlov wrote: >> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: >> >> Updated test case > > New test failed when run with `-XX:UseAVX=1`. I added output to RFE in comment. > > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\\d+(\\s){2}(VectorCastH2F.*)+(\\s){2}===.*)" > - Failed comparison: [found] 0 > 0 [given] > - No nodes matched! > @vnkozlov I have addressed comments from Fei Gao and Xiaohong Gong. I have limited vectorization to avx2 and higher. If the changes look good to you, could you kindly run the tests? Thanks for all your help. @smita-kamath, can you explain why it does not work with AVX1? If it really requires AVX2 then you should just disable F16C for `(AVX < 2)` instead of current `(AVX < 1)` in `vm_version_x86.cpp`. And you would not need to modify `.ad` file and test. 
------------- PR: https://git.openjdk.org/jdk/pull/11471 From svkamath at openjdk.org Wed Dec 7 22:48:11 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Wed, 7 Dec 2022 22:48:11 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v7] In-Reply-To: <9iLzRYsQr5T_DJP8agK6WVdTHs1lqf1BLZocDITGu54=.910b8cf3-7329-4299-a9e0-3ac5f94db979@github.com> References: <9iLzRYsQr5T_DJP8agK6WVdTHs1lqf1BLZocDITGu54=.910b8cf3-7329-4299-a9e0-3ac5f94db979@github.com> Message-ID: On Wed, 7 Dec 2022 22:12:20 GMT, Vladimir Kozlov wrote: >> New test failed when run with `-XX:UseAVX=1`. I added output to RFE in comment. >> >> - counts: Graph contains wrong number of nodes: >> * Constraint 1: "(\\d+(\\s){2}(VectorCastH2F.*)+(\\s){2}===.*)" >> - Failed comparison: [found] 0 > 0 [given] >> - No nodes matched! > >> @vnkozlov I have addressed comments from Fei Gao and Xiaohong Gong. I have limited vectorization to avx2 and higher. If the changes look good to you, could you kindly run the tests? Thanks for all your help. > > @smita-kamath, can you explain why it does not work with AVX1? If it really requires AVX2 then you should just disable F16C for `(AVX < 2)` instead of current `(AVX < 1)` in `vm_version_x86.cpp`. And you would not need to modify `.ad` file and test. @vnkozlov you are right. It should work with AVX=1. I will make the changes. Thank you for your comment. ------------- PR: https://git.openjdk.org/jdk/pull/11471 From svkamath at openjdk.org Wed Dec 7 23:31:24 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Wed, 7 Dec 2022 23:31:24 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v9] In-Reply-To: References: Message-ID: > Hi All, > > I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. 
> Following are the performance numbers of JMH micro Fp16ConversionBenchmark:
> Before code changes:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms
>
> After:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms
>
> Kindly review and share your feedback. > > Thanks.
> Smita Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: Updated test case and updated code as per review comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11471/files - new: https://git.openjdk.org/jdk/pull/11471/files/944f4f3e..dc7d728c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11471&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11471&range=07-08 Stats: 4 lines in 2 files changed: 0 ins; 2 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11471.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11471/head:pull/11471 PR: https://git.openjdk.org/jdk/pull/11471 From svkamath at openjdk.org Wed Dec 7 23:38:46 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Wed, 7 Dec 2022 23:38:46 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v7] In-Reply-To: <9iLzRYsQr5T_DJP8agK6WVdTHs1lqf1BLZocDITGu54=.910b8cf3-7329-4299-a9e0-3ac5f94db979@github.com> References: <9iLzRYsQr5T_DJP8agK6WVdTHs1lqf1BLZocDITGu54=.910b8cf3-7329-4299-a9e0-3ac5f94db979@github.com> Message-ID: <5ymN4bLq2yiSrr19FX909Qy7Y7h6qhCb8A_tIaY4vdQ=.edbb02ce-6e30-4753-9f06-fc198da5caea@github.com> On Wed, 7 Dec 2022 22:12:20 GMT, Vladimir Kozlov wrote: >> New test failed when run with `-XX:UseAVX=1`. I added output to RFE in comment. >> >> - counts: Graph contains wrong number of nodes: >> * Constraint 1: "(\\d+(\\s){2}(VectorCastH2F.*)+(\\s){2}===.*)" >> - Failed comparison: [found] 0 > 0 [given] >> - No nodes matched! > >> @vnkozlov I have addressed comments from Fei Gao and Xiaohong Gong. I have limited vectorization to avx2 and higher. If the changes look good to you, could you kindly run the tests? Thanks for all your help. > > @smita-kamath, can you explain why it does not work with AVX1? If it really requires AVX2 then you should just disable F16C for `(AVX < 2)` instead of current `(AVX < 1)` in `vm_version_x86.cpp`. 
And you would not need to modify `.ad` file and test. @vnkozlov I have updated the test case to work with AVX=1. ------------- PR: https://git.openjdk.org/jdk/pull/11471 From kvn at openjdk.org Wed Dec 7 23:57:07 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 7 Dec 2022 23:57:07 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v7] In-Reply-To: <9iLzRYsQr5T_DJP8agK6WVdTHs1lqf1BLZocDITGu54=.910b8cf3-7329-4299-a9e0-3ac5f94db979@github.com> References: <9iLzRYsQr5T_DJP8agK6WVdTHs1lqf1BLZocDITGu54=.910b8cf3-7329-4299-a9e0-3ac5f94db979@github.com> Message-ID: On Wed, 7 Dec 2022 22:12:20 GMT, Vladimir Kozlov wrote: >> New test failed when run with `-XX:UseAVX=1`. I added output to RFE in comment. >> >> - counts: Graph contains wrong number of nodes: >> * Constraint 1: "(\\d+(\\s){2}(VectorCastH2F.*)+(\\s){2}===.*)" >> - Failed comparison: [found] 0 > 0 [given] >> - No nodes matched! > >> @vnkozlov I have addressed comments from Fei Gao and Xiaohong Gong. I have limited vectorization to avx2 and higher. If the changes look good to you, could you kindly run the tests? Thanks for all your help. > > @smita-kamath, can you explain why it does not work with AVX1? If it really requires AVX2 then you should just disable F16C for `(AVX < 2)` instead of current `(AVX < 1)` in `vm_version_x86.cpp`. And you would not need to modify `.ad` file and test. > @vnkozlov I have updated the test case to work with AVX=1. Can you explain what was wrong with AVX1 and what change fixed the issue? I see you renamed classes and addressed @fg1417 comment about `opcode`. It is not clear to me what fixed AVX1 issue. 
------------- PR: https://git.openjdk.org/jdk/pull/11471 From sviswanathan at openjdk.org Thu Dec 8 00:30:01 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 8 Dec 2022 00:30:01 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v7] In-Reply-To: References: <9iLzRYsQr5T_DJP8agK6WVdTHs1lqf1BLZocDITGu54=.910b8cf3-7329-4299-a9e0-3ac5f94db979@github.com> Message-ID: On Wed, 7 Dec 2022 23:54:50 GMT, Vladimir Kozlov wrote: >>> @vnkozlov I have addressed comments from Fei Gao and Xiaohong Gong. I have limited vectorization to avx2 and higher. If the changes look good to you, could you kindly run the tests? Thanks for all your help. >> >> @smita-kamath, can you explain why it does not work with AVX1? If it really requires AVX2 then you should just disable F16C for `(AVX < 2)` instead of current `(AVX < 1)` in `vm_version_x86.cpp`. And you would not need to modify `.ad` file and test. > >> @vnkozlov I have updated the test case to work with AVX=1. > > Can you explain what was wrong with AVX1 and what change fixed the issue? > I see you renamed classes and addressed @fg1417 comment about `opcode`. It is not clear to me what fixed AVX1 issue. @vnkozlov The test was failing earlier with -XX:UseAVX=1 because the right implemented() check was not happening as Fei Gao explained. In vectornode.cpp, method VectorCastNode::implemented() was not getting the right vopc (VectorCastF2X, VectorCastS2X instead of VectorCastF2HF and VectorCastHF2F) after call to VectorCastNode::opcode() and so the Matcher::match_rule_supported_superword() was called with wrong vopc. This is now fixed as Smita has fixed the VectorCastNode::opcode() and VectorCastNode::implemented(). 
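
For context, the kind of loop this PR is about can be sketched as below. This is a minimal illustration, assuming JDK 20 or later (where `Float.float16ToFloat` and `Float.floatToFloat16` are available); the class name `Fp16LoopSketch` and its methods are hypothetical and are not the benchmark or test from the PR. Whether C2 actually vectorizes such loops depends on the CPU features and flags discussed in this thread, which the sketch does not verify.

```java
// Hypothetical sketch, not code from the PR: simple array loops over the
// half-precision conversion APIs, the loop shape auto-vectorization looks for.
final class Fp16LoopSketch {
    // Converts each binary16 value (stored in a short) to a float.
    static void halfToFloat(short[] src, float[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Float.float16ToFloat(src[i]);
        }
    }

    // Converts each float to its nearest binary16 value (stored in a short).
    static void floatToHalf(float[] src, short[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Float.floatToFloat16(src[i]);
        }
    }

    public static void main(String[] args) {
        // All of these values are exactly representable in binary16,
        // so a round trip must return them unchanged.
        float[] in = {0.0f, 1.0f, -2.0f, 0.5f, 65504.0f};
        short[] half = new short[in.length];
        float[] out = new float[in.length];
        floatToHalf(in, half);
        halfToFloat(half, out);
        for (int i = 0; i < in.length; i++) {
            if (in[i] != out[i]) {
                throw new AssertionError("round trip changed " + in[i]);
            }
        }
        System.out.println("round trip preserved all sample values");
    }
}
```

Running such a class with and without `-XX:UseAVX=1` is one way to confirm that results stay identical regardless of whether the vectorized path is taken.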
------------- PR: https://git.openjdk.org/jdk/pull/11471 From kvn at openjdk.org Thu Dec 8 00:40:54 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 8 Dec 2022 00:40:54 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v7] In-Reply-To: References: <9iLzRYsQr5T_DJP8agK6WVdTHs1lqf1BLZocDITGu54=.910b8cf3-7329-4299-a9e0-3ac5f94db979@github.com> Message-ID: On Thu, 8 Dec 2022 00:27:42 GMT, Sandhya Viswanathan wrote: >>> @vnkozlov I have updated the test case to work with AVX=1. >> >> Can you explain what was wrong with AVX1 and what change fixed the issue? >> I see you renamed classes and addressed @fg1417 comment about `opcode`. It is not clear to me what fixed AVX1 issue. > > @vnkozlov The test was failing earlier with -XX:UseAVX=1 because the right implemented() check was not happening as Fei Gao explained. In vectornode.cpp, method VectorCastNode::implemented() was not getting the right vopc (VectorCastF2X, VectorCastS2X instead of VectorCastF2HF and VectorCastHF2F) after call to VectorCastNode::opcode() and so the Matcher::match_rule_supported_superword() was called with wrong vopc. This is now fixed as Smita has fixed the VectorCastNode::opcode() and VectorCastNode::implemented(). Thank you @sviswa7 for explanation! Good. ------------- PR: https://git.openjdk.org/jdk/pull/11471 From kvn at openjdk.org Thu Dec 8 00:40:57 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 8 Dec 2022 00:40:57 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v9] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 23:31:24 GMT, Smita Kamath wrote: >> Hi All, >> >> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. 
>> Following are the performance numbers of JMH micro Fp16ConversionBenchmark:
>> Before code changes:
>> Benchmark | (size) | Mode | Cnt | Score | Error | Units
>> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms
>> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms
>> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms
>> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms
>>
>> After:
>> Benchmark | (size) | Mode | Cnt | Score | Error | Units
>> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms
>> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms
>> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms
>> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms
>>
>> Kindly review and share your feedback. >> >> Thanks. >> Smita > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Updated test case and updated code as per review comment I started new testing after verifying locally that test passed with `-XX:UseAVX=1`. ------------- PR: https://git.openjdk.org/jdk/pull/11471 From jbhateja at openjdk.org Thu Dec 8 02:12:08 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 8 Dec 2022 02:12:08 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v9] In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 00:37:48 GMT, Vladimir Kozlov wrote: >> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: >> >> Updated test case and updated code as per review comment > > I started new testing after verifying locally that test passed with `-XX:UseAVX=1`.
> @vnkozlov The test was failing earlier with -XX:UseAVX=1 because the right implemented() check was not happening as Fei Gao explained. In vectornode.cpp, method VectorCastNode::implemented() was not getting the right vopc (VectorCastF2X, VectorCastS2X instead of VectorCastF2HF and VectorCastHF2F) after call to VectorCastNode::opcode() and so the Matcher::match_rule_supported_superword() was called with wrong vopc. This is now fixed as Smita has fixed the VectorCastNode::opcode() and VectorCastNode::implemented(). Also, the IR test was only enabled for avx512f earlier, which some how over shadowed the problem. Since VM features are queried using CPUID hence matcher will give up if both F16C and AVX512F are not present. Hi @smita-kamath , we should not explicitly disable the F16C in vm_version. ------------- PR: https://git.openjdk.org/jdk/pull/11471 From sviswanathan at openjdk.org Thu Dec 8 03:00:56 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 8 Dec 2022 03:00:56 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v9] In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 00:37:48 GMT, Vladimir Kozlov wrote: >> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: >> >> Updated test case and updated code as per review comment > > I started new testing after verifying locally that test passed with `-XX:UseAVX=1`. > > @vnkozlov The test was failing earlier with -XX:UseAVX=1 because the right implemented() check was not happening as Fei Gao explained. In vectornode.cpp, method VectorCastNode::implemented() was not getting the right vopc (VectorCastF2X, VectorCastS2X instead of VectorCastF2HF and VectorCastHF2F) after call to VectorCastNode::opcode() and so the Matcher::match_rule_supported_superword() was called with wrong vopc. This is now fixed as Smita has fixed the VectorCastNode::opcode() and VectorCastNode::implemented(). 
> > Also, the IR test was only enabled for avx512f earlier, which some how over shadowed the problem. Since VM features are queried using CPUID hence matcher will give up if both F16C and AVX512F are not present. Hi @smita-kamath , we should not explicitly disable the F16C in vm_version. @jatin-bhateja When User sets -XX:UseAVX=0 on command line F16C needs to be disabled explicitly (in vm_version) as it needs AVX support. ------------- PR: https://git.openjdk.org/jdk/pull/11471 From jbhateja at openjdk.org Thu Dec 8 03:19:01 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 8 Dec 2022 03:19:01 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v9] In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 00:37:48 GMT, Vladimir Kozlov wrote: >> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: >> >> Updated test case and updated code as per review comment > > I started new testing after verifying locally that test passed with `-XX:UseAVX=1`. > > > @vnkozlov The test was failing earlier with -XX:UseAVX=1 because the right implemented() check was not happening as Fei Gao explained. In vectornode.cpp, method VectorCastNode::implemented() was not getting the right vopc (VectorCastF2X, VectorCastS2X instead of VectorCastF2HF and VectorCastHF2F) after call to VectorCastNode::opcode() and so the Matcher::match_rule_supported_superword() was called with wrong vopc. This is now fixed as Smita has fixed the VectorCastNode::opcode() and VectorCastNode::implemented(). > > > > > > Also, the IR test was only enabled for avx512f earlier, which some how over shadowed the problem. Since VM features are queried using CPUID hence matcher will give up if both F16C and AVX512F are not present. Hi @smita-kamath , we should not explicitly disable the F16C in vm_version. 
> > @jatin-bhateja When User sets -XX:UseAVX=0 on command line F16C needs to be disabled explicitly (in vm_version) as it needs AVX support. Thanks for explanation. ------------- PR: https://git.openjdk.org/jdk/pull/11471 From gcao at openjdk.org Thu Dec 8 03:34:48 2022 From: gcao at openjdk.org (Gui Cao) Date: Thu, 8 Dec 2022 03:34:48 GMT Subject: RFR: 8297238: RISC-V: Fix another two C2 IR matching tests for RISC-V Message-ID: Fix two IR matching tests that failed on RISC-V. Vector api Node will be matched only when UseRVV is enabled: - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java Please take a look and have some reviews. Thanks a lot. ## Testing: - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - no tests selected as expected (on unmatched board) - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java - no tests selected as expected (on unmatched board) ------------- Commit messages: - RISC-V: Fix another two C2 IR matching tests for RISC-V Changes: https://git.openjdk.org/jdk/pull/11577/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11577&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297238 Stats: 5 lines in 3 files changed: 3 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11577.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11577/head:pull/11577 PR: https://git.openjdk.org/jdk/pull/11577 From fyang at openjdk.org Thu Dec 8 03:53:10 2022 From: fyang at openjdk.org (Fei Yang) Date: Thu, 8 Dec 2022 03:53:10 GMT Subject: RFR: 8298345: Fix another two C2 IR matching tests for RISC-V In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 03:26:38 GMT, Gui Cao wrote: > Fix two IR matching tests that failed on RISC-V. 
> > Vector api Node will be matched only when UseRVV is enabled: > - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java > - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java > > Please take a look and have some reviews. Thanks a lot. > > ## Testing: > - fastdebug on unmatched board without support for RVV > - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - no tests selected as expected > - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java - no tests selected as expected > - fastdebug with -XX:+UseRVV on QEMU > - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - The C2 graph generated by the test is as expected > - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java -The C2 graph generated by the test is as expected > - fastdebug with -XX:-UseRVV on QEMU > - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - no tests selected as expected > - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java - no tests selected as expected Looks reasonable to me. Thanks. ------------- Marked as reviewed by fyang (Reviewer). PR: https://git.openjdk.org/jdk/pull/11577 From dzhang at openjdk.org Thu Dec 8 03:58:57 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 8 Dec 2022 03:58:57 GMT Subject: RFR: 8298345: Fix another two C2 IR matching tests for RISC-V In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 03:26:38 GMT, Gui Cao wrote: > Fix two IR matching tests that failed on RISC-V. > > Vector api Node will be matched only when UseRVV is enabled: > - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java > - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java > > Please take a look and have some reviews. Thanks a lot. 
> > ## Testing: > - fastdebug on unmatched board without support for RVV > - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - no tests selected as expected > - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java - no tests selected as expected > - fastdebug with -XX:+UseRVV on QEMU > - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - The C2 graph generated by the test is as expected > - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java -The C2 graph generated by the test is as expected > - fastdebug with -XX:-UseRVV on QEMU > - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - no tests selected as expected > - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java - no tests selected as expected LGTM, thanks! (Not a reviewer.) ------------- Marked as reviewed by dzhang (Author). PR: https://git.openjdk.org/jdk/pull/11577 From kvn at openjdk.org Thu Dec 8 05:16:30 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 8 Dec 2022 05:16:30 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v9] In-Reply-To: References: Message-ID: <5YgtSPDAq8exRODGc290FH-gzeDR7iFyAHnqApUIKEw=.dec7023b-728b-4318-ab8e-2f9246354d76@github.com> On Wed, 7 Dec 2022 23:31:24 GMT, Smita Kamath wrote: >> Hi All, >> >> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. >> Following are the performance numbers of JMH micro Fp16ConversionBenchmark: >> Before code changes: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ????? 0.041 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ? 11765.453 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ????? 
0.653 | ops/ms
>> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms
>>
>> After:
>> Benchmark | (size) | Mode | Cnt | Score | Error | Units
>> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms
>> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms
>> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms
>> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms
>>
>> Kindly review and share your feedback.
>>
>> Thanks.
>> Smita
>
> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision:
>
>   Updated test case and updated code as per review comment

Unfortunately I have to restart testing because the JTREG version was updated but I did not update my local repo, which caused half of the tests to fail with a "harness" error :^(

The good news is that the test passed in this testing (hotspot vector testing passed as a whole).

-------------

PR: https://git.openjdk.org/jdk/pull/11471

From svkamath at openjdk.org Thu Dec 8 05:35:47 2022
From: svkamath at openjdk.org (Smita Kamath)
Date: Thu, 8 Dec 2022 05:35:47 GMT
Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v9]
In-Reply-To: <5YgtSPDAq8exRODGc290FH-gzeDR7iFyAHnqApUIKEw=.dec7023b-728b-4318-ab8e-2f9246354d76@github.com>
References: <5YgtSPDAq8exRODGc290FH-gzeDR7iFyAHnqApUIKEw=.dec7023b-728b-4318-ab8e-2f9246354d76@github.com>
Message-ID: <0wCD-9ywlgGhlJEFwNpyVCRjKi14tIj8QROcndGPefc=.2482cfa8-32b9-4da5-b22d-57713ec513f0@github.com>

On Thu, 8 Dec 2022 05:13:45 GMT, Vladimir Kozlov wrote:

>> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Updated test case and updated code as per review comment
>
> The good news is that the test passed in this testing (hotspot vector testing passed as a whole).
@vnkozlov, Thanks so much for running the tests. I really appreciate your help.

-------------

PR: https://git.openjdk.org/jdk/pull/11471

From fjiang at openjdk.org Thu Dec 8 06:28:55 2022
From: fjiang at openjdk.org (Feilong Jiang)
Date: Thu, 8 Dec 2022 06:28:55 GMT
Subject: RFR: 8298345: Fix another two C2 IR matching tests for RISC-V
In-Reply-To:
References:
Message-ID: <4qfhqN5QV07zDSKIbFjv-Tktnh6tNN1XObd2dzm4bzA=.a16fd705-f550-47e7-82a4-0fa8b5aa70b5@github.com>

On Thu, 8 Dec 2022 03:26:38 GMT, Gui Cao wrote:

> Fix two IR matching tests that failed on RISC-V.
>
> Vector api Node will be matched only when UseRVV is enabled:
> - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java
> - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java
>
> Please take a look and have some reviews. Thanks a lot.
>
> ## Testing:
> - fastdebug on unmatched board without support for RVV
> - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - no tests selected as expected
> - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java - no tests selected as expected
> - fastdebug with -XX:+UseRVV on QEMU
> - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - The C2 graph generated by the test is as expected
> - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java - The C2 graph generated by the test is as expected
> - fastdebug with -XX:-UseRVV on QEMU
> - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - no tests selected as expected
> - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java - no tests selected as expected

Thanks for the fix, with one comment:

test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java line 32:

> 30: * @bug 8279258
> 31: * @summary Auto-vectorization enhancement for two-dimensional array operations
> 32: * @requires ((os.arch == "x86" | os.arch == "i386") & (vm.opt.UseSSE == "null" |
vm.opt.UseSSE >= 2)) Did you check that still works on other arches? e.g.: x64/aarch64 ------------- PR: https://git.openjdk.org/jdk/pull/11577 From fgao at openjdk.org Thu Dec 8 07:36:11 2022 From: fgao at openjdk.org (Fei Gao) Date: Thu, 8 Dec 2022 07:36:11 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v9] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 23:31:24 GMT, Smita Kamath wrote: >> Hi All, >> >> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. >> Following are the performance numbers of JMH micro Fp16ConversionBenchmark: >> Before code changes: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ????? 0.041 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ? 11765.453 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ????? 0.653 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ??? 361.696 | ops/ms >> >> After: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 |? 372.327 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 |? 9250.899 |ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 |? 483.034 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 |? 150.296 | ops/ms >> >> Kindly review and share your feedback. >> >> Thanks. >> Smita > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Updated test case and updated code as per review comment Thanks for your update. The change involving superword and vectornode parts looks good to me now. ------------- Marked as reviewed by fgao (Author). 
PR: https://git.openjdk.org/jdk/pull/11471 From xgong at openjdk.org Thu Dec 8 07:36:12 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 8 Dec 2022 07:36:12 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v9] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 23:31:24 GMT, Smita Kamath wrote: >> Hi All, >> >> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. >> Following are the performance numbers of JMH micro Fp16ConversionBenchmark: >> Before code changes: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ????? 0.041 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ? 11765.453 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ????? 0.653 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ??? 361.696 | ops/ms >> >> After: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 |? 372.327 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 |? 9250.899 |ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 |? 483.034 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 |? 150.296 | ops/ms >> >> Kindly review and share your feedback. >> >> Thanks. >> Smita > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Updated test case and updated code as per review comment LGTM, thanks for the update! ------------- Marked as reviewed by xgong (Committer). 
PR: https://git.openjdk.org/jdk/pull/11471 From chagedorn at openjdk.org Thu Dec 8 08:06:36 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 8 Dec 2022 08:06:36 GMT Subject: RFR: 8257197: Add additional verification code to PhaseCCP [v3] In-Reply-To: References: Message-ID: <_q0f-BeipXvrAJIMwn57p2ICgR-X8P2cLITVihDTzt0=.f12ac5ce-e5dc-4134-81ea-7046670a39d1@github.com> On Wed, 7 Dec 2022 17:08:23 GMT, Emanuel Peter wrote: >> **Targetted for JDK-21.** >> >> We have had many bugs that could be tracked back to missing optimizations during PhaseCCP. >> Often the problem is that a node `x` has an input (of input of input) `y` which is modified, but `y` does not notify `x` (does not push it to the worklist). For one this is a missed opportunity to optimize, but it can also lead to failed assumptions later: often we assume that all that can be optimize is already optimized. >> This verification helps us debug faster, and can also help when `Value` optimizations suddenly do further-reaching traversals, which would require further-reaching notifications. >> >> Sadly, the verification is not total: we have some exceptions. Especially for `LoadNode`, which perform a walk up the memory inputs, which can go arbitrarily far, to look for relates `StoreNode`s. Notification would thus have to put very many nodes on the worklist. The question is if the potential additional optimization is worth the compile-time. If we decided yes, then one we might want to implement a listener-style notification: when a node visits inputs during `Value`, it could subscribe to all (or the relevant) visited input-nodes for future updates. Currently, we just mostly do fixed 1-hop or 2-hop notification of output nodes. >> >> FYI: I plan to do a similar verification, and a refactoring of `PhaseCCP::push_child_nodes_to_worklist` and `PhaseIterGVN::add_users_to_worklist` in [JDK-8298094](https://bugs.openjdk.org/browse/JDK-8298094). 
> > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into JDK-8257197 > - Review suggestions from Christian > - 8257197: Add additional verification code to PhaseCCP Thanks for doing the updates, looks good! src/hotspot/share/opto/phaseX.cpp line 1817: > 1815: if (told != tnew) { > 1816: // Check special cases that are ok > 1817: if (told->isa_integer(tnew->basic_type()) ) { // both either int or long As `isa_integer()` returns a pointer, it might be better to explicitly check with `!= nullptr`. Suggestion: if (told->isa_integer(tnew->basic_type()) != nullptr) { // both either int or long src/hotspot/share/opto/phaseX.cpp line 1819: > 1817: if (told->isa_integer(tnew->basic_type()) ) { // both either int or long > 1818: const TypeInteger *t0 = told->is_integer(tnew->basic_type()); > 1819: const TypeInteger *t1 = tnew->is_integer(tnew->basic_type()); Suggestion: const TypeInteger* t0 = told->is_integer(tnew->basic_type()); const TypeInteger* t1 = tnew->is_integer(tnew->basic_type()); ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11529 From epeter at openjdk.org Thu Dec 8 08:46:14 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 8 Dec 2022 08:46:14 GMT Subject: RFR: 8257197: Add additional verification code to PhaseCCP [v4] In-Reply-To: References: Message-ID: > **Targetted for JDK-21.** > > We have had many bugs that could be tracked back to missing optimizations during PhaseCCP. > Often the problem is that a node `x` has an input (of input of input) `y` which is modified, but `y` does not notify `x` (does not push it to the worklist). 
For one thing, this is a missed opportunity to optimize, but it can also lead to failed assumptions later: we often assume that everything that can be optimized has already been optimized.
> This verification helps us debug faster, and can also help when `Value` optimizations suddenly do further-reaching traversals, which would require further-reaching notifications.
>
> Sadly, the verification is not total: we have some exceptions. This applies especially to `LoadNode`, which performs a walk up the memory inputs, which can go arbitrarily far, to look for related `StoreNode`s. Notification would thus have to put very many nodes on the worklist. The question is whether the potential additional optimization is worth the compile time. If we decide yes, then we might want to implement a listener-style notification: when a node visits inputs during `Value`, it could subscribe to all (or the relevant) visited input nodes for future updates. Currently, we mostly just do fixed 1-hop or 2-hop notification of output nodes.
>
> FYI: I plan to do a similar verification, and a refactoring of `PhaseCCP::push_child_nodes_to_worklist` and `PhaseIterGVN::add_users_to_worklist` in [JDK-8298094](https://bugs.openjdk.org/browse/JDK-8298094).
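
The listener-style notification idea described above could be sketched roughly as follows. This is a toy model in Java, not HotSpot code: the `Node` class, its `walksInputs` flag, and the `value`, `notifyChanged`, `drain`, and `run` methods are all hypothetical names made up for illustration, and an `int` stands in for the lattice type. It only contrasts fixed 1-hop notification of direct outputs with notification of every node that subscribed while walking its inputs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of worklist notification; all names here are hypothetical. */
public class ListenerNotificationSketch {

    static final class Node {
        final String name;
        final Node input;                                    // null for a leaf
        final boolean walksInputs;                           // true: value() walks up the whole input chain
        final List<Node> outputs = new ArrayList<>();        // direct users (1-hop)
        final Set<Node> subscribers = new LinkedHashSet<>(); // nodes whose value() visited this node
        int type;                                            // stand-in for the lattice type

        Node(String name, Node input, int type, boolean walksInputs) {
            this.name = name;
            this.input = input;
            this.type = type;
            this.walksInputs = walksInputs;
            if (input != null) {
                input.outputs.add(this);
            }
        }

        /** Recomputes the type; a walking node subscribes to every input it visits. */
        int value() {
            if (!walksInputs || input == null) {
                return type; // pass-through nodes never change in this model
            }
            Node n = input;
            n.subscribers.add(this);
            while (n.input != null) {
                n = n.input;
                n.subscribers.add(this);
            }
            return n.type + 1; // depends on the leaf, arbitrarily many hops away
        }
    }

    /** Pushes either the subscribers or only the direct outputs of a changed node. */
    static void notifyChanged(Node changed, Deque<Node> worklist, boolean useListeners) {
        Collection<Node> targets = useListeners ? changed.subscribers : changed.outputs;
        for (Node t : targets) {
            if (!worklist.contains(t)) {
                worklist.add(t);
            }
        }
    }

    /** Processes the worklist until no node changes any more. */
    static void drain(Deque<Node> worklist, boolean useListeners) {
        while (!worklist.isEmpty()) {
            Node n = worklist.poll();
            int newType = n.value();
            if (newType != n.type) {
                n.type = newType;
                notifyChanged(n, worklist, useListeners);
            }
        }
    }

    /** Builds the chain y -> a -> b -> x, updates y, and returns the final type of x. */
    static int run(boolean useListeners) {
        Node y = new Node("y", null, 0, false);
        Node a = new Node("a", y, 100, false);
        Node b = new Node("b", a, 100, false);
        Node x = new Node("x", b, 0, true);

        Deque<Node> worklist = new ArrayDeque<>();
        worklist.add(x);
        drain(worklist, useListeners); // x becomes y.type + 1 and subscribes to b, a, y

        y.type = 5;                    // y is modified three hops away from x
        notifyChanged(y, worklist, useListeners);
        drain(worklist, useListeners);
        return x.type;
    }

    public static void main(String[] args) {
        System.out.println("1-hop notification:   x = " + run(false)); // stale: 1
        System.out.println("listener notification: x = " + run(true)); // updated: 6
    }
}
```

In this model the 1-hop variant only wakes up `a`, whose type does not change, so `x` keeps its stale value, which is the kind of missed update the verification is meant to catch. The listener variant pays for that with one subscription per visited input, which mirrors the compile-time trade-off raised above.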
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Apply suggestions from code review Thanks Christian Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11529/files - new: https://git.openjdk.org/jdk/pull/11529/files/db08362b..35302e3d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11529&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11529&range=02-03 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/11529.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11529/head:pull/11529 PR: https://git.openjdk.org/jdk/pull/11529 From kvn at openjdk.org Thu Dec 8 08:56:13 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 8 Dec 2022 08:56:13 GMT Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v9] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 23:31:24 GMT, Smita Kamath wrote: >> Hi All, >> >> I have added changes for autovectorizing Float.float16ToFloat and Float.floatToFloat16 API's. >> Following are the performance numbers of JMH micro Fp16ConversionBenchmark: >> Before code changes: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ????? 0.041 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ? 11765.453 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ????? 0.653 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ??? 361.696 | ops/ms >> >> After: >> Benchmark | (size) | Mode | Cnt | Score | Error | Units >> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 |? 372.327 | ops/ms >> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 |? 
9250.899 |ops/ms >> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 |? 483.034 | ops/ms >> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 |? 150.296 | ops/ms >> >> Kindly review and share your feedback. >> >> Thanks. >> Smita > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Updated test case and updated code as per review comment Latest testing results are good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11471 From aboldtch at openjdk.org Thu Dec 8 09:01:08 2022 From: aboldtch at openjdk.org (Axel Boldt-Christmas) Date: Thu, 8 Dec 2022 09:01:08 GMT Subject: RFR: 8297235: ZGC: assert(regs[i] != regs[j]) failed: Multiple uses of register: rax [v3] In-Reply-To: References: Message-ID: <0fGmbCvEUHokqoFR0ndPwd7yA5I_SS-MlBneCPP3LqY=.c4b67306-e5bb-4af5-8875-adba9ea9d0f1@github.com> > Tests java/util/stream/test/org/openjdk/tests/java/util/* with -XX:+UseZGC -Xcomp -XX:-TieredCompilation crashes with `assert(regs[i] != regs[j]) failed: Multiple uses of register: rax`. More specifically compilation of java.util.concurrent.ForkJoinTask::awaitDone. > > The reason seems to be that the compare value and the memory input ends up sharing a register. (Uses Unsafe CAS which CAS an object reference into a field of that object, `oldval: rax` and `mem: [rax+offset]`). The Z load barrier stub dispatch implementation require that the reference and reference address occupy distinct registers. In the loadP nodes this is established by marking all but the memory TEMP which results in no sharing. > > This is not possible for the CompareAndSwapP / CompareAndExchangeP nodes as the compare value is an input node. > > The solution proposed here is less than ideal as it makes the CAS nodes require one extra TEMP register, which in the common case is unused. This puts unnecessary extra strain on the register allocation. 
The problem is that there is currently no way (that I can find) to express in .ad that a memory input must not share registers with a specific other input.
>
> There is an alternative solution for this specific crash which does not use a second TEMP register (see commit: cfd5ced4e97e986fc10c5a8721b543cd3101c58a). It accomplishes this by using the same trick that the aarch64 Z CAS node uses, which is to specify the memory as indirect, which results in the address being loaded into a register with LEA. However, from what I can see this does not guarantee that the address and the reference do not share a register (`oldval: rax` and `mem: [rax]`). So it is theoretically broken (and so is the aarch64 implementation).
>
> It is unclear to me if there is ever a way for C2 to generate a CAS which compares the address of the field with its content.
>
> I call on anyone with more knowledge about `adlc` and `C2` for feedback. Specifically, I want to open up a discussion on these points:
> * Is there some other way of expressing in the .ad file that a memory input should not share some register?
> * If not, is this a worthwhile RFE? It seems to be a pattern used in at least some other places in Z.
> * Will the indirect input ever share a register with oldval and/or are the aarch64/riscv implementations broken because of this? How about ppc?
>
> Testing: linux-x64 zgc tagged tests tier 1-7 and some specific crashing tests with `-XX:+UseZGC -Xcomp -XX:-TieredCompilation` (in: java/util/stream/, java/util/concurrent/)

Axel Boldt-Christmas has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains five additional commits since the last revision: - Remove problem listed tests - Merge remote-tracking branch 'upstream_jdk/master' into JDK-8297235 - indirect zXChgP as well - indirect alternative - JDK-8297235: ZGC: assert(regs[i] != regs[j]) failed: Multiple uses of register: rax ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11410/files - new: https://git.openjdk.org/jdk/pull/11410/files/42a72c1e..0c715331 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11410&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11410&range=01-02 Stats: 116454 lines in 1834 files changed: 56104 ins; 41536 del; 18814 mod Patch: https://git.openjdk.org/jdk/pull/11410.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11410/head:pull/11410 PR: https://git.openjdk.org/jdk/pull/11410 From gcao at openjdk.org Thu Dec 8 09:04:02 2022 From: gcao at openjdk.org (Gui Cao) Date: Thu, 8 Dec 2022 09:04:02 GMT Subject: RFR: 8298345: Fix another two C2 IR matching tests for RISC-V In-Reply-To: <4qfhqN5QV07zDSKIbFjv-Tktnh6tNN1XObd2dzm4bzA=.a16fd705-f550-47e7-82a4-0fa8b5aa70b5@github.com> References: <4qfhqN5QV07zDSKIbFjv-Tktnh6tNN1XObd2dzm4bzA=.a16fd705-f550-47e7-82a4-0fa8b5aa70b5@github.com> Message-ID: On Thu, 8 Dec 2022 06:26:20 GMT, Feilong Jiang wrote: > Did you check that still works on other arches? e.g.: x64/aarch64 Hi, thanks for the review. I tested it under x86_64 GNU/Linux and aarch64 GNU/Linux, and both passed the test normally. ------------- PR: https://git.openjdk.org/jdk/pull/11577 From fjiang at openjdk.org Thu Dec 8 09:07:02 2022 From: fjiang at openjdk.org (Feilong Jiang) Date: Thu, 8 Dec 2022 09:07:02 GMT Subject: RFR: 8298345: Fix another two C2 IR matching tests for RISC-V In-Reply-To: References: Message-ID: <-r6tIXlb6C2uQ3OB3lbHRS2ae8iXEauOxcXsnc4FFzc=.5cefefc9-c87e-4cea-a516-4722c31e2428@github.com> On Thu, 8 Dec 2022 03:26:38 GMT, Gui Cao wrote: > Fix two IR matching tests that failed on RISC-V. 
> > Vector api Node will be matched only when UseRVV is enabled: > - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java > - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java > > Please take a look and have some reviews. Thanks a lot. > > ## Testing: > - fastdebug on unmatched board without support for RVV > - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - no tests selected as expected > - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java - no tests selected as expected > - fastdebug with -XX:+UseRVV on QEMU > - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - The C2 graph generated by the test is as expected > - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java -The C2 graph generated by the test is as expected > - fastdebug with -XX:-UseRVV on QEMU > - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - no tests selected as expected > - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java - no tests selected as expected Marked as reviewed by fjiang (Author). ------------- PR: https://git.openjdk.org/jdk/pull/11577 From chagedorn at openjdk.org Thu Dec 8 09:07:09 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 8 Dec 2022 09:07:09 GMT Subject: RFR: 8257197: Add additional verification code to PhaseCCP [v4] In-Reply-To: References: Message-ID: <__SHTupeNf_Eppio4_ifFaZINO89umRehitbNSrPEmk=.6eebce50-cbfd-4823-9fd6-67949695a999@github.com> On Thu, 8 Dec 2022 08:46:14 GMT, Emanuel Peter wrote: >> **Targetted for JDK-21.** >> >> We have had many bugs that could be tracked back to missing optimizations during PhaseCCP. >> Often the problem is that a node `x` has an input (of input of input) `y` which is modified, but `y` does not notify `x` (does not push it to the worklist). 
For one thing, this is a missed opportunity to optimize, but it can also lead to failed assumptions later: we often assume that everything that can be optimized has already been optimized.
>> This verification helps us debug faster, and can also help when `Value` optimizations suddenly do further-reaching traversals, which would require further-reaching notifications.
>>
>> Sadly, the verification is not total: we have some exceptions. This applies especially to `LoadNode`, which performs a walk up the memory inputs, which can go arbitrarily far, to look for related `StoreNode`s. Notification would thus have to put very many nodes on the worklist. The question is whether the potential additional optimization is worth the compile time. If we decide yes, then we might want to implement a listener-style notification: when a node visits inputs during `Value`, it could subscribe to all (or the relevant) visited input nodes for future updates. Currently, we mostly just do fixed 1-hop or 2-hop notification of output nodes.
>>
>> FYI: I plan to do a similar verification, and a refactoring of `PhaseCCP::push_child_nodes_to_worklist` and `PhaseIterGVN::add_users_to_worklist` in [JDK-8298094](https://bugs.openjdk.org/browse/JDK-8298094).
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>
>   Apply suggestions from code review
>
>   Thanks Christian
>
>   Co-authored-by: Christian Hagedorn

Marked as reviewed by chagedorn (Reviewer).
-------------

PR: https://git.openjdk.org/jdk/pull/11529

From svkamath at openjdk.org Thu Dec 8 09:09:08 2022
From: svkamath at openjdk.org (Smita Kamath)
Date: Thu, 8 Dec 2022 09:09:08 GMT
Subject: RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v9]
In-Reply-To: <5YgtSPDAq8exRODGc290FH-gzeDR7iFyAHnqApUIKEw=.dec7023b-728b-4318-ab8e-2f9246354d76@github.com>
References: <5YgtSPDAq8exRODGc290FH-gzeDR7iFyAHnqApUIKEw=.dec7023b-728b-4318-ab8e-2f9246354d76@github.com>
Message-ID:

On Thu, 8 Dec 2022 05:13:45 GMT, Vladimir Kozlov wrote:

>> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Updated test case and updated code as per review comment
>
> The good news is that the test passed in this testing (hotspot vector testing passed as a whole).

@vnkozlov Thanks a lot for your review comments and for testing this patch.

-------------

PR: https://git.openjdk.org/jdk/pull/11471

From qamai at openjdk.org Thu Dec 8 09:14:19 2022
From: qamai at openjdk.org (Quan Anh Mai)
Date: Thu, 8 Dec 2022 09:14:19 GMT
Subject: RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v14]
In-Reply-To:
References:
Message-ID:

On Wed, 7 Dec 2022 18:45:06 GMT, Vladimir Kozlov wrote:

>> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   fix test
>
> Rerunning DaCapo and Renaissance which shows some variations.

@vnkozlov I believe `VectorTestNode` is only used in the Vector API, so this change should not affect those benchmarks?
------------- PR: https://git.openjdk.org/jdk/pull/9855 From kvn at openjdk.org Thu Dec 8 09:24:06 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 8 Dec 2022 09:24:06 GMT Subject: RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v14] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 14:24:39 GMT, Quan Anh Mai wrote: >> This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and doing a `CmpI` to get back the flags. As a result, `VectorMask::alltrue` is compiled into machine code: >> >> vptest xmm0, xmm1 >> jb if_true >> if_false: >> >> instead of: >> >> vptest xmm0, xmm1 >> setb r10 >> movzbl r10 >> testl r10 >> jne if_true >> if_false: >> >> The results of `jdk.incubator.vector.ArrayMismatchBenchmark` show noticeable improvements: >> >>
>> Before After
>> Benchmark Prefix Size Mode Cnt Score Error Score Error Units Change
>> ArrayMismatchBenchmark.mismatchVectorByte 0.5 9 thrpt 10 217345.383 ± 8316.444 222279.381 ± 2660.983 ops/ms +2.3%
>> ArrayMismatchBenchmark.mismatchVectorByte 0.5 257 thrpt 10 113918.406 ± 1618.836 116268.691 ± 1291.899 ops/ms +2.1%
>> ArrayMismatchBenchmark.mismatchVectorByte 0.5 100000 thrpt 10 702.066 ± 72.862 797.806 ± 16.429 ops/ms +13.6%
>> ArrayMismatchBenchmark.mismatchVectorByte 1.0 9 thrpt 10 146096.564 ± 2401.258 145338.910 ± 687.453 ops/ms -0.5%
>> ArrayMismatchBenchmark.mismatchVectorByte 1.0 257 thrpt 10 60598.181 ± 1259.397 69041.519 ± 1073.156 ops/ms +13.9%
>> ArrayMismatchBenchmark.mismatchVectorByte 1.0 100000 thrpt 10 316.814 ± 10.975 408.770 ± 5.281 ops/ms +29.0%
>> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 9 thrpt 10 195674.549 ± 1200.166 188482.433 ± 1872.076 ops/ms -3.7%
>> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 257 thrpt 10 44357.169 ± 473.013 42293.411 ± 2838.255 ops/ms -4.7%
>> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 100000 thrpt 10 68.199 ± 5.410 67.628 ± 3.241 ops/ms -0.8%
>> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 9 thrpt 10 107722.450 ± 1677.607 111060.400 ± 982.230 ops/ms +3.1%
>> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 257 thrpt 10 16692.645 ± 1002.599 21440.506 ± 1618.266 ops/ms +28.4%
>> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 100000 thrpt 10 32.984 ± 0.548 33.202 ± 2.365 ops/ms +0.7%
>> ArrayMismatchBenchmark.mismatchVectorInt 0.5 9 thrpt 10 335458.217 ± 3154.842 379944.254 ± 5703.134 ops/ms +13.3%
>> ArrayMismatchBenchmark.mismatchVectorInt 0.5 257 thrpt 10 58505.302 ± 786.312 56721.368 ± 2497.052 ops/ms -3.0%
>> ArrayMismatchBenchmark.mismatchVectorInt 0.5 100000 thrpt 10 133.037 ± 11.415 139.537 ± 4.667 ops/ms +4.9%
>> ArrayMismatchBenchmark.mismatchVectorInt 1.0 9 thrpt 10 117943.802 ± 2281.349 112409.365 ± 2110.055 ops/ms -4.7%
>> ArrayMismatchBenchmark.mismatchVectorInt 1.0 257 thrpt 10 27060.015 ± 795.619 33756.613 ± 826.533 ops/ms +24.7%
>> ArrayMismatchBenchmark.mismatchVectorInt 1.0 100000 thrpt 10 57.558 ± 8.927 66.951 ± 4.381 ops/ms +16.3%
>> ArrayMismatchBenchmark.mismatchVectorLong 0.5 9 thrpt 10 182963.715 ± 1042.497 182438.405 ± 2120.832 ops/ms -0.3%
>> ArrayMismatchBenchmark.mismatchVectorLong 0.5 257 thrpt 10 36672.215 ± 614.821 35397.398 ± 1609.235 ops/ms -3.5%
>> ArrayMismatchBenchmark.mismatchVectorLong 0.5 100000 thrpt 10 66.438 ± 2.142 65.427 ± 2.270 ops/ms -1.5%
>> ArrayMismatchBenchmark.mismatchVectorLong 1.0 9 thrpt 10 110393.047 ± 497.853 115165.845 ± 5381.674 ops/ms +4.3%
>> ArrayMismatchBenchmark.mismatchVectorLong 1.0 257 thrpt 10 14720.765 ± 661.350 19871.096 ± 201.464 ops/ms +35.0%
>> ArrayMismatchBenchmark.mismatchVectorLong 1.0 100000 thrpt 10 30.760 ± 0.821 31.933 ± 1.352 ops/ms +3.8%
>> >> I have not been able to conduct thorough testing on AVX512 and AArch64 so any help would be invaluable. Thank you very much. 
> > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > fix test Marked as reviewed by kvn (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9855 From kvn at openjdk.org Thu Dec 8 09:24:09 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 8 Dec 2022 09:24:09 GMT Subject: RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v14] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 18:45:06 GMT, Vladimir Kozlov wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> fix test > > Rerunning DaCapo and Renaissance which shows some variations. > @vnkozlov I believe `VectorTestNode` is only used in Vector API, which should not affect those benchmarks? Yes, but you have changes in general code related to the `Cmp` node. I see that DaCapo doesn't show a regression after the rerun. Renaissance is still running and the numbers are "all over the place" - it is not stable. Anyway, I am approving these changes based on the data I got. ------------- PR: https://git.openjdk.org/jdk/pull/9855 From svkamath at openjdk.org Thu Dec 8 09:51:06 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Thu, 8 Dec 2022 09:51:06 GMT Subject: Integrated: 8294588: Auto vectorize half precision floating point conversion APIs In-Reply-To: References: Message-ID: On Fri, 2 Dec 2022 04:22:39 GMT, Smita Kamath wrote: > Hi All, > > I have added changes for autovectorizing the Float.float16ToFloat and Float.floatToFloat16 APIs. > Following are the performance numbers of JMH micro Fp16ConversionBenchmark: > Before code changes:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 1044.653 | ± 0.041 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9 | ± 11765.453 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 2156.662 | ± 0.653 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1 | ± 361.696 | ops/ms
> > After:
> Benchmark | (size) | Mode | Cnt | Score | Error | Units
> Fp16ConversionBenchmark.float16ToFloat | 2048 | thrpt | 3 | 20460.349 | ± 372.327 | ops/ms
> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16 | 2048 | thrpt | 3 | 22553.977 | ± 483.034 | ops/ms
> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ± 150.296 | ops/ms
> > Kindly review and share your feedback. > > Thanks. > Smita This pull request has now been integrated. Changeset: 073897c8 Author: Smita Kamath Committer: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/073897c88bbc430e8751a18baf7487f6474fd0c3 Stats: 231 lines in 12 files changed: 221 ins; 0 del; 10 mod 8294588: Auto vectorize half precision floating point conversion APIs Reviewed-by: sviswanathan, kvn, jbhateja, fgao, xgong ------------- PR: https://git.openjdk.org/jdk/pull/11471 From chagedorn at openjdk.org Thu Dec 8 10:16:08 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 8 Dec 2022 10:16:08 GMT Subject: RFR: 8295116: C2: assert(dead->outcnt() == 0 && !dead->is_top()) failed: node must be dead Message-ID: In `IfNode::fold_compares_helper()`, we merge two immediately following `If` nodes with a `CmpI` that share the left input into a single `If` with `CmpU`. In this case here, a graph is currently dying. The `If` nodes are not yet removed but data was already folded in such a way that the graph currently looks like this when trying to apply `IfNode::fold_compares()`: ![image](https://user-images.githubusercontent.com/17833009/206415585-083c6558-84d1-4787-b766-01be8b64bf10.png) In `IfNode::is_ctrl_folds()` we check if the dominating `If` is suited to apply this optimization. 
Here we check if the left inputs of the `CmpI` nodes are the same which is indeed the case because both `CmpI` have `top` as left input: https://github.com/openjdk/jdk/blob/46cd457b0f78996a3f26e44452de8f8a66041f58/src/hotspot/share/opto/ifnode.cpp#L734-L736 We start merging `1641 If` and `1657 If`. During this process, we try to remove dead nodes that are no longer used: https://github.com/openjdk/jdk/blob/46cd457b0f78996a3f26e44452de8f8a66041f58/src/hotspot/share/opto/ifnode.cpp#L1042-L1044 But in this specific setup, `adjusted_val` is `top` and we try to remove it because `outcnt()` for `top` is zero. This results in the assertion failure. The fix I propose is to not only check for equality of the left inputs in `IfNode::is_ctrl_folds()` but also check if one of them is `top`. I was only able to reproduce this bug with a replay file and a very specific seed when using `-XX:+StressIGVN`. Thanks, Christian ------------- Commit messages: - 8295116: C2: assert(dead->outcnt() == 0 && !dead->is_top()) failed: node must be dead Changes: https://git.openjdk.org/jdk/pull/11581/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11581&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295116 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11581.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11581/head:pull/11581 PR: https://git.openjdk.org/jdk/pull/11581 From thartmann at openjdk.org Thu Dec 8 10:31:08 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 8 Dec 2022 10:31:08 GMT Subject: RFR: 8295116: C2: assert(dead->outcnt() == 0 && !dead->is_top()) failed: node must be dead In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 10:06:32 GMT, Christian Hagedorn wrote: > In `IfNode::fold_compares_helper()`, we merge two immediately following `If` nodes with a `CmpI` that share the left input into a single `If` with `CmpU`. > > In this case here, a graph is currently dying. 
The `If` nodes are not yet removed but data was already folded in such a way that the graph currently looks like this when trying to apply `IfNode::fold_compares()`: > > ![image](https://user-images.githubusercontent.com/17833009/206415585-083c6558-84d1-4787-b766-01be8b64bf10.png) > > In `IfNode::is_ctrl_folds()` we check if the dominating `If` is suited to apply this optimization. Here we check if the left inputs of the `CmpI` nodes are the same which is indeed the case because both `CmpI` have `top` as left input: > https://github.com/openjdk/jdk/blob/46cd457b0f78996a3f26e44452de8f8a66041f58/src/hotspot/share/opto/ifnode.cpp#L734-L736 > > We start merging `1641 If` and `1657 If`. During this process, we try to remove dead nodes that are no longer used: > https://github.com/openjdk/jdk/blob/46cd457b0f78996a3f26e44452de8f8a66041f58/src/hotspot/share/opto/ifnode.cpp#L1042-L1044 > > But in this specific setup, `adjusted_val` is `top` and we try to remove it because `outcnt()` for `top` is zero. This results in the assertion failure. > > The fix I propose is to not only check for equality of the left inputs in `IfNode::is_ctrl_folds()` but also check if one of them is `top`. > > I was only able to reproduce this bug with a replay file and a very specific seed when using `-XX:+StressIGVN`. > > Thanks, > Christian Looks good and trivial. ------------- Marked as reviewed by thartmann (Reviewer). 
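To illustrate the kind of merge that `IfNode::fold_compares()` is described as performing, here is a small Java sketch. It assumes the common range-check shape `0 <= x && x < limit` with a non-negative `limit`; the class and method names are made up for illustration and this is not the C2 implementation. It only shows why two signed compares on the same left input can be replaced by a single unsigned compare.

```java
// Illustration only: hypothetical names, not HotSpot code.
final class FoldComparesSketch {
    // Two signed compares on the same value, as two consecutive If nodes would test them.
    static boolean inRangeTwoCompares(int x, int limit) {
        return x >= 0 && x < limit;
    }

    // One unsigned compare. Equivalent to the form above only when limit >= 0:
    // a negative x becomes a huge unsigned value and therefore fails the test.
    static boolean inRangeOneCompare(int x, int limit) {
        return Integer.compareUnsigned(x, limit) < 0;
    }

    public static void main(String[] args) {
        int limit = 10;
        int[] samples = {Integer.MIN_VALUE, -1, 0, 9, 10, Integer.MAX_VALUE};
        for (int x : samples) {
            if (inRangeTwoCompares(x, limit) != inRangeOneCompare(x, limit)) {
                throw new AssertionError("forms disagree for x = " + x);
            }
        }
        System.out.println("both forms agree for all samples");
    }
}
```

With a negative `limit` the two forms differ (for example `x = 3`, `limit = -5`), which is why the equivalence is stated only for non-negative limits.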
PR: https://git.openjdk.org/jdk/pull/11581 From thartmann at openjdk.org Thu Dec 8 10:32:11 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 8 Dec 2022 10:32:11 GMT Subject: RFR: 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph [v3] In-Reply-To: References: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com> Message-ID: On Wed, 7 Dec 2022 17:49:27 GMT, Christian Hagedorn wrote: >> The test cases of this bug reveal the same problem with `PhaseIdealLoop::create_new_if_for_predicate()` as in [JDK-8271954](https://bugs.openjdk.org/browse/JDK-8271954) (reusing pinned data nodes for different UCT paths into the UCT phi - see PR description of https://github.com/openjdk/jdk/pull/5185). While JDK-8271954 only fixed the usage of `PhaseIdealLoop::create_new_if_for_predicate()` for one specific case in loop unswitching, we now need this fix for other usages of `PhaseIdealLoop::create_new_if_for_predicate()` as well. I've found failing cases for most of the usages but I think we should always do a proper cloning as originally added with JDK-8271954. This is what I propose with this patch. >> >> To always do this cloning, I've replaced the `UnswitchingAction` by a `rewire_uncommon_proj_phi_inputs` bool that is false by default but can be set if we should only do a rewiring (we can still do the rewiring for the slow loop as previously done with `UnswitchingAction::SlowLoopRewiring`, also see https://github.com/openjdk/jdk/pull/5185 for more details). However, the current fix does not update ctrl for the slow loop nodes - I've fixed that. >> >> I've reused the implementation of `clone_data_nodes_for_fast_loop()` but refactored and split that method into multiple methods to reuse some of them for the ctrl update of the slow loop nodes. 
>> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Revert "Add ResourceMark" > > This reverts commit dc1074d01b4bd52740e5e0396976232f268380e5. Marked as reviewed by thartmann (Reviewer). Ah, good catch. The current version looks good then! ------------- PR: https://git.openjdk.org/jdk/pull/11452 From rcastanedalo at openjdk.org Thu Dec 8 10:35:04 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 8 Dec 2022 10:35:04 GMT Subject: RFR: 8295116: C2: assert(dead->outcnt() == 0 && !dead->is_top()) failed: node must be dead In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 10:06:32 GMT, Christian Hagedorn wrote: > In `IfNode::fold_compares_helper()`, we merge two immediately following `If` nodes with a `CmpI` that share the left input into a single `If` with `CmpU`. > > In this case here, a graph is currently dying. The `If` nodes are not yet removed but data was already folded in such a way that the graph currently looks like this when trying to apply `IfNode::fold_compares()`: > > ![image](https://user-images.githubusercontent.com/17833009/206415585-083c6558-84d1-4787-b766-01be8b64bf10.png) > > In `IfNode::is_ctrl_folds()` we check if the dominating `If` is suited to apply this optimization. Here we check if the left inputs of the `CmpI` nodes are the same which is indeed the case because both `CmpI` have `top` as left input: > https://github.com/openjdk/jdk/blob/46cd457b0f78996a3f26e44452de8f8a66041f58/src/hotspot/share/opto/ifnode.cpp#L734-L736 > > We start merging `1641 If` and `1657 If`. During this process, we try to remove dead nodes that are no longer used: > https://github.com/openjdk/jdk/blob/46cd457b0f78996a3f26e44452de8f8a66041f58/src/hotspot/share/opto/ifnode.cpp#L1042-L1044 > > But in this specific setup, `adjusted_val` is `top` and we try to remove it because `outcnt()` for `top` is zero. 
This results in the assertion failure. > > The fix I propose is to not only check for equality of the left inputs in `IfNode::is_ctrl_folds()` but also check if one of them is `top`. > > I was only able to reproduce this bug with a replay file and a very specific seed when using `-XX:+StressIGVN`. > > Thanks, > Christian Nice analysis, looks good! ------------- Marked as reviewed by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/11581 From chagedorn at openjdk.org Thu Dec 8 11:23:29 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 8 Dec 2022 11:23:29 GMT Subject: RFR: 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph [v3] In-Reply-To: References: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com> Message-ID: On Wed, 7 Dec 2022 17:49:27 GMT, Christian Hagedorn wrote: >> The test cases of this bug reveal the same problem with `PhaseIdealLoop::create_new_if_for_predicate()` as in [JDK-8271954](https://bugs.openjdk.org/browse/JDK-8271954) (reusing pinned data nodes for different UCT paths into the UCT phi - see PR description of https://github.com/openjdk/jdk/pull/5185). While JDK-8271954 only fixed the usage of `PhaseIdealLoop::create_new_if_for_predicate()` for one specific case in loop unswitching, we now need this fix for other usages of `PhaseIdealLoop::create_new_if_for_predicate()` as well. I've found failing cases for most of the usages but I think we should always do a proper cloning as originally added with JDK-8271954. This is what I propose with this patch. >> >> To always do this cloning, I've replaced the `UnswitchingAction` by a `rewire_uncommon_proj_phi_inputs` bool that is false by default but can be set if we should only do a rewiring (we can still do the rewiring for the slow loop as previously done with `UnswitchingAction::SlowLoopRewiring`, also see https://github.com/openjdk/jdk/pull/5185 for more details). 
However, the current fix does not update ctrl for the slow loop nodes - I've fixed that. >> >> I've reused the implementation of `clone_data_nodes_for_fast_loop()` but refactored and split that method into multiple methods to reuse some of them for the ctrl update of the slow loop nodes. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Revert "Add ResourceMark" > > This reverts commit dc1074d01b4bd52740e5e0396976232f268380e5. Thanks Vladimir and Tobias for reviewing it again! I'll file an RFE to check the `ResourceMark` uses and possibly replace the arena for `_nodes` and `_types`. Testing looked good! I'll integrate it. ------------- PR: https://git.openjdk.org/jdk/pull/11452 From chagedorn at openjdk.org Thu Dec 8 11:27:01 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 8 Dec 2022 11:27:01 GMT Subject: Integrated: 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph In-Reply-To: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com> References: <8aYBVmKznsl1c3lU0ykT_WHRVZ5TRcpLTwkKgLMggqU=.fa223067-8e24-4f05-b8d7-721163cee55e@github.com> Message-ID: On Thu, 1 Dec 2022 12:26:56 GMT, Christian Hagedorn wrote: > The test cases of this bug reveal the same problem with `PhaseIdealLoop::create_new_if_for_predicate()` as in [JDK-8271954](https://bugs.openjdk.org/browse/JDK-8271954) (reusing pinned data nodes for different UCT paths into the UCT phi - see PR description of https://github.com/openjdk/jdk/pull/5185). While JDK-8271954 only fixed the usage of `PhaseIdealLoop::create_new_if_for_predicate()` for one specific case in loop unswitching, we now need this fix for other usages of `PhaseIdealLoop::create_new_if_for_predicate()` as well. 
I've found failing cases for most of the usages but I think we should always do a proper cloning as originally added with JDK-8271954. This is what I propose with this patch. > > To always do this cloning, I've replaced the `UnswitchingAction` by a `rewire_uncommon_proj_phi_inputs` bool that is false by default but can be set if we should only do a rewiring (we can still do the rewiring for the slow loop as previously done with `UnswitchingAction::SlowLoopRewiring`, also see https://github.com/openjdk/jdk/pull/5185 for more details). However, the current fix does not update ctrl for the slow loop nodes - I've fixed that. > > I've reused the implementation of `clone_data_nodes_for_fast_loop()` but refactored and split that method into multiple methods to reuse some of them for the ctrl update of the slow loop nodes. > > Thanks, > Christian This pull request has now been integrated. Changeset: 49b86224 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/49b86224aacc7fd8b4d3354a85d72ef636a18a12 Stats: 552 lines in 3 files changed: 455 ins; 32 del; 65 mod 8290850: C2: create_new_if_for_predicate() does not clone pinned phi input nodes resulting in a broken graph Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.org/jdk/pull/11452 From mdoerr at openjdk.org Thu Dec 8 11:35:06 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 8 Dec 2022 11:35:06 GMT Subject: RFR: 8298225: [AIX] Disable PPC64LE continuations on AIX [v3] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 16:56:39 GMT, Tyler Steele wrote: >> This small change adds an import to the generated ad_ppc.cpp file to allow it to find Continuations::enabled, and sets VMContinuations to false on AIX. > > Tyler Steele has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains four additional commits since the last revision: > > - Merge remote-tracking branch 'upstream/master' into build/aix/continuation-enabled > - Add continuation.hpp to adlc/main.cpp > - Set VMContinuations to false on AIX > - Restore 5 arg constructor for SystemProcess Mainline will get forked today. Please create a new PR in the JDK 20 stabilization repository once it's available. This one can get closed afterwards. src/hotspot/share/adlc/main.cpp line 232: > 230: AD.addInclude(AD._CPP_file, "opto/regmask.hpp"); > 231: AD.addInclude(AD._CPP_file, "opto/runtime.hpp"); > 232: AD.addInclude(AD._CPP_file, "runtime/continuation.hpp"); This adds it for all platforms. Isn't it sufficient to add it in the `source %{` section of the ad file? ------------- PR: https://git.openjdk.org/jdk/pull/11550 From thartmann at openjdk.org Thu Dec 8 11:41:04 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 8 Dec 2022 11:41:04 GMT Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v2] In-Reply-To: References: Message-ID: On Mon, 10 Oct 2022 06:56:28 GMT, Yi Yang wrote: >> Hi, can I have a review for this fix? LoadBNode::Ideal crashes after performing GVN right after EA. The bad IR is as follows: >> >> ![image](https://user-images.githubusercontent.com/5010047/183106710-3a518e5e-0b59-4c3c-aba4-8b6fcade3519.png) >> >> The memory input of Load#971 is Phi#1109 and the address input of Load#971 is AddP whose object base is CheckCastPP#335: >> >> The type of Phi#1109 is `byte[int:>=0]:exact+any *` while `byte[int:8]:NotNull:exact+any *,iid=177` is the type of CheckCastPP#335 due to EA. Because they have different alias indices, we hit the assertion at L226: >> >> https://github.com/openjdk/jdk/blob/b17a745d7f55941f02b0bdde83866aa5d32cce07/src/hotspot/share/opto/memnode.cpp#L207-L226 >> (t is `byte[int:>=0]:exact+any *`, t_adr is `byte[int:8]:NotNull:exact+any *,iid=177`). >> >> There is a long story. 
In the beginning, LoadB#971 is generated at array_copy_forward, and GVN transformed it iteratively: >> >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> ... 
>> >> In this case, we get alias index 5 from address input AddP#969, and step it through MergeMem#1046, we found Phi#1109 then, that's why LoadB->in(Mem) is changed from MergeMem#1046 to Phi#1109 (Which finally leads to crash). >> >> 1046 MergeMem === _ 1 160 389 389 1109 1 1 389 1 1 1 1 1 1 1 1 1 1 1 1 1 709 709 709 709 882 888 894 190 190 912 191 [[ 1025 1021 1017 1013 1009 1005 1002 1001 998 996 991 986 981 976 971 966 962 961 960 121 122 123 124 1027 ]] >> >> >> After applying this patch, some related nodes are pushed into the GVN worklist, before stepping through MergeMem#1046, the address input is already changed to AddP#473. i.e., we get alias index 32 from address input AddP#473, and step it through MergeMem#1046, we found StoreB#191 then,LoadB->in(Mem) is changed from MergeMem#1046 to StoreB#191. >> >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1046 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) 
DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 
191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 390 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> ... >> >> The well-formed IR looks like this: >> ![image](https://user-images.githubusercontent.com/5010047/183239456-7096ea66-6fca-4c84-8f46-8c42d10b686a.png) >> >> Thanks for your patience. 
> > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > fix Overall, the fix looks reasonable to me. I added some comments. Please merge with master, currently the build fails with: src/hotspot/share/opto/type.cpp:4920:16: error: no declaration matches 'const TypePtr* TypeAryPtr::with_offset(int) const' [2022-12-08T10:34:55,711Z] 4920 | const TypePtr *TypeAryPtr::with_offset(int offset) const { [2022-12-08T10:34:55,711Z] | ^~~~~~~~~~ [2022-12-08T10:34:55,711Z] /src/hotspot/share/opto/type.cpp:4892:19: note: candidate is: 'virtual const TypeAryPtr* TypeAryPtr::with_offset(intptr_t) const' [2022-12-08T10:34:55,711Z] 4892 | const TypeAryPtr* TypeAryPtr::with_offset(intptr_t offset) const { src/hotspot/share/opto/memnode.cpp line 225: > 223: t->is_oopptr()->cast_to_exactness(true) > 224: ->is_oopptr()->cast_to_ptr_type(t_oop->ptr()) > 225: ->is_oopptr()->cast_to_instance_id(t_oop->instance_id()); I would prefer this formatting, similar to the assert. Suggestion: t->is_oopptr()->cast_to_exactness(true) ->is_oopptr()->cast_to_ptr_type(t_oop->ptr()) ->is_oopptr()->cast_to_instance_id(t_oop->instance_id()); src/hotspot/share/opto/memnode.cpp line 230: > 228: ->cast_to_size(t_oop->is_aryptr()->size()) > 229: ->with_offset(t_oop->is_aryptr()->offset()) > 230: ->is_aryptr(); Do we need `cast_to_stable` as well here? 
src/hotspot/share/opto/type.cpp line 4616: > 4614: } > 4615: > 4616: const TypePtr *TypeAryPtr::with_offset(int offset) const { I think you can use the existing https://github.com/openjdk/jdk/blob/49b86224aacc7fd8b4d3354a85d72ef636a18a12/src/hotspot/share/opto/type.cpp#L4892 test/hotspot/jtreg/compiler/c2/TestGVNCrash.java line 30: > 28: * @summary GVN Crash: assert() failed: correct memory chain > 29: * > 30: * @run main/othervm -XX:CompileCommand=compileonly,compiler.c2.TestGVNCrash::test compiler.c2.TestGVNCrash This does not reproduce the issue with latest JDK 20, please add `-Xbatch -XX:+UnlockDiagnosticVMOptions -XX:+StressIGVN` and `@key stress randomness`. ------------- Changes requested by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9777 From chagedorn at openjdk.org Thu Dec 8 12:01:07 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 8 Dec 2022 12:01:07 GMT Subject: RFR: 8295116: C2: assert(dead->outcnt() == 0 && !dead->is_top()) failed: node must be dead In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 10:06:32 GMT, Christian Hagedorn wrote: > In `IfNode::fold_compares_helper()`, we merge two immediately following `If` nodes with a `CmpI` that share the left input into a single `If` with `CmpU`. > > In this case here, a graph is currently dying. The `If` nodes are not yet removed but data was already folded in such a way that the graph currently looks like this when trying to apply `IfNode::fold_compares()`: > > ![image](https://user-images.githubusercontent.com/17833009/206415585-083c6558-84d1-4787-b766-01be8b64bf10.png) > > In `IfNode::is_ctrl_folds()` we check if the dominating `If` is suited to apply this optimization. 
Here we check if the left inputs of the `CmpI` nodes are the same which is indeed the case because both `CmpI` have `top` as left input: > https://github.com/openjdk/jdk/blob/46cd457b0f78996a3f26e44452de8f8a66041f58/src/hotspot/share/opto/ifnode.cpp#L734-L736 > > We start merging `1641 If` and `1657 If`. During this process, we try to remove dead nodes that are no longer used: > https://github.com/openjdk/jdk/blob/46cd457b0f78996a3f26e44452de8f8a66041f58/src/hotspot/share/opto/ifnode.cpp#L1042-L1044 > > But in this specific setup, `adjusted_val` is `top` and we try to remove it because `outcnt()` for `top` is zero. This results in the assertion failure. > > The fix I propose is to not only check for equality of the left inputs in `IfNode::is_ctrl_folds()` but also check if one of them is `top`. > > I was only able to reproduce this bug with a replay file and a very specific seed when using `-XX:+StressIGVN`. > > Thanks, > Christian Thanks Tobias and Roberto for the quick reviews! I'll integrate this now to get it in before the fork. ------------- PR: https://git.openjdk.org/jdk/pull/11581 From chagedorn at openjdk.org Thu Dec 8 12:04:35 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 8 Dec 2022 12:04:35 GMT Subject: Integrated: 8295116: C2: assert(dead->outcnt() == 0 && !dead->is_top()) failed: node must be dead In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 10:06:32 GMT, Christian Hagedorn wrote: > In `IfNode::fold_compares_helper()`, we merge two immediately following `If` nodes with a `CmpI` that share the left input into a single `If` with `CmpU`. > > In this case here, a graph is currently dying. 
The `If` nodes are not yet removed but data was already folded in such a way that the graph currently looks like this when trying to apply `IfNode::fold_compares()`: > > ![image](https://user-images.githubusercontent.com/17833009/206415585-083c6558-84d1-4787-b766-01be8b64bf10.png) > > In `IfNode::is_ctrl_folds()` we check if the dominating `If` is suited to apply this optimization. Here we check if the left inputs of the `CmpI` nodes are the same which is indeed the case because both `CmpI` have `top` as left input: > https://github.com/openjdk/jdk/blob/46cd457b0f78996a3f26e44452de8f8a66041f58/src/hotspot/share/opto/ifnode.cpp#L734-L736 > > We start merging `1641 If` and `1657 If`. During this process, we try to remove dead nodes that are no longer used: > https://github.com/openjdk/jdk/blob/46cd457b0f78996a3f26e44452de8f8a66041f58/src/hotspot/share/opto/ifnode.cpp#L1042-L1044 > > But in this specific setup, `adjusted_val` is `top` and we try to remove it because `outcnt()` for `top` is zero. This results in the assertion failure. > > The fix I propose is to not only check for equality of the left inputs in `IfNode::is_ctrl_folds()` but also check if one of them is `top`. > > I was only able to reproduce this bug with a replay file and a very specific seed when using `-XX:+StressIGVN`. > > Thanks, > Christian This pull request has now been integrated. 
Changeset: 94575d14 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/94575d14f47e2dfb11b671bce26b69270b6bb3c8 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8295116: C2: assert(dead->outcnt() == 0 && !dead->is_top()) failed: node must be dead Reviewed-by: thartmann, rcastanedalo ------------- PR: https://git.openjdk.org/jdk/pull/11581 From thartmann at openjdk.org Thu Dec 8 12:09:08 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 8 Dec 2022 12:09:08 GMT Subject: Integrated: 8298272: Clean up ProblemList In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 15:03:52 GMT, Tobias Hartmann wrote: > Removed two entries from the problem list that refer to issues that were fixed/closed. Tests are running. > > Thanks, > Tobias This pull request has now been integrated. Changeset: d8ef60b4 Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/d8ef60b406a9e8fe6cc6b7be0b74e45de38604c5 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod 8298272: Clean up ProblemList Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/11561 From thartmann at openjdk.org Thu Dec 8 12:29:53 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 8 Dec 2022 12:29:53 GMT Subject: RFR: 8257197: Add additional verification code to PhaseCCP [v4] In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 08:46:14 GMT, Emanuel Peter wrote: >> **Targeted for JDK-21.** >> >> We have had many bugs that could be traced back to missing optimizations during PhaseCCP. >> Often the problem is that a node `x` has an input (of input of input) `y` which is modified, but `y` does not notify `x` (does not push it to the worklist). For one, this is a missed opportunity to optimize, but it can also lead to failed assumptions later: often we assume that all that can be optimized is already optimized.
>> This verification helps us debug faster, and can also help when `Value` optimizations suddenly do further-reaching traversals, which would require further-reaching notifications. >> >> Sadly, the verification is not total: we have some exceptions. This is especially true for `LoadNode`, which performs a walk up the memory inputs (which can go arbitrarily far) to look for related `StoreNode`s. Notification would thus have to put very many nodes on the worklist. The question is whether the potential additional optimization is worth the compile-time. If we decide yes, then we might want to implement a listener-style notification: when a node visits inputs during `Value`, it could subscribe to all (or the relevant) visited input-nodes for future updates. Currently, we mostly just do fixed 1-hop or 2-hop notification of output nodes. >> >> FYI: I plan to do a similar verification, and a refactoring of `PhaseCCP::push_child_nodes_to_worklist` and `PhaseIterGVN::add_users_to_worklist` in [JDK-8298094](https://bugs.openjdk.org/browse/JDK-8298094). > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Apply suggestions from code review > > Thanks Christian > > Co-authored-by: Christian Hagedorn Looks good! ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11529 From stsypanov at openjdk.org Thu Dec 8 12:45:29 2022 From: stsypanov at openjdk.org (Sergey Tsypanov) Date: Thu, 8 Dec 2022 12:45:29 GMT Subject: RFR: 8298380: Clean up redundant array length checks in JDK code base Message-ID: A newer version of IntelliJ IDEA introduces a new [inspection](https://youtrack.jetbrains.com/issue/IDEA-301797/IDEA-should-report-redundant-array-length-check-in-certain-cases) that detects redundant array length checks in snippets like void iterate(T[] items) { if (items.length == 0) { return; } for (T item : items) { //...
} } Here if (items.length == 0) { return; } is redundant and can be removed, as the length check is performed by the for-each loop. ------------- Commit messages: - 8298380: Clean up redundant array length checks in JDK code base Changes: https://git.openjdk.org/jdk/pull/11589/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11589&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298380 Stats: 51 lines in 8 files changed: 0 ins; 14 del; 37 mod Patch: https://git.openjdk.org/jdk/pull/11589.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11589/head:pull/11589 PR: https://git.openjdk.org/jdk/pull/11589 From qamai at openjdk.org Thu Dec 8 13:46:49 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 8 Dec 2022 13:46:49 GMT Subject: RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v14] In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 09:21:47 GMT, Vladimir Kozlov wrote: >> Rerunning DaCapo and Renaissance which shows some variations. > >> @vnkozlov I believe `VectorTestNode` is only used in Vector API, which should not affect those benchmarks? > > Yes, but you have changes in general code related to `Cmp` node. > I see that DaCapo doesn't show regression after rerun. Renaissance is still running and numbers are "all over places" - it is not stable. > > Anyway, I am approving these changes based on data I got. @vnkozlov Ah I get it, thanks a lot for your reviews, can I merge the patch now?
------------- PR: https://git.openjdk.org/jdk/pull/9855 From epeter at openjdk.org Thu Dec 8 14:32:50 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 8 Dec 2022 14:32:50 GMT Subject: RFR: 8297642: PhaseIdealLoop::only_has_infinite_loops must detect all loops that never lead to termination [v2] In-Reply-To: References: Message-ID: <5drE3J8CwkMg_-FnaTAve0b19tgwXt-_oOjhPAk8DKw=.dbb477ce-06c9-48aa-b43e-fd0afa593d18@github.com> > **Will hold this back until JDK21**, unless we decide it is a regression-fix for [JDK-8294217](https://bugs.openjdk.org/browse/JDK-8294217). The problem is only a not-quite-correct assert. But the problem is not limited to infinite loops, as the example below shows it can happen with reducible loops. > > **Background:** > We have an assert that checks that `has_loops` is true when it should be. If we have `has_loops == false` even though there are loops, we will not perform loop-opts in `Compile::Optimize`. > > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/loopnode.cpp#L4285-L4293 > > Generally, we want to verify, that if we just found loops (`_ltree_root->_child != NULL`) that `has_loops == true`. > There are a few cases where we do not care if we miss loop-opts: > - We only have infinite loops (`only_has_infinite_loops()`). Infinite loops never terminate anyway, so why make them faster? Plus, a loop is only infinite if it has no loop-exit other than a `NeverBranch` exit, even uncommon traps, loop-limit checks etc are exits. Thus, if a loop does anything interesting, it probably is not such a "true infinite loop". They can be more easily forced to occur by setting `-XX:PerMethodTrapLimit=0`. > - We have only exception edges. > > Note that once we check the assert, we update `has_loops`. So if all loops disappeared, we avoid doing loop-opts henceforth. 
> > **Current implementation of PhaseIdealLoop::only_has_infinite_loops** > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/loopnode.cpp#L4183-L4185 > > We check for loop exits; if there is one, the loop should not be infinite. > > **The Problem** > > An infinite loop can have an inner loop, that subsequently loses its exit. It becomes its own infinite loop, and floats out of the outer loop. Where the outer loop enters into the former inner loop, we now have a loop-exit for the outer loop. The next time we run `build_loop_tree` and check the assert, it can fail, as `PhaseIdealLoop::only_has_infinite_loops` finds that new loop-exit from outer to inner loop. > > Example: `TestOnlyInfiniteLoops::test_simple` (click on images to see them larger) > > Nested infinite loop before loop-opts: > > > After `build_loop_tree`, the outer loop is detected as infinite, and `NeverBranch` is inserted. No loop is attached to loop-tree, as we do not attach newly discovered infinite loops. We will set `has_loops == false` after first loop-opts round. > > > During IGVN of first loop-opts round, some edges die. `88 IfTrue` is dominated by `52 IfTrue` (dominator info only becomes present during loop-opts). The outer loop now exits into the inner loop. > > > The second loop-opts round detects the former inner loop as an infinite loop, inserts NeverBranch. Once we run the assert, we see that we have `has_loops == false`, but `PhaseIdealLoop::only_has_infinite_loops` finds the former outer loop's exit. > > > **Solution** > If we ever only have infinite loops, then there will never be a way to get from any of those loops down to Root, except through a NeverBranch exit. So even if such an (outer) infinite loop ever has an exit, that exit cannot ever lead to Root, other than a NeverBranch exit. Thus, it is ok to still consider that loop as "infinite", even though it itself has an exit - that exit will never lead to termination.
> Thus, I changed the `PhaseIdealLoop::only_has_infinite_loops` to check if any of the loops ever connect down to Root, except through NeverBranch nodes. > > **Alternative Fix** > An alternative idea to my fix here: just replace the infinite loop with an uncommon trap, and if the infinite loop is ever hit, revert to the interpreter. If we do not care to optimize infinite loops, then why even compile them? > Advantages of that idea: No need for `NeverBranch`, no need for special-casing infinite loops. > > I have another bug where assumptions are not true, because of infinite loops, and especially infinite loops not being attached to the loop-tree [JDK-8296318](https://bugs.openjdk.org/browse/JDK-8296318) > > I'm looking forward to your feedback, > Emanuel Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Apply suggestions from code review Thanks Tobias for the suggestion Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11473/files - new: https://git.openjdk.org/jdk/pull/11473/files/64852aa2..56a7c57b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11473&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11473&range=00-01 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/11473.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11473/head:pull/11473 PR: https://git.openjdk.org/jdk/pull/11473 From thartmann at openjdk.org Thu Dec 8 14:32:52 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 8 Dec 2022 14:32:52 GMT Subject: RFR: 8297642: PhaseIdealLoop::only_has_infinite_loops must detect all loops that never lead to termination [v2] In-Reply-To: <5drE3J8CwkMg_-FnaTAve0b19tgwXt-_oOjhPAk8DKw=.dbb477ce-06c9-48aa-b43e-fd0afa593d18@github.com> References: <5drE3J8CwkMg_-FnaTAve0b19tgwXt-_oOjhPAk8DKw=.dbb477ce-06c9-48aa-b43e-fd0afa593d18@github.com> Message-ID: On Thu, 8 Dec 2022 14:28:39
GMT, Emanuel Peter wrote: >> **Will hold this back until JDK21**, unless we decide it is a regression-fix for [JDK-8294217](https://bugs.openjdk.org/browse/JDK-8294217). The problem is only a not-quite-correct assert. But the problem is not limited to infinite loops, as the example below shows it can happen with reducible loops. >> >> **Background:** >> We have an assert that checks that `has_loops` is true when it should be. If we have `has_loops == false` even though there are loops, we will not perform loop-opts in `Compile::Optimize`. >> >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/loopnode.cpp#L4285-L4293 >> >> Generally, we want to verify, that if we just found loops (`_ltree_root->_child != NULL`) that `has_loops == true`. >> There are a few cases where we do not care if we miss loop-opts: >> - We only have infinite loops (`only_has_infinite_loops()`). Infinite loops never terminate anyway, so why make them faster? Plus, a loop is only infinite if it has no loop-exit other than a `NeverBranch` exit, even uncommon traps, loop-limit checks etc are exits. Thus, if a loop does anything interesting, it probably is not such a "true infinite loop". They can be more easily forced to occur by setting `-XX:PerMethodTrapLimit=0`. >> - We have only exception edges. >> >> Note that once we check the assert, we update `has_loops`. So if all loops disappeared, we avoid doing loop-opts henceforth. >> >> **Current implementation of PhaseIdealLoop::only_has_infinite_loops** >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/loopnode.cpp#L4183-L4185 >> >> We check for loop exits; if there is one, the loop should not be infinite. >> >> **The Problem** >> >> An infinite loop can have an inner loop, that subsequently loses its exit. It becomes its own infinite loop, and floats out of the outer loop.
Where the outer loop enters into the former inner loop, we now have a loop-exit for the outer loop. The next time we run `build_loop_tree` and check the assert, it can fail, as `PhaseIdealLoop::only_has_infinite_loops` finds that new loop-exit from outer to inner loop. >> >> Example: `TestOnlyInfiniteLoops::test_simple` (click on images to see them larger) >> >> Nested infinite loop before loop-opts: >> >> >> After `build_loop_tree`, the outer loop is detected as infinite, and `NeverBranch` is inserted. No loop is attached to loop-tree, as we do not attach newly discovered infinite loops. We will set `has_loops == false` after first loop-opts round. >> >> >> During IGVN of first loop-opts round, some edges die. `88 IfTrue` is dominated by `52 IfTrue` (dominator info only becomes present during loop-opts). The outer loop now exits into the inner loop. >> >> >> The second loop-opts round detects the former inner loop as an infinite loop, inserts NeverBranch. Once we run the assert, we see that we have `has_loops == false`, but `PhaseIdealLoop::only_has_infinite_loops` finds the former outer loop's exit. >> >> >> **Solution** >> If we ever only have infinite loops, then there will never be a way to get from any of those loops down to Root, except through a NeverBranch exit. So even if such an (outer) infinite loop ever has an exit, that exit cannot ever lead to Root, other than a NeverBranch exit. Thus, it is ok to still consider that loop as "infinite", even though it itself has an exit - that exit will never lead to termination. >> Thus, I changed the `PhaseIdealLoop::only_has_infinite_loops` to check if any of the loops ever connect down to Root, except through NeverBranch nodes. >> >> **Alternative Fix** >> An alternative idea to my fix here: just replace the infinite loop with an uncommon trap, and if the infinite loop is ever hit, revert to the interpreter. If we do not care to optimize infinite loops, then why even compile them?
>> Advantages of that idea: No need for `NeverBranch`, no need for special-casing infinite loops. >> >> I have another bug where assumptions are not true, because of infinite loops, and especially infinite loops not being attached to the loop-tree [JDK-8296318](https://bugs.openjdk.org/browse/JDK-8296318) >> >> I'm looking forward to your feedback, >> Emanuel > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Apply suggestions from code review > > Thanks Tobias for the suggestion > > Co-authored-by: Tobias Hartmann Marked as reviewed by thartmann (Reviewer). test/hotspot/jtreg/compiler/loopopts/TestOnlyInfiniteLoopsMain.java line 32: > 30: * -XX:CompileCommand=compileonly,TestOnlyInfiniteLoops::test* > 31: * -XX:-TieredCompilation -Xbatch -Xcomp > 32: * TestOnlyInfiniteLoopsMain Suggestion: * @run main/othervm * -XX:CompileCommand=compileonly,TestOnlyInfiniteLoops::test* * -XX:-TieredCompilation -Xcomp * TestOnlyInfiniteLoopsMain test/hotspot/jtreg/compiler/loopopts/TestOnlyInfiniteLoopsMain.java line 42: > 40: * -XX:-TieredCompilation -Xbatch -Xcomp > 41: * -XX:PerMethodTrapLimit=0 > 42: * TestOnlyInfiniteLoopsMain Suggestion: * @run main/othervm * -XX:CompileCommand=compileonly,TestOnlyInfiniteLoops::test* * -XX:-TieredCompilation -Xcomp * -XX:PerMethodTrapLimit=0 * TestOnlyInfiniteLoopsMain ------------- PR: https://git.openjdk.org/jdk/pull/11473 From epeter at openjdk.org Thu Dec 8 15:29:59 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 8 Dec 2022 15:29:59 GMT Subject: RFR: 8297642: PhaseIdealLoop::only_has_infinite_loops must detect all loops that never lead to termination [v2] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 10:28:57 GMT, Roland Westrelin wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Apply suggestions from code review >> >> Thanks Tobias for the suggestion >> >> Co-authored-by: 
Tobias Hartmann > > Looks reasonable to me. Thanks @rwestrel and @TobiHartmann for the help and reviews! Sanity tested tier1 again. ------------- PR: https://git.openjdk.org/jdk/pull/11473 From epeter at openjdk.org Thu Dec 8 15:33:35 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 8 Dec 2022 15:33:35 GMT Subject: Integrated: 8297642: PhaseIdealLoop::only_has_infinite_loops must detect all loops that never lead to termination In-Reply-To: References: Message-ID: <02IsdSRgcFtZt2YRY690yp56NgtQKwEDbtK5tv2nd7g=.46fcb0fe-4960-440f-8d01-5fdd486cfbc3@github.com> On Fri, 2 Dec 2022 08:11:06 GMT, Emanuel Peter wrote: > The bug was a regression-fix for [JDK-8294217](https://bugs.openjdk.org/browse/JDK-8294217). The problem is only a not-quite-correct assert. But the problem is not limited to infinite loops, as the example below shows it can happen with reducible loops. > > **Background:** > We have an assert that checks that `has_loops` is true when it should be. If we have `has_loops == false` even though there are loops, we will not perform loop-opts in `Compile::Optimize`. > > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/loopnode.cpp#L4285-L4293 > > Generally, we want to verify, that if we just found loops (`_ltree_root->_child != NULL`) that `has_loops == true`. > There are a few cases where we do not care if we miss loop-opts: > - We only have infinite loops (`only_has_infinite_loops()`). Infinite loops never terminate anyway, so why make them faster? Plus, a loop is only infinite if it has no loop-exit other than a `NeverBranch` exit, even uncommon traps, loop-limit checks etc are exits. Thus, if a loop does anything interesting, it probably is not such a "true infinite loop". They can be more easily forced to occur by setting `-XX:PerMethodTrapLimit=0`. > - We have only exception edges. > > Note that once we check the assert, we update `has_loops`. 
So if all loops disappeared, we avoid doing loop-opts henceforth. > > **Current implementation of PhaseIdealLoop::only_has_infinite_loops** > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/loopnode.cpp#L4183-L4185 > > We check for loop exits; if there is one, the loop should not be infinite. > > **The Problem** > > An infinite loop can have an inner loop, that subsequently loses its exit. It becomes its own infinite loop, and floats out of the outer loop. Where the outer loop enters into the former inner loop, we now have a loop-exit for the outer loop. The next time we run `build_loop_tree` and check the assert, it can fail, as `PhaseIdealLoop::only_has_infinite_loops` finds that new loop-exit from outer to inner loop. > > Example: `TestOnlyInfiniteLoops::test_simple` (click on images to see them larger) > > Nested infinite loop before loop-opts: > > > After `build_loop_tree`, the outer loop is detected as infinite, and `NeverBranch` is inserted. No loop is attached to loop-tree, as we do not attach newly discovered infinite loops. We will set `has_loops == false` after first loop-opts round. > > > During IGVN of first loop-opts round, some edges die. `88 IfTrue` is dominated by `52 IfTrue` (dominator info only becomes present during loop-opts). The outer loop now exits into the inner loop. > > > The second loop-opts round detects the former inner loop as an infinite loop, inserts NeverBranch. Once we run the assert, we see that we have `has_loops == false`, but `PhaseIdealLoop::only_has_infinite_loops` finds the former outer loop's exit. > > > **Solution** > If we ever only have infinite loops, then there will never be a way to get from any of those loops down to Root, except through a NeverBranch exit. So even if such an (outer) infinite loop ever has an exit, that exit cannot ever lead to Root, other than a NeverBranch exit.
Thus, it is ok to still consider that loop as "infinite", even though it itself has an exit - that exit will never lead to termination. > Thus, I changed the `PhaseIdealLoop::only_has_infinite_loops` to check if any of the loops ever connect down to Root, except through NeverBranch nodes. > > **Alternative Fix** > An alternative idea to my fix here: just replace the infinite loop with an uncommon trap, and if the infinite loop is ever hit, revert to the interpreter. If we do not care to optimize infinite loops, then why even compile them? > Advantages of that idea: No need for `NeverBranch`, no need for special-casing infinite loops. > > I have another bug where assumptions are not true, because of infinite loops, and especially infinite loops not being attached to the loop-tree [JDK-8296318](https://bugs.openjdk.org/browse/JDK-8296318) > > I'm looking forward to your feedback, > Emanuel This pull request has now been integrated. Changeset: d562d3fc Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/d562d3fcbe22a0443037c5b447e1a41401275814 Stats: 184 lines in 3 files changed: 153 ins; 9 del; 22 mod 8297642: PhaseIdealLoop::only_has_infinite_loops must detect all loops that never lead to termination Reviewed-by: thartmann, roland ------------- PR: https://git.openjdk.org/jdk/pull/11473 From tsteele at openjdk.org Thu Dec 8 16:11:10 2022 From: tsteele at openjdk.org (Tyler Steele) Date: Thu, 8 Dec 2022 16:11:10 GMT Subject: RFR: 8298225: [AIX] Disable PPC64LE continuations on AIX [v3] In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 11:29:28 GMT, Martin Doerr wrote: >> Tyler Steele has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains four additional commits since the last revision: >> >> - Merge remote-tracking branch 'upstream/master' into build/aix/continuation-enabled >> - Add continuation.hpp to adlc/main.cpp >> - Set VMContinuations to false on AIX >> - Restore 5 arg constructor for SystemProcess > > src/hotspot/share/adlc/main.cpp line 232: > >> 230: AD.addInclude(AD._CPP_file, "opto/regmask.hpp"); >> 231: AD.addInclude(AD._CPP_file, "opto/runtime.hpp"); >> 232: AD.addInclude(AD._CPP_file, "runtime/continuation.hpp"); > > This adds it for all platforms. Isn't it sufficient to add it in `source %{` section of the ad file? I tried that first, but may have done so incorrectly. I am trying again. ------------- PR: https://git.openjdk.org/jdk/pull/11550 From roland at openjdk.org Thu Dec 8 16:13:30 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 8 Dec 2022 16:13:30 GMT Subject: RFR: 8298353: C2 fails with assert(opaq->outcnt() == 1 && opaq->in(1) == limit) failed Message-ID: After some unrolling, when C2 runs loop opts with split if enabled after CCP, the limit of the main loop of the counted loop (the second loop in the test) is: limit - 3 That commons with the limit - 3 returned from the first loop. limit - 3 is thus in the first loop's body but only used outside of the loop. It has 3 uses: the return in the first loop, the OpaqueZeroTripGuard, and the loop exit condition of the main loop. In the same pass of loop opts, limit - 3 is cloned out of the loop 3 times for its 3 uses and unrolling is attempted. Because the OpaqueZeroTripGuard and the loop exit condition now use 2 different nodes (until they common at next igvn), the assert fires. The fix I propose restores the behavior before the introduction of OpaqueZeroTripGuard, which is to not sink a node if it has an Opaque1 use.
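As a rough illustration of the failure mode described above, here is a toy Java model. It is only a sketch under stated assumptions: `Node`, `sink`, `sameLimit` and `demo` are hypothetical names that do not correspond to HotSpot's C++ node classes, the loop-exit compare is reduced to a single input, and an "Opaque1 use" is recognised purely by name.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these classes are hypothetical and are not HotSpot's Node API.
public class SinkSketch {
    static final class Node {
        final String name;
        final List<Node> inputs = new ArrayList<>();
        final List<Node> uses = new ArrayList<>();

        Node(String name, Node... ins) {
            this.name = name;
            for (Node in : ins) {
                inputs.add(in);
                in.uses.add(this);
            }
        }

        // Assumption for the sketch: any node whose name starts with "Opaque" counts as an Opaque1 use.
        boolean isOpaque1() {
            return name.startsWith("Opaque");
        }
    }

    // Sink 'n' out of a loop by giving every use its own clone of 'n'.
    // With guardOpaque set, refuse to sink when any use is an Opaque1-like node.
    static boolean sink(Node n, boolean guardOpaque) {
        if (guardOpaque) {
            for (Node use : n.uses) {
                if (use.isOpaque1()) {
                    return false;
                }
            }
        }
        for (Node use : new ArrayList<>(n.uses)) {
            Node clone = new Node(n.name + "'", n.inputs.toArray(new Node[0]));
            use.inputs.set(use.inputs.indexOf(n), clone);
            clone.uses.add(use);
            n.uses.remove(use);
        }
        return true;
    }

    // Mirrors the shape of the failing assert: the zero-trip guard and the
    // loop exit test must see the very same limit node.
    static boolean sameLimit(Node opaque, Node exitCmp) {
        return opaque.inputs.get(0) == exitCmp.inputs.get(0);
    }

    // Builds "limit - 3" with three uses, sinks it, and reports whether the invariant still holds.
    public static boolean demo(boolean guardOpaque) {
        Node limit = new Node("limit");
        Node three = new Node("3");
        Node sub = new Node("SubI", limit, three);
        Node ret = new Node("Return", sub);
        Node opaque = new Node("OpaqueZeroTripGuard", sub);
        Node exitCmp = new Node("CmpI", sub);
        sink(sub, guardOpaque);
        return sameLimit(opaque, exitCmp);
    }

    public static void main(String[] args) {
        System.out.println("unguarded sinking keeps invariant: " + demo(false));
        System.out.println("guarded sinking keeps invariant:   " + demo(true));
    }
}
```

In this model, unguarded sinking hands the guard and the exit test two distinct clones, so the identity check fails, while the guarded variant leaves the shared node in place.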
------------- Commit messages: - fix - test Changes: https://git.openjdk.org/jdk/pull/11596/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11596&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298353 Stats: 80 lines in 2 files changed: 79 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11596.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11596/head:pull/11596 PR: https://git.openjdk.org/jdk/pull/11596 From kvn at openjdk.org Thu Dec 8 16:21:21 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 8 Dec 2022 16:21:21 GMT Subject: RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v14] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 14:24:39 GMT, Quan Anh Mai wrote: >> This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and doing a `CmpI` to get back the flags. As a result, `VectorMask::alltrue` is compiled into machine code: >> >> vptest xmm0, xmm1 >> jb if_true >> if_false: >> >> instead of: >> >> vptest xmm0, xmm1 >> setb r10 >> movzbl r10 >> testl r10 >> jne if_true >> if_false: >> >> The results of `jdk.incubator.vector.ArrayMismatchBenchmark` show noticeable improvements: >> >> Before After >> Benchmark Prefix Size Mode Cnt Score Error Score Error Units Change >> ArrayMismatchBenchmark.mismatchVectorByte 0.5 9 thrpt 10 217345.383 ± 8316.444 222279.381 ± 2660.983 ops/ms +2.3% >> ArrayMismatchBenchmark.mismatchVectorByte 0.5 257 thrpt 10 113918.406 ± 1618.836 116268.691 ± 1291.899 ops/ms +2.1% >> ArrayMismatchBenchmark.mismatchVectorByte 0.5 100000 thrpt 10 702.066 ± 72.862 797.806 ± 16.429 ops/ms +13.6% >> ArrayMismatchBenchmark.mismatchVectorByte 1.0 9 thrpt 10 146096.564 ± 2401.258 145338.910 ±
687.453 ops/ms -0.5% >> ArrayMismatchBenchmark.mismatchVectorByte 1.0 257 thrpt 10 60598.181 ± 1259.397 69041.519 ± 1073.156 ops/ms +13.9% >> ArrayMismatchBenchmark.mismatchVectorByte 1.0 100000 thrpt 10 316.814 ± 10.975 408.770 ± 5.281 ops/ms +29.0% >> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 9 thrpt 10 195674.549 ± 1200.166 188482.433 ± 1872.076 ops/ms -3.7% >> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 257 thrpt 10 44357.169 ± 473.013 42293.411 ± 2838.255 ops/ms -4.7% >> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 100000 thrpt 10 68.199 ± 5.410 67.628 ± 3.241 ops/ms -0.8% >> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 9 thrpt 10 107722.450 ± 1677.607 111060.400 ± 982.230 ops/ms +3.1% >> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 257 thrpt 10 16692.645 ± 1002.599 21440.506 ± 1618.266 ops/ms +28.4% >> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 100000 thrpt 10 32.984 ± 0.548 33.202 ± 2.365 ops/ms +0.7% >> ArrayMismatchBenchmark.mismatchVectorInt 0.5 9 thrpt 10 335458.217 ± 3154.842 379944.254 ± 5703.134 ops/ms +13.3% >> ArrayMismatchBenchmark.mismatchVectorInt 0.5 257 thrpt 10 58505.302 ± 786.312 56721.368 ± 2497.052 ops/ms -3.0% >> ArrayMismatchBenchmark.mismatchVectorInt 0.5 100000 thrpt 10 133.037 ± 11.415 139.537 ± 4.667 ops/ms +4.9% >> ArrayMismatchBenchmark.mismatchVectorInt 1.0 9 thrpt 10 117943.802 ± 2281.349 112409.365 ± 2110.055 ops/ms -4.7% >> ArrayMismatchBenchmark.mismatchVectorInt 1.0 257 thrpt 10 27060.015 ± 795.619 33756.613 ± 826.533 ops/ms +24.7% >> ArrayMismatchBenchmark.mismatchVectorInt 1.0 100000 thrpt 10 57.558 ± 8.927 66.951 ± 4.381 ops/ms +16.3% >> ArrayMismatchBenchmark.mismatchVectorLong 0.5 9 thrpt 10 182963.715 ± 1042.497 182438.405 ± 2120.832 ops/ms -0.3% >> ArrayMismatchBenchmark.mismatchVectorLong 0.5 257 thrpt 10 36672.215 ± 614.821 35397.398 ± 1609.235 ops/ms -3.5% >> ArrayMismatchBenchmark.mismatchVectorLong 0.5 100000 thrpt 10 66.438 ± 2.142 65.427 ±
2.270 ops/ms -1.5% >> ArrayMismatchBenchmark.mismatchVectorLong 1.0 9 thrpt 10 110393.047 ± 497.853 115165.845 ± 5381.674 ops/ms +4.3% >> ArrayMismatchBenchmark.mismatchVectorLong 1.0 257 thrpt 10 14720.765 ± 661.350 19871.096 ± 201.464 ops/ms +35.0% >> ArrayMismatchBenchmark.mismatchVectorLong 1.0 100000 thrpt 10 30.760 ± 0.821 31.933 ± 1.352 ops/ms +3.8% >> >> I have not been able to conduct thorough testing on AVX512 and AArch64 so any help would be invaluable. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > fix test Yes, you can integrate. ------------- PR: https://git.openjdk.org/jdk/pull/9855 From mdoerr at openjdk.org Thu Dec 8 16:22:06 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 8 Dec 2022 16:22:06 GMT Subject: RFR: 8298225: [AIX] Disable PPC64LE continuations on AIX [v3] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 16:56:39 GMT, Tyler Steele wrote: >> This small change adds an import to the generated ad_ppc.cpp file to allow it to find Continuations::enabled, and sets VMContinuations to false on AIX. > > Tyler Steele has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Merge remote-tracking branch 'upstream/master' into build/aix/continuation-enabled > - Add continuation.hpp to adlc/main.cpp > - Set VMContinuations to false on AIX > - Restore 5 arg constructor for SystemProcess https://github.com/openjdk/jdk20 is now open for P3 bug fixes. (They will get merged into this repo after some time.)
-------------

PR: https://git.openjdk.org/jdk/pull/11550

From tsteele at openjdk.org Thu Dec 8 16:26:18 2022
From: tsteele at openjdk.org (Tyler Steele)
Date: Thu, 8 Dec 2022 16:26:18 GMT
Subject: RFR: 8298225: [AIX] Disable PPC64LE continuations on AIX [v3]
In-Reply-To:
References:
Message-ID: <8ldwwf5Y0ewGjSPaptTgJMwRGsSMhlRqnu4TS5sprEs=.e949e0b6-efbe-4083-aacb-c15a3ca2d060@github.com>

On Thu, 8 Dec 2022 16:19:45 GMT, Martin Doerr wrote:

> https://github.com/openjdk/jdk20 is now open for P3 bug fixes. (They will get merged into this repo after some time.)

Thanks. I am closing this PR, and will re-open one on jdk20 after learning if the ppc.ad modification is feasible. The build is looking good so far, so I suspect it was my error that kept it from working the first time.

-------------

PR: https://git.openjdk.org/jdk/pull/11550

From tsteele at openjdk.org Thu Dec 8 16:26:19 2022
From: tsteele at openjdk.org (Tyler Steele)
Date: Thu, 8 Dec 2022 16:26:19 GMT
Subject: Withdrawn: 8298225: [AIX] Disable PPC64LE continuations on AIX
In-Reply-To:
References:
Message-ID:

On Wed, 7 Dec 2022 00:05:43 GMT, Tyler Steele wrote:

> This small change adds an import to the generated ad_ppc.cpp file to allow it to find Continuations::enabled, and sets VMContinuations to false on AIX.

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.org/jdk/pull/11550

From kvn at openjdk.org Thu Dec 8 17:58:17 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Thu, 8 Dec 2022 17:58:17 GMT
Subject: RFR: 8298353: C2 fails with assert(opaq->outcnt() == 1 && opaq->in(1) == limit) failed
In-Reply-To:
References:
Message-ID:

On Thu, 8 Dec 2022 16:04:16 GMT, Roland Westrelin wrote:

> After some unrolling, when C2 runs loop opts with split if enabled after CCP, the limit of the main loop of the counted loop (the second loop in the test) is: limit - 3
>
> That commons with the limit - 3 returned from the first loop.
> limit - 3 is thus in the first loop's body but only used outside of the loop. It has 3 uses: the return in the first loop, the OpaqueZeroTripGuard and the loop exit condition of the main loop. In the same pass of loop opts, limit - 3 is cloned out of the loop 3 times for its 3 uses and unrolling is attempted. Because the OpaqueZeroTripGuard and the loop exit condition now use 2 different nodes (until they common at next igvn), the assert fires.
>
> The fix I propose restores the behavior before the introduction of OpaqueZeroTripGuard, which is to not sink a node if it has an Opaque1 use.

The fix is good, but I suggest rebasing it onto the new https://github.com/openjdk/jdk20 fork before we start testing it.

-------------

Changes requested by kvn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/11596

From xliu at openjdk.org Thu Dec 8 18:17:01 2022
From: xliu at openjdk.org (Xin Liu)
Date: Thu, 8 Dec 2022 18:17:01 GMT
Subject: RFR: 8298320: Typo in the comment block of catch_inline_exception
Message-ID: <28BsntHiy-hfTs75vt4A6RD3g5RwcKu3EEClii24P1M=.7fab17bc-abd4-405f-b17a-016f615e6a79@github.com>

The following comment makes reference to 'Deutsch-Shiffman'. I believe it's a typo. It should be 'Schiffman' if the author intended to cite this paper:

> Deutsch, L. Peter, and Allan M. Schiffman. "Efficient implementation of the Smalltalk-80 system." Proceedings of the 11th ACM SIGACT-SIGPLAN symposium on Principles of programming languages. 1984.

I searched Google for 'Deutsch-Shiffman' and this paper is what it returned, which seems reasonable.

    // Case 2: we have some handlers, with loaded exception klasses that have
    // no subklasses. We do a Deutsch-Shiffman style type-check on the incoming
    // exception oop and branch to the handler directly.
    ...
    void Parse::catch_inline_exceptions(SafePointNode* ex_map) {

-------------

Commit messages:
 - 8298320: Typo in the comment block of catch_inline_exception

Changes: https://git.openjdk.org/jdk/pull/11598/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11598&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8298320
Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/11598.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/11598/head:pull/11598

PR: https://git.openjdk.org/jdk/pull/11598

From tsteele at openjdk.org Thu Dec 8 19:03:18 2022
From: tsteele at openjdk.org (Tyler Steele)
Date: Thu, 8 Dec 2022 19:03:18 GMT
Subject: [jdk20] RFR: 8298225: [AIX] Disable PPC64LE continuations on AIX
Message-ID:

This small change adds an import to ppc.ad to allow it to find Continuations::enabled, and sets VMContinuations to false on AIX.

Thanks to @TheRealMDoerr for suggestions on a previous version of this PR.

-------------

Commit messages:
 - Modify ppc.ad
 - Set VMContinuations to false on AIX

Changes: https://git.openjdk.org/jdk20/pull/4/files
Webrev: https://webrevs.openjdk.org/?repo=jdk20&pr=4&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8298225
Stats: 5 lines in 2 files changed: 4 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk20/pull/4.diff
Fetch: git fetch https://git.openjdk.org/jdk20 pull/4/head:pull/4

PR: https://git.openjdk.org/jdk20/pull/4

From qamai at openjdk.org Thu Dec 8 20:25:40 2022
From: qamai at openjdk.org (Quan Anh Mai)
Date: Thu, 8 Dec 2022 20:25:40 GMT
Subject: RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v14]
In-Reply-To:
References:
Message-ID:

On Tue, 6 Dec 2022 14:24:39 GMT, Quan Anh Mai wrote:

>> This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test
result in a GP register and do a `CmpI` to get back the flags. As a result, `VectorMask::alltrue` is compiled into machine codes: >> >> vptest xmm0, xmm1 >> jb if_true >> if_false: >> >> instead of: >> >> vptest xmm0, xmm1 >> setb r10 >> movzbl r10 >> testl r10 >> jne if_true >> if_false: >> >> The results of `jdk.incubator.vector.ArrayMismatchBenchmark` shows noticeable improvements: >> >> Before After >> Benchmark Prefix Size Mode Cnt Score Error Score Error Units Change >> ArrayMismatchBenchmark.mismatchVectorByte 0.5 9 thrpt 10 217345.383 ? 8316.444 222279.381 ? 2660.983 ops/ms +2.3% >> ArrayMismatchBenchmark.mismatchVectorByte 0.5 257 thrpt 10 113918.406 ? 1618.836 116268.691 ? 1291.899 ops/ms +2.1% >> ArrayMismatchBenchmark.mismatchVectorByte 0.5 100000 thrpt 10 702.066 ? 72.862 797.806 ? 16.429 ops/ms +13.6% >> ArrayMismatchBenchmark.mismatchVectorByte 1.0 9 thrpt 10 146096.564 ? 2401.258 145338.910 ? 687.453 ops/ms -0.5% >> ArrayMismatchBenchmark.mismatchVectorByte 1.0 257 thrpt 10 60598.181 ? 1259.397 69041.519 ? 1073.156 ops/ms +13.9% >> ArrayMismatchBenchmark.mismatchVectorByte 1.0 100000 thrpt 10 316.814 ? 10.975 408.770 ? 5.281 ops/ms +29.0% >> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 9 thrpt 10 195674.549 ? 1200.166 188482.433 ? 1872.076 ops/ms -3.7% >> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 257 thrpt 10 44357.169 ? 473.013 42293.411 ? 2838.255 ops/ms -4.7% >> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 100000 thrpt 10 68.199 ? 5.410 67.628 ? 3.241 ops/ms -0.8% >> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 9 thrpt 10 107722.450 ? 1677.607 111060.400 ? 982.230 ops/ms +3.1% >> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 257 thrpt 10 16692.645 ? 1002.599 21440.506 ? 1618.266 ops/ms +28.4% >> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 100000 thrpt 10 32.984 ? 0.548 33.202 ? 2.365 ops/ms +0.7% >> ArrayMismatchBenchmark.mismatchVectorInt 0.5 9 thrpt 10 335458.217 ? 3154.842 379944.254 ? 
5703.134 ops/ms +13.3% >> ArrayMismatchBenchmark.mismatchVectorInt 0.5 257 thrpt 10 58505.302 ? 786.312 56721.368 ? 2497.052 ops/ms -3.0% >> ArrayMismatchBenchmark.mismatchVectorInt 0.5 100000 thrpt 10 133.037 ? 11.415 139.537 ? 4.667 ops/ms +4.9% >> ArrayMismatchBenchmark.mismatchVectorInt 1.0 9 thrpt 10 117943.802 ? 2281.349 112409.365 ? 2110.055 ops/ms -4.7% >> ArrayMismatchBenchmark.mismatchVectorInt 1.0 257 thrpt 10 27060.015 ? 795.619 33756.613 ? 826.533 ops/ms +24.7% >> ArrayMismatchBenchmark.mismatchVectorInt 1.0 100000 thrpt 10 57.558 ? 8.927 66.951 ? 4.381 ops/ms +16.3% >> ArrayMismatchBenchmark.mismatchVectorLong 0.5 9 thrpt 10 182963.715 ? 1042.497 182438.405 ? 2120.832 ops/ms -0.3% >> ArrayMismatchBenchmark.mismatchVectorLong 0.5 257 thrpt 10 36672.215 ? 614.821 35397.398 ? 1609.235 ops/ms -3.5% >> ArrayMismatchBenchmark.mismatchVectorLong 0.5 100000 thrpt 10 66.438 ? 2.142 65.427 ? 2.270 ops/ms -1.5% >> ArrayMismatchBenchmark.mismatchVectorLong 1.0 9 thrpt 10 110393.047 ? 497.853 115165.845 ? 5381.674 ops/ms +4.3% >> ArrayMismatchBenchmark.mismatchVectorLong 1.0 257 thrpt 10 14720.765 ? 661.350 19871.096 ? 201.464 ops/ms +35.0% >> ArrayMismatchBenchmark.mismatchVectorLong 1.0 100000 thrpt 10 30.760 ? 0.821 31.933 ? 1.352 ops/ms +3.8% >> >> I have not been able to conduct throughout testing on AVX512 and Aarch64 so any help would be invaluable. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > fix test Thank everyone for your kind reviews and suggestions. 
------------- PR: https://git.openjdk.org/jdk/pull/9855 From qamai at openjdk.org Thu Dec 8 20:28:35 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 8 Dec 2022 20:28:35 GMT Subject: Integrated: 8292289: [vectorapi] Improve the implementation of VectorTestNode In-Reply-To: References: Message-ID: <-MU2RIz9khfb6sbHP_8eKUMut_PsmYlLteCRL8f2FnE=.8488795b-80be-4ac7-a223-4211740d8800@github.com> On Fri, 12 Aug 2022 13:50:29 GMT, Quan Anh Mai wrote: > This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and do a `CmpI` to get back the flags. As a result, `VectorMask::alltrue` is compiled into machine codes: > > vptest xmm0, xmm1 > jb if_true > if_false: > > instead of: > > vptest xmm0, xmm1 > setb r10 > movzbl r10 > testl r10 > jne if_true > if_false: > > The results of `jdk.incubator.vector.ArrayMismatchBenchmark` shows noticeable improvements: > > Before After > Benchmark Prefix Size Mode Cnt Score Error Score Error Units Change > ArrayMismatchBenchmark.mismatchVectorByte 0.5 9 thrpt 10 217345.383 ? 8316.444 222279.381 ? 2660.983 ops/ms +2.3% > ArrayMismatchBenchmark.mismatchVectorByte 0.5 257 thrpt 10 113918.406 ? 1618.836 116268.691 ? 1291.899 ops/ms +2.1% > ArrayMismatchBenchmark.mismatchVectorByte 0.5 100000 thrpt 10 702.066 ? 72.862 797.806 ? 16.429 ops/ms +13.6% > ArrayMismatchBenchmark.mismatchVectorByte 1.0 9 thrpt 10 146096.564 ? 2401.258 145338.910 ? 687.453 ops/ms -0.5% > ArrayMismatchBenchmark.mismatchVectorByte 1.0 257 thrpt 10 60598.181 ? 1259.397 69041.519 ? 1073.156 ops/ms +13.9% > ArrayMismatchBenchmark.mismatchVectorByte 1.0 100000 thrpt 10 316.814 ? 10.975 408.770 ? 5.281 ops/ms +29.0% > ArrayMismatchBenchmark.mismatchVectorDouble 0.5 9 thrpt 10 195674.549 ? 1200.166 188482.433 ? 
1872.076 ops/ms -3.7% > ArrayMismatchBenchmark.mismatchVectorDouble 0.5 257 thrpt 10 44357.169 ? 473.013 42293.411 ? 2838.255 ops/ms -4.7% > ArrayMismatchBenchmark.mismatchVectorDouble 0.5 100000 thrpt 10 68.199 ? 5.410 67.628 ? 3.241 ops/ms -0.8% > ArrayMismatchBenchmark.mismatchVectorDouble 1.0 9 thrpt 10 107722.450 ? 1677.607 111060.400 ? 982.230 ops/ms +3.1% > ArrayMismatchBenchmark.mismatchVectorDouble 1.0 257 thrpt 10 16692.645 ? 1002.599 21440.506 ? 1618.266 ops/ms +28.4% > ArrayMismatchBenchmark.mismatchVectorDouble 1.0 100000 thrpt 10 32.984 ? 0.548 33.202 ? 2.365 ops/ms +0.7% > ArrayMismatchBenchmark.mismatchVectorInt 0.5 9 thrpt 10 335458.217 ? 3154.842 379944.254 ? 5703.134 ops/ms +13.3% > ArrayMismatchBenchmark.mismatchVectorInt 0.5 257 thrpt 10 58505.302 ? 786.312 56721.368 ? 2497.052 ops/ms -3.0% > ArrayMismatchBenchmark.mismatchVectorInt 0.5 100000 thrpt 10 133.037 ? 11.415 139.537 ? 4.667 ops/ms +4.9% > ArrayMismatchBenchmark.mismatchVectorInt 1.0 9 thrpt 10 117943.802 ? 2281.349 112409.365 ? 2110.055 ops/ms -4.7% > ArrayMismatchBenchmark.mismatchVectorInt 1.0 257 thrpt 10 27060.015 ? 795.619 33756.613 ? 826.533 ops/ms +24.7% > ArrayMismatchBenchmark.mismatchVectorInt 1.0 100000 thrpt 10 57.558 ? 8.927 66.951 ? 4.381 ops/ms +16.3% > ArrayMismatchBenchmark.mismatchVectorLong 0.5 9 thrpt 10 182963.715 ? 1042.497 182438.405 ? 2120.832 ops/ms -0.3% > ArrayMismatchBenchmark.mismatchVectorLong 0.5 257 thrpt 10 36672.215 ? 614.821 35397.398 ? 1609.235 ops/ms -3.5% > ArrayMismatchBenchmark.mismatchVectorLong 0.5 100000 thrpt 10 66.438 ? 2.142 65.427 ? 2.270 ops/ms -1.5% > ArrayMismatchBenchmark.mismatchVectorLong 1.0 9 thrpt 10 110393.047 ? 497.853 115165.845 ? 5381.674 ops/ms +4.3% > ArrayMismatchBenchmark.mismatchVectorLong 1.0 257 thrpt 10 14720.765 ? 661.350 19871.096 ? 201.464 ops/ms +35.0% > ArrayMismatchBenchmark.mismatchVectorLong 1.0 100000 thrpt 10 30.760 ? 0.821 31.933 ? 
1.352 ops/ms +3.8% > > I have not been able to conduct throughout testing on AVX512 and Aarch64 so any help would be invaluable. Thank you very much. This pull request has now been integrated. Changeset: 3dfadeeb Author: Quan Anh Mai URL: https://git.openjdk.org/jdk/commit/3dfadeebd023efb03a400f2b2656567a4154421a Stats: 494 lines in 23 files changed: 215 ins; 170 del; 109 mod 8292289: [vectorapi] Improve the implementation of VectorTestNode Reviewed-by: xgong, kvn ------------- PR: https://git.openjdk.org/jdk/pull/9855 From gcao at openjdk.org Thu Dec 8 23:03:12 2022 From: gcao at openjdk.org (Gui Cao) Date: Thu, 8 Dec 2022 23:03:12 GMT Subject: RFR: 8298345: Fix another two C2 IR matching tests for RISC-V In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 03:50:53 GMT, Fei Yang wrote: >> Fix two IR matching tests that failed on RISC-V. >> >> Vector api Node will be matched only when UseRVV is enabled: >> - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java >> - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java >> >> Please take a look and have some reviews. Thanks a lot. 
>> >> ## Testing: >> - fastdebug on unmatched board without support for RVV >> - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - no tests selected as expected >> - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java - no tests selected as expected >> - fastdebug with -XX:+UseRVV on QEMU >> - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - The C2 graph generated by the test is as expected >> - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java -The C2 graph generated by the test is as expected >> - fastdebug with -XX:-UseRVV on QEMU >> - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - no tests selected as expected >> - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java - no tests selected as expected > > Looks reasonable to me. Thanks. @RealFYang @DingliZhang @feilongjiang Thanks for the review. ------------- PR: https://git.openjdk.org/jdk/pull/11577 From dholmes at openjdk.org Fri Dec 9 00:14:49 2022 From: dholmes at openjdk.org (David Holmes) Date: Fri, 9 Dec 2022 00:14:49 GMT Subject: RFR: 8298380: Clean up redundant array length checks in JDK code base In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 12:37:17 GMT, Sergey Tsypanov wrote: > Newer version of IntelliJ IDEA introduces new [inspection](https://youtrack.jetbrains.com/issue/IDEA-301797/IDEA-should-report-redundant-array-length-check-in-certain-cases) detecting redundant array length check in snippets like > > void iterate(T[] items) { > if (items.length == 0) { > return; > } > for (T item : items) { > //... > } > } > > Here > > if (items.length == 0) { > return; > } > > is redundant and can be removed as length check is performed by for-each loop. These all seem fine to me. You can count this as the review for hotspot and serviceability. :) ------------- Marked as reviewed by dholmes (Reviewer). 
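
To make the quoted claim concrete, here is a minimal sketch in Java comparing a method that keeps the explicit `items.length == 0` guard with one that drops it. The class and method names are hypothetical and are not taken from the PR. The guard is only redundant when, as in the quoted snippet, nothing after the loop needs to be skipped for an empty array; both variants still throw `NullPointerException` for a `null` array.

```java
import java.util.ArrayList;
import java.util.List;

public class RedundantLengthCheck {
    // Variant with the explicit guard, the shape the inspection reports.
    static <T> List<T> collectWithCheck(T[] items) {
        List<T> out = new ArrayList<>();
        if (items.length == 0) {
            return out;
        }
        for (T item : items) {
            out.add(item);
        }
        return out;
    }

    // Variant without the guard: a for-each loop over an empty array
    // simply executes zero iterations, so the result is the same.
    static <T> List<T> collectWithoutCheck(T[] items) {
        List<T> out = new ArrayList<>();
        for (T item : items) {
            out.add(item);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] empty = {};
        String[] two = {"a", "b"};
        // Both variants agree on empty and non-empty input.
        System.out.println(collectWithCheck(empty).equals(collectWithoutCheck(empty)));
        System.out.println(collectWithCheck(two).equals(collectWithoutCheck(two)));
    }
}
```

Both lines print `true`. If code after the loop had side effects that must not run for empty input, the guard would no longer be redundant.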
PR: https://git.openjdk.org/jdk/pull/11589 From amenkov at openjdk.org Fri Dec 9 00:14:50 2022 From: amenkov at openjdk.org (Alex Menkov) Date: Fri, 9 Dec 2022 00:14:50 GMT Subject: RFR: 8298380: Clean up redundant array length checks in JDK code base In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 12:37:17 GMT, Sergey Tsypanov wrote: > Newer version of IntelliJ IDEA introduces new [inspection](https://youtrack.jetbrains.com/issue/IDEA-301797/IDEA-should-report-redundant-array-length-check-in-certain-cases) detecting redundant array length check in snippets like > > void iterate(T[] items) { > if (items.length == 0) { > return; > } > for (T item : items) { > //... > } > } > > Here > > if (items.length == 0) { > return; > } > > is redundant and can be removed as length check is performed by for-each loop. Marked as reviewed by amenkov (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11589 From serb at openjdk.org Fri Dec 9 00:46:18 2022 From: serb at openjdk.org (Sergey Bylokhov) Date: Fri, 9 Dec 2022 00:46:18 GMT Subject: RFR: 8298380: Clean up redundant array length checks in JDK code base In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 12:37:17 GMT, Sergey Tsypanov wrote: > Newer version of IntelliJ IDEA introduces new [inspection](https://youtrack.jetbrains.com/issue/IDEA-301797/IDEA-should-report-redundant-array-length-check-in-certain-cases) detecting redundant array length check in snippets like > > void iterate(T[] items) { > if (items.length == 0) { > return; > } > for (T item : items) { > //... > } > } > > Here > > if (items.length == 0) { > return; > } > > is redundant and can be removed as length check is performed by for-each loop. The "client" changes in src/java.desktop looks fine ------------- Marked as reviewed by serb (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11589 From epeter at openjdk.org Fri Dec 9 05:52:28 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 9 Dec 2022 05:52:28 GMT Subject: RFR: 8296389: C2: PhaseCFG::convert_NeverBranch_to_Goto must handle both orders of successors In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 10:40:18 GMT, Tobias Hartmann wrote: >> **Targetted for JDK21** >> >> The code in `PhaseCFG::convert_NeverBranch_to_Goto` looks like it is ready to have `idx == 1`, but it is not. >> >> We would read `succ` from `_succs[1]`. >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L626 >> >> Then overwrite `_succs[0]` with `succ`, and shorten the array. >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L635-L636 >> >> And finally attempt to read `dead` from `_succs[0]`, where the dead block used to be, but was just overwritten. >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L645 >> >> **Solution** >> Read `dead` before overwriting it. I also made it more robust by going via the projections, and not assuming that the projections and successors are ordered equally (though that is probably guaranteed by the matching traversal). >> >> **Why did we never hit this bug before?** >> Normal case: during matching, "succ" projection is added as output of NeverBranch before the "dead" projection leading to Halt. Thus, the outputs of NeverBranch are normally [[ "succ", "dead" ]], hence `idx == 0`. >> Details: During DFS, usually we go from Halt to NeverBranch. Then via Region/Loop, take backedge, and find the "succ" edge. We already have its inputs (NeverBranch), thus we can now post-visit the live edge, and attach it to the NeverBranch first. 
Later, once we have processed the whole infinite loop, we post-visit out of NeverBranch to the "dead" projection edge, which we attach second. >> >> Rare case: "dead" projection is first attached to NeverBranch, and "succ" projection is added second. We have [[ "dead", "succ" ]], hence `idx == 1`. >> We have a peeled infinite loop. The NeverBranch of the peeled iteration is first visited via the "dead" projection from HaltNode. Since the peeled iteration has no backedge, we do not visit the "succ" projection yet, but instead attach "dead" projection to HaltNode already once we are done visiting everything above. Later, we come from the peeled loop's NeverBranch exit, to the "succ" projection of the peeled iteration's NeverBranch, and attach the "succ" projection. >> >> ![image](https://user-images.githubusercontent.com/32593061/205299027-0e8e1d46-a49c-48c6-81b4-dfe83d8236ec.png) > > test/hotspot/jtreg/compiler/loopopts/TestPhaseCFGNeverBranchToGotoMain.java line 28: > >> 26: * @bug 8296389 >> 27: * @summary Peeling of Irreducible loop can lead to NeverBranch being visited from either side >> 28: * @run main/othervm -Xcomp -Xbatch -XX:-TieredCompilation -XX:PerMethodTrapLimit=0 > > Suggestion: > > * @run main/othervm -Xcomp -XX:-TieredCompilation -XX:PerMethodTrapLimit=0 > > > `-Xcomp` implies `-Xbatch` ? > test/hotspot/jtreg/compiler/loopopts/TestPhaseCFGNeverBranchToGotoMain.java line 38: > >> 36: * @compile TestPhaseCFGNeverBranchToGoto.jasm >> 37: * @summary Peeling of Irreducible loop can lead to NeverBranch being visited from either side >> 38: * @run main/othervm -Xcomp -Xbatch -XX:-TieredCompilation -XX:PerMethodTrapLimit=0 > > Suggestion: > > * @run main/othervm -Xcomp -XX:-TieredCompilation -XX:PerMethodTrapLimit=0 ? 
> test/hotspot/jtreg/compiler/loopopts/TestPhaseCFGNeverBranchToGotoMain.java line 48: > >> 46: test(false, false); >> 47: } >> 48: public static void test(boolean flag1, boolean flag2) { > > Suggestion: > > } > > public static void test(boolean flag1, boolean flag2) { ? ------------- PR: https://git.openjdk.org/jdk/pull/11481 From vtewari at openjdk.org Fri Dec 9 06:24:54 2022 From: vtewari at openjdk.org (Vyom Tewari) Date: Fri, 9 Dec 2022 06:24:54 GMT Subject: RFR: 8298380: Clean up redundant array length checks in JDK code base In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 12:37:17 GMT, Sergey Tsypanov wrote: > Newer version of IntelliJ IDEA introduces new [inspection](https://youtrack.jetbrains.com/issue/IDEA-301797/IDEA-should-report-redundant-array-length-check-in-certain-cases) detecting redundant array length check in snippets like > > void iterate(T[] items) { > if (items.length == 0) { > return; > } > for (T item : items) { > //... > } > } > > Here > > if (items.length == 0) { > return; > } > > is redundant and can be removed as length check is performed by for-each loop. java.base changes looks ok to me. ------------- Marked as reviewed by vtewari (Committer). PR: https://git.openjdk.org/jdk/pull/11589 From epeter at openjdk.org Fri Dec 9 07:07:14 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 9 Dec 2022 07:07:14 GMT Subject: RFR: 8296389: C2: PhaseCFG::convert_NeverBranch_to_Goto must handle both orders of successors [v2] In-Reply-To: References: Message-ID: > **Targetted for JDK21** > > The code in `PhaseCFG::convert_NeverBranch_to_Goto` looks like it is ready to have `idx == 1`, but it is not. > > We would read `succ` from `_succs[1]`. > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L626 > > Then overwrite `_succs[0]` with `succ`, and shorten the array. 
> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L635-L636 > > And finally attempt to read `dead` from `_succs[0]`, where the dead block used to be, but was just overwritten. > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L645 > > **Solution** > Read `dead` before overwriting it. I also made it more robust by going via the projections, and not assuming that the projections and successors are ordered equally (though that is probably guaranteed by the matching traversal). > > **Why did we never hit this bug before?** > Normal case: during matching, "succ" projection is added as output of NeverBranch before the "dead" projection leading to Halt. Thus, the outputs of NeverBranch are normally [[ "succ", "dead" ]], hence `idx == 0`. > Details: During DFS, usually we go from Halt to NeverBranch. Then via Region/Loop, take backedge, and find the "succ" edge. We already have its inputs (NeverBranch), thus we can now post-visit the live edge, and attach it to the NeverBranch first. Later, once we have processed the whole infinite loop, we post-visit out of NeverBranch to the "dead" projection edge, which we attach second. > > Rare case: "dead" projection is first attached to NeverBranch, and "succ" projection is added second. We have [[ "dead", "succ" ]], hence `idx == 1`. > We have a peeled infinite loop. The NeverBranch of the peeled iteration is first visited via the "dead" projection from HaltNode. Since the peeled iteration has no backedge, we do not visit the "succ" projection yet, but instead attach "dead" projection to HaltNode already once we are done visiting everything above. Later, we come from the peeled loop's NeverBranch exit, to the "succ" projection of the peeled iteration's NeverBranch, and attach the "succ" projection. 
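
The ordering problem described above can be modelled outside HotSpot. The following Java sketch is only an illustration under stated assumptions: the successor array is a plain `String[]`, the dead successor is assumed to sit in slot `1 - idx`, and the class and method names are hypothetical. It is not the code in `block.cpp`.

```java
public class SuccessorOrderSketch {
    // Problematic order: slot 0 is overwritten with the live successor
    // before the dead successor is read. (Assumption: the dead block is
    // looked up in slot 1 - idx; the real code would also reduce the
    // successor count to 1 at this point.)
    static String deadReadAfterOverwrite(String[] succs, int idx) {
        String succ = succs[idx];
        succs[0] = succ;
        return succs[1 - idx]; // for idx == 1 this is slot 0, which now holds succ
    }

    // Order proposed in the description: read the dead successor first.
    static String deadReadBeforeOverwrite(String[] succs, int idx) {
        String succ = succs[idx];
        String dead = succs[1 - idx];
        succs[0] = succ;
        return dead;
    }

    public static void main(String[] args) {
        // Normal case: outputs are [succ, dead], so idx == 0.
        System.out.println(deadReadAfterOverwrite(new String[] {"succ", "dead"}, 0));
        System.out.println(deadReadBeforeOverwrite(new String[] {"succ", "dead"}, 0));
        // Rare case: outputs are [dead, succ], so idx == 1.
        System.out.println(deadReadAfterOverwrite(new String[] {"dead", "succ"}, 1));
        System.out.println(deadReadBeforeOverwrite(new String[] {"dead", "succ"}, 1));
    }
}
```

In this model only the third call returns the live successor instead of the dead one, which matches the observation that the bug stays hidden as long as `idx == 0`.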
> > ![image](https://user-images.githubusercontent.com/32593061/205299027-0e8e1d46-a49c-48c6-81b4-dfe83d8236ec.png) Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: - Refactoring a bit after review suggestions - Merge branch 'master' into JDK-8296389 - replace tabs with spaces - 8296389: C2: PhaseCFG::convert_NeverBranch_to_Goto must handle both orders of successors ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11481/files - new: https://git.openjdk.org/jdk/pull/11481/files/c826a8ff..f1d25d0e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11481&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11481&range=00-01 Stats: 41228 lines in 988 files changed: 25126 ins; 9939 del; 6163 mod Patch: https://git.openjdk.org/jdk/pull/11481.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11481/head:pull/11481 PR: https://git.openjdk.org/jdk/pull/11481 From epeter at openjdk.org Fri Dec 9 07:07:16 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 9 Dec 2022 07:07:16 GMT Subject: RFR: 8296389: C2: PhaseCFG::convert_NeverBranch_to_Goto must handle both orders of successors [v2] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 11:33:12 GMT, Tobias Hartmann wrote: >> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
>> The pull request contains four additional commits since the last revision:
>>
>> - Refactoring a bit after review suggestions
>> - Merge branch 'master' into JDK-8296389
>> - replace tabs with spaces
>> - 8296389: C2: PhaseCFG::convert_NeverBranch_to_Goto must handle both orders of successors
>
> src/hotspot/share/opto/block.cpp line 626:
>
>> 624: int end_idx = b->end_idx();
>> 625: int taken_idx = b->get_node(end_idx+1)->as_Proj()->_con;
>> 626: ProjNode* alwaysTaken = b->get_node(end_idx + 1 + taken_idx)->as_Proj();
>
> I find this code rather confusing. Since it's guaranteed that `alwaysTaken->_con == 0`, can't we simply do something like this?
>
> ProjNode* alwaysTaken = b->get_node(end_idx)->as_MultiBranch()->proj_out(0);
> Block* succ = get_block_for_node(alwaysTaken->unique_ctrl_out_or_null());

?

-------------

PR: https://git.openjdk.org/jdk/pull/11481

From epeter at openjdk.org Fri Dec 9 07:11:08 2022
From: epeter at openjdk.org (Emanuel Peter)
Date: Fri, 9 Dec 2022 07:11:08 GMT
Subject: RFR: 8257197: Add additional verification code to PhaseCCP [v4]
In-Reply-To:
References:
Message-ID:

On Thu, 8 Dec 2022 12:27:37 GMT, Tobias Hartmann wrote:

>> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Apply suggestions from code review
>>
>> Thanks Christian
>>
>> Co-authored-by: Christian Hagedorn
>
> Looks good!

Thanks @TobiHartmann @chhagedorn @vnkozlov for the reviews!

-------------

PR: https://git.openjdk.org/jdk/pull/11529

From epeter at openjdk.org Fri Dec 9 07:16:24 2022
From: epeter at openjdk.org (Emanuel Peter)
Date: Fri, 9 Dec 2022 07:16:24 GMT
Subject: Integrated: 8257197: Add additional verification code to PhaseCCP
In-Reply-To:
References:
Message-ID:

On Tue, 6 Dec 2022 08:02:12 GMT, Emanuel Peter wrote:

> We have had many bugs that could be traced back to missing optimizations during PhaseCCP.
> Often the problem is that a node `x` has an input (of input of input) `y` which is modified, but `y` does not notify `x` (does not push it to the worklist). For one, this is a missed opportunity to optimize, but it can also lead to failed assumptions later: often we assume that all that can be optimized is already optimized.
> This verification helps us debug faster, and can also help when `Value` optimizations suddenly do further-reaching traversals, which would require further-reaching notifications.
>
> Sadly, the verification is not total: we have some exceptions, especially for `LoadNode`, which performs a walk up the memory inputs, which can go arbitrarily far, to look for related `StoreNode`s. Notification would thus have to put very many nodes on the worklist. The question is whether the potential additional optimization is worth the compile time. If we decide it is, then we might want to implement a listener-style notification: when a node visits inputs during `Value`, it could subscribe to all (or the relevant) visited input nodes for future updates. Currently, we mostly just do fixed 1-hop or 2-hop notification of output nodes.
>
> FYI: I plan to do a similar verification, and a refactoring of `PhaseCCP::push_child_nodes_to_worklist` and `PhaseIterGVN::add_users_to_worklist` in [JDK-8298094](https://bugs.openjdk.org/browse/JDK-8298094).

This pull request has now been integrated.
Changeset: 11aece21 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/11aece21f4eb5b18af357b265bc27b80bcdbfbcb Stats: 52 lines in 2 files changed: 52 ins; 0 del; 0 mod 8257197: Add additional verification code to PhaseCCP Reviewed-by: chagedorn, kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/11529 From chagedorn at openjdk.org Fri Dec 9 07:26:54 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 9 Dec 2022 07:26:54 GMT Subject: RFR: 8298353: C2 fails with assert(opaq->outcnt() == 1 && opaq->in(1) == limit) failed In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 16:04:16 GMT, Roland Westrelin wrote: > After some unrolling, when C2 runs loop opts with split if enabled > after CCP, the limit of the main loop of the counted loop (the second > loop in the test) is: limit - 3 > > That commons with the limit - 3 returned from the first loop. limit - > 3 is thus in the first loop's body but only used outside of the > loop. It has 3 uses: The return in the first loop, the > OpaqueZeroTripGuard and loop exit conditionof the main loop. In the > same pass of loop opts, limit-3 is cloned out of the loop 3 times for > its 3 uses and unrolling is attempted. Because the OpaqueZeroTripGuard > and the loop exit condition now use 2 different nodes (until they > common at next igvn), the assert fires. > > The fix I propose restores the behavior before the introduction of > OpaqueZeroTripGuard which is to not sink a node if it has an Opaque1 > use. The fix looks good! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11596 From roland at openjdk.org Fri Dec 9 08:45:19 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 9 Dec 2022 08:45:19 GMT Subject: RFR: 8298353: C2 fails with assert(opaq->outcnt() == 1 && opaq->in(1) == limit) failed In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 17:56:08 GMT, Vladimir Kozlov wrote: >> After some unrolling, when C2 runs loop opts with split if enabled >> after CCP, the limit of the main loop of the counted loop (the second >> loop in the test) is: limit - 3 >> >> That commons with the limit - 3 returned from the first loop. limit - >> 3 is thus in the first loop's body but only used outside of the >> loop. It has 3 uses: The return in the first loop, the >> OpaqueZeroTripGuard and loop exit conditionof the main loop. In the >> same pass of loop opts, limit-3 is cloned out of the loop 3 times for >> its 3 uses and unrolling is attempted. Because the OpaqueZeroTripGuard >> and the loop exit condition now use 2 different nodes (until they >> common at next igvn), the assert fires. >> >> The fix I propose restores the behavior before the introduction of >> OpaqueZeroTripGuard which is to not sink a node if it has an Opaque1 >> use. > > The fix is good but I suggest to rebase it to new https://github.com/openjdk/jdk20 fork before we start testing it. 
Thanks for looking at this @vnkozlov @chhagedorn Let me close that one and open jdk 20 PR ------------- PR: https://git.openjdk.org/jdk/pull/11596 From roland at openjdk.org Fri Dec 9 08:45:22 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 9 Dec 2022 08:45:22 GMT Subject: Withdrawn: 8298353: C2 fails with assert(opaq->outcnt() == 1 && opaq->in(1) == limit) failed In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 16:04:16 GMT, Roland Westrelin wrote: > After some unrolling, when C2 runs loop opts with split if enabled > after CCP, the limit of the main loop of the counted loop (the second > loop in the test) is: limit - 3 > > That commons with the limit - 3 returned from the first loop. limit - > 3 is thus in the first loop's body but only used outside of the > loop. It has 3 uses: The return in the first loop, the > OpaqueZeroTripGuard and loop exit conditionof the main loop. In the > same pass of loop opts, limit-3 is cloned out of the loop 3 times for > its 3 uses and unrolling is attempted. Because the OpaqueZeroTripGuard > and the loop exit condition now use 2 different nodes (until they > common at next igvn), the assert fires. > > The fix I propose restores the behavior before the introduction of > OpaqueZeroTripGuard which is to not sink a node if it has an Opaque1 > use. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/11596 From roland at openjdk.org Fri Dec 9 08:51:16 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 9 Dec 2022 08:51:16 GMT Subject: [jdk20] RFR: 8298353: C2 fails with assert(opaq->outcnt() == 1 && opaq->in(1) == limit) failed Message-ID: After some unrolling, when C2 runs loop opts with split if enabled after CCP, the limit of the main loop of the counted loop (the second loop in the test) is: limit - 3 That commons with the limit - 3 returned from the first loop. limit - 3 is thus in the first loop's body but only used outside of the loop. 
It has 3 uses: The return in the first loop, the OpaqueZeroTripGuard and loop exit condition of the main loop. In the same pass of loop opts, limit-3 is cloned out of the loop 3 times for its 3 uses and unrolling is attempted. Because the OpaqueZeroTripGuard and the loop exit condition now use 2 different nodes (until they common at next igvn), the assert fires. The fix I propose restores the behavior before the introduction of OpaqueZeroTripGuard which is to not sink a node if it has an Opaque1 use. ------------- Commit messages: - fix - test Changes: https://git.openjdk.org/jdk20/pull/6/files Webrev: https://webrevs.openjdk.org/?repo=jdk20&pr=6&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298353 Stats: 80 lines in 2 files changed: 79 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk20/pull/6.diff Fetch: git fetch https://git.openjdk.org/jdk20 pull/6/head:pull/6 PR: https://git.openjdk.org/jdk20/pull/6 From chagedorn at openjdk.org Fri Dec 9 08:57:18 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 9 Dec 2022 08:57:18 GMT Subject: [jdk20] RFR: 8298353: C2 fails with assert(opaq->outcnt() == 1 && opaq->in(1) == limit) failed In-Reply-To: References: Message-ID: <5nr4_aKbkmhbZUTuSZILhiGgiANHpyhmrxmrR-jwWW4=.0e425450-ecc6-4025-886a-f90069b72e33@github.com> On Fri, 9 Dec 2022 08:43:20 GMT, Roland Westrelin wrote: > After some unrolling, when C2 runs loop opts with split if enabled > after CCP, the limit of the main loop of the counted loop (the second > loop in the test) is: limit - 3 > > That commons with the limit - 3 returned from the first loop. limit - > 3 is thus in the first loop's body but only used outside of the > loop. It has 3 uses: The return in the first loop, the > OpaqueZeroTripGuard and loop exit condition of the main loop. In the > same pass of loop opts, limit-3 is cloned out of the loop 3 times for > its 3 uses and unrolling is attempted.
Because the OpaqueZeroTripGuard > and the loop exit condition now use 2 different nodes (until they > common at next igvn), the assert fires. > > The fix I propose restores the behavior before the introduction of > OpaqueZeroTripGuard which is to not sink a node if it has an Opaque1 > use. Marked as reviewed by chagedorn (Reviewer). ------------- PR: https://git.openjdk.org/jdk20/pull/6 From epeter at openjdk.org Fri Dec 9 09:02:27 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 9 Dec 2022 09:02:27 GMT Subject: RFR: 8296389: C2: PhaseCFG::convert_NeverBranch_to_Goto must handle both orders of successors [v3] In-Reply-To: References: Message-ID: > The code in `PhaseCFG::convert_NeverBranch_to_Goto` looks like it is ready to have `idx == 1`, but it is not. > > We would read `succ` from `_succs[1]`. > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L626 > > Then overwrite `_succs[0]` with `succ`, and shorten the array. > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L635-L636 > > And finally attempt to read `dead` from `_succs[0]`, where the dead block used to be, but was just overwritten. > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L645 > > **Solution** > Read `dead` before overwriting it. I also made it more robust by going via the projections, and not assuming that the projections and successors are ordered equally (though that is probably guaranteed by the matching traversal). > > **Refactoring: added class id for NeverBranch** > I also added the class id for NeverBranch, and replaced all `Op_NeverBranch` checks with `is_NeverBranch()`. > > **Why did we never hit this bug before?** > Normal case: during matching, "succ" projection is added as output of NeverBranch before the "dead" projection leading to Halt. 
Thus, the outputs of NeverBranch are normally [[ "succ", "dead" ]], hence `idx == 0`. > Details: During DFS, usually we go from Halt to NeverBranch. Then via Region/Loop, take backedge, and find the "succ" edge. We already have its inputs (NeverBranch), thus we can now post-visit the live edge, and attach it to the NeverBranch first. Later, once we have processed the whole infinite loop, we post-visit out of NeverBranch to the "dead" projection edge, which we attach second. > > Rare case: "dead" projection is first attached to NeverBranch, and "succ" projection is added second. We have [[ "dead", "succ" ]], hence `idx == 1`. > We have a peeled infinite loop. The NeverBranch of the peeled iteration is first visited via the "dead" projection from HaltNode. Since the peeled iteration has no backedge, we do not visit the "succ" projection yet, but instead attach "dead" projection to HaltNode already once we are done visiting everything above. Later, we come from the peeled loop's NeverBranch exit, to the "succ" projection of the peeled iteration's NeverBranch, and attach the "succ" projection. 
> > ![image](https://user-images.githubusercontent.com/32593061/205299027-0e8e1d46-a49c-48c6-81b4-dfe83d8236ec.png) Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: simplify with get_block_for_node ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11481/files - new: https://git.openjdk.org/jdk/pull/11481/files/f1d25d0e..55ce9581 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11481&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11481&range=01-02 Stats: 9 lines in 1 file changed: 0 ins; 5 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/11481.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11481/head:pull/11481 PR: https://git.openjdk.org/jdk/pull/11481 From thartmann at openjdk.org Fri Dec 9 10:01:26 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 9 Dec 2022 10:01:26 GMT Subject: [jdk20] RFR: 8298353: C2 fails with assert(opaq->outcnt() == 1 && opaq->in(1) == limit) failed In-Reply-To: References: Message-ID: On Fri, 9 Dec 2022 08:43:20 GMT, Roland Westrelin wrote: > After some unrolling, when C2 runs loop opts with split if enabled > after CCP, the limit of the main loop of the counted loop (the second > loop in the test) is: limit - 3 > > That commons with the limit - 3 returned from the first loop. limit - > 3 is thus in the first loop's body but only used outside of the > loop. It has 3 uses: The return in the first loop, the > OpaqueZeroTripGuard and loop exit conditionof the main loop. In the > same pass of loop opts, limit-3 is cloned out of the loop 3 times for > its 3 uses and unrolling is attempted. Because the OpaqueZeroTripGuard > and the loop exit condition now use 2 different nodes (until they > common at next igvn), the assert fires. > > The fix I propose restores the behavior before the introduction of > OpaqueZeroTripGuard which is to not sink a node if it has an Opaque1 > use. The fix looks good and trivial. 
We'll run some quick testing before integration. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk20/pull/6 From thartmann at openjdk.org Fri Dec 9 10:11:05 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 9 Dec 2022 10:11:05 GMT Subject: RFR: 8296389: C2: PhaseCFG::convert_NeverBranch_to_Goto must handle both orders of successors [v3] In-Reply-To: References: Message-ID: On Fri, 9 Dec 2022 09:02:27 GMT, Emanuel Peter wrote: >> The code in `PhaseCFG::convert_NeverBranch_to_Goto` looks like it is ready to have `idx == 1`, but it is not. >> >> We would read `succ` from `_succs[1]`. >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L626 >> >> Then overwrite `_succs[0]` with `succ`, and shorten the array. >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L635-L636 >> >> And finally attempt to read `dead` from `_succs[0]`, where the dead block used to be, but was just overwritten. >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L645 >> >> **Solution** >> Read `dead` before overwriting it. I also made it more robust by going via the projections, and not assuming that the projections and successors are ordered equally (though that is probably guaranteed by the matching traversal). >> >> **Refactoring: added class id for NeverBranch** >> I also added the class id for NeverBranch, and replaced all `Op_NeverBranch` checks with `is_NeverBranch()`. >> >> **Why did we never hit this bug before?** >> Normal case: during matching, "succ" projection is added as output of NeverBranch before the "dead" projection leading to Halt. Thus, the outputs of NeverBranch are normally [[ "succ", "dead" ]], hence `idx == 0`. >> Details: During DFS, usually we go from Halt to NeverBranch. Then via Region/Loop, take backedge, and find the "succ" edge. 
We already have its inputs (NeverBranch), thus we can now post-visit the live edge, and attach it to the NeverBranch first. Later, once we have processed the whole infinite loop, we post-visit out of NeverBranch to the "dead" projection edge, which we attach second. >> >> Rare case: "dead" projection is first attached to NeverBranch, and "succ" projection is added second. We have [[ "dead", "succ" ]], hence `idx == 1`. >> We have a peeled infinite loop. The NeverBranch of the peeled iteration is first visited via the "dead" projection from HaltNode. Since the peeled iteration has no backedge, we do not visit the "succ" projection yet, but instead attach "dead" projection to HaltNode already once we are done visiting everything above. Later, we come from the peeled loop's NeverBranch exit, to the "succ" projection of the peeled iteration's NeverBranch, and attach the "succ" projection. >> >> ![image](https://user-images.githubusercontent.com/32593061/205299027-0e8e1d46-a49c-48c6-81b4-dfe83d8236ec.png) > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > simplify with get_block_for_node Much better, thanks for making these changes. Looks good. src/hotspot/share/opto/cfgnode.hpp line 595: > 593: public: > 594: NeverBranchNode( Node *ctrl ) : MultiBranchNode(1) { > 595: init_req(0,ctrl); Suggestion: NeverBranchNode(Node* ctrl) : MultiBranchNode(1) { init_req(0, ctrl); ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11481 From rrich at openjdk.org Fri Dec 9 10:16:24 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 9 Dec 2022 10:16:24 GMT Subject: [jdk20] RFR: 8298225: [AIX] Disable PPC64LE continuations on AIX In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 18:56:10 GMT, Tyler Steele wrote: > This small change adds an import to ppc.ad to allow it to find Contiuations::enabled, and sets VMContinuations to false on AIX. 
> > Thanks to @TheRealMDoerr for suggestions to a previous version of this PR. Looks reasonable. Thanks, Richard. ------------- Marked as reviewed by rrich (Reviewer). PR: https://git.openjdk.org/jdk20/pull/4 From thartmann at openjdk.org Fri Dec 9 10:51:55 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 9 Dec 2022 10:51:55 GMT Subject: RFR: 8298320: Typo in the comment block of catch_inline_exception In-Reply-To: <28BsntHiy-hfTs75vt4A6RD3g5RwcKu3EEClii24P1M=.7fab17bc-abd4-405f-b17a-016f615e6a79@github.com> References: <28BsntHiy-hfTs75vt4A6RD3g5RwcKu3EEClii24P1M=.7fab17bc-abd4-405f-b17a-016f615e6a79@github.com> Message-ID: On Thu, 8 Dec 2022 17:09:03 GMT, Xin Liu wrote: > The following comment makes reference to 'Deutsch-Shiffman'. I believe it's a typo. It should be 'Schiffman' if the author intent to cite this paper: >> Deutsch, L. Peter, and Allan M. Schiffman. "Efficient implementation of the Smalltalk-80 system." Proceedings of the 11th ACM SIGACT-SIGPLAN symposium on Principles of programming languages. 1984. > > I ask 'Deutsch-Shiffman' to google and this is what google answers me. seems reasonable. > > > // Case 2: we have some handlers, with loaded exception klasses that have > // no subklasses. We do a Deutsch-Shiffman style type-check on the incoming > // exception oop and branch to the handler directly. > ... > void Parse::catch_inline_exceptions(SafePointNode* ex_map) { Looks good and trivial. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11598 From epeter at openjdk.org Fri Dec 9 11:14:29 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 9 Dec 2022 11:14:29 GMT Subject: RFR: 8296389: C2: PhaseCFG::convert_NeverBranch_to_Goto must handle both orders of successors [v4] In-Reply-To: References: Message-ID: > The code in `PhaseCFG::convert_NeverBranch_to_Goto` looks like it is ready to have `idx == 1`, but it is not. > > We would read `succ` from `_succs[1]`. 
> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L626 > > Then overwrite `_succs[0]` with `succ`, and shorten the array. > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L635-L636 > > And finally attempt to read `dead` from `_succs[0]`, where the dead block used to be, but was just overwritten. > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L645 > > **Solution** > Read `dead` before overwriting it. I also made it more robust by going via the projections, and not assuming that the projections and successors are ordered equally (though that is probably guaranteed by the matching traversal). > > **Refactoring: added class id for NeverBranch** > I also added the class id for NeverBranch, and replaced all `Op_NeverBranch` checks with `is_NeverBranch()`. > > **Why did we never hit this bug before?** > Normal case: during matching, "succ" projection is added as output of NeverBranch before the "dead" projection leading to Halt. Thus, the outputs of NeverBranch are normally [[ "succ", "dead" ]], hence `idx == 0`. > Details: During DFS, usually we go from Halt to NeverBranch. Then via Region/Loop, take backedge, and find the "succ" edge. We already have its inputs (NeverBranch), thus we can now post-visit the live edge, and attach it to the NeverBranch first. Later, once we have processed the whole infinite loop, we post-visit out of NeverBranch to the "dead" projection edge, which we attach second. > > Rare case: "dead" projection is first attached to NeverBranch, and "succ" projection is added second. We have [[ "dead", "succ" ]], hence `idx == 1`. > We have a peeled infinite loop. The NeverBranch of the peeled iteration is first visited via the "dead" projection from HaltNode. 
Since the peeled iteration has no backedge, we do not visit the "succ" projection yet, but instead attach "dead" projection to HaltNode already once we are done visiting everything above. Later, we come from the peeled loop's NeverBranch exit, to the "succ" projection of the peeled iteration's NeverBranch, and attach the "succ" projection. > > ![image](https://user-images.githubusercontent.com/32593061/205299027-0e8e1d46-a49c-48c6-81b4-dfe83d8236ec.png) Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/cfgnode.hpp Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11481/files - new: https://git.openjdk.org/jdk/pull/11481/files/55ce9581..e119e963 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11481&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11481&range=02-03 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11481.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11481/head:pull/11481 PR: https://git.openjdk.org/jdk/pull/11481 From rkennke at openjdk.org Fri Dec 9 11:21:47 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 9 Dec 2022 11:21:47 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v13] In-Reply-To: References: Message-ID: <9IaSG4BV-0sLML6pL5zyp-Sg4E0q8H-o5KKuJ9spMvY=.554a8276-0796-4dbf-875b-f75e1b2b1feb@github.com> > Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. 
> > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) > - [x] tier3 (x86_64, x86_32, aarch64) Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 34 additional commits since the last revision: - Merge remote-tracking branch 'upstream/master' into JDK-8297036 - Fix copyrights - Merge remote-tracking branch 'upstream/master' into JDK-8297036 - PPC fixes - Update copyright notices - More renames. Duh - Rename C2CodeStub::size() -> max_size() - Relax size-check in C2CodeStubList::emit() - More RISCV fixes - PPC fix - ... and 24 more: https://git.openjdk.org/jdk/compare/cc36e60e...a91b7045 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11188/files - new: https://git.openjdk.org/jdk/pull/11188/files/e718ba6f..a91b7045 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=11-12 Stats: 17904 lines in 416 files changed: 13289 ins; 3226 del; 1389 mod Patch: https://git.openjdk.org/jdk/pull/11188.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11188/head:pull/11188 PR: https://git.openjdk.org/jdk/pull/11188 From mdoerr at openjdk.org Fri Dec 9 11:41:55 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 9 Dec 2022 11:41:55 GMT Subject: [jdk20] RFR: 8298225: [AIX] Disable PPC64LE continuations on AIX In-Reply-To: References: Message-ID: <5X7YRTse1zk5H9MTfNfNtuneXbnxiqDk16JiqslSjE0=.f26fb480-1516-46a6-8b65-f4b245477bd8@github.com> On Thu, 8 Dec 2022 18:56:10 GMT, Tyler Steele wrote: > This small change adds an import to ppc.ad to allow it to find Contiuations::enabled, and sets VMContinuations to false on AIX. > > Thanks to @TheRealMDoerr for suggestions to a previous version of this PR. Ok, switching it off for AIX makes sense for JDK 20. Do you know what's wrong on that OS? 
I believe there's not much missing. I recommend trying test/jdk/java/lang/Thread/virtual/stress tests and debugging it on linux PPC64 Big Endian. Maybe the frame header size is the problem which could cause GC to look at the wrong fields for example. (But, you can integrate this fix and continue in JDK 21.) src/hotspot/cpu/ppc/ppc.ad line 14378: > 14376: > 14377: source %{ > 14378: #include "runtime/continuation.hpp" I usually avoid spaces in front of preprocessor directives. But, I guess it's no longer problematic with recent compilers. ------------- Marked as reviewed by mdoerr (Reviewer). PR: https://git.openjdk.org/jdk20/pull/4 From rrich at openjdk.org Fri Dec 9 11:58:59 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 9 Dec 2022 11:58:59 GMT Subject: [jdk20] RFR: 8298225: [AIX] Disable PPC64LE continuations on AIX In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 18:56:10 GMT, Tyler Steele wrote: > This small change adds an import to ppc.ad to allow it to find Contiuations::enabled, and sets VMContinuations to false on AIX. > > Thanks to @TheRealMDoerr for suggestions to a previous version of this PR. My recommendation for tracking down issues would be test/jdk/jdk/internal/vm/Continuation/BasicExt.java with -Xlog:continuations=trace switched on. ------------- PR: https://git.openjdk.org/jdk20/pull/4 From thartmann at openjdk.org Fri Dec 9 12:31:06 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 9 Dec 2022 12:31:06 GMT Subject: [jdk20] RFR: 8298353: C2 fails with assert(opaq->outcnt() == 1 && opaq->in(1) == limit) failed In-Reply-To: References: Message-ID: On Fri, 9 Dec 2022 08:43:20 GMT, Roland Westrelin wrote: > After some unrolling, when C2 runs loop opts with split if enabled > after CCP, the limit of the main loop of the counted loop (the second > loop in the test) is: limit - 3 > > That commons with the limit - 3 returned from the first loop. 
limit - > 3 is thus in the first loop's body but only used outside of the > loop. It has 3 uses: The return in the first loop, the > OpaqueZeroTripGuard and loop exit conditionof the main loop. In the > same pass of loop opts, limit-3 is cloned out of the loop 3 times for > its 3 uses and unrolling is attempted. Because the OpaqueZeroTripGuard > and the loop exit condition now use 2 different nodes (until they > common at next igvn), the assert fires. > > The fix I propose restores the behavior before the introduction of > OpaqueZeroTripGuard which is to not sink a node if it has an Opaque1 > use. All tests passed, ship it! :) ------------- PR: https://git.openjdk.org/jdk20/pull/6 From gcao at openjdk.org Fri Dec 9 12:48:25 2022 From: gcao at openjdk.org (Gui Cao) Date: Fri, 9 Dec 2022 12:48:25 GMT Subject: Integrated: 8298345: Fix another two C2 IR matching tests for RISC-V In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 03:26:38 GMT, Gui Cao wrote: > Fix two IR matching tests that failed on RISC-V. > > Vector api Node will be matched only when UseRVV is enabled: > - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java > - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java > > Please take a look and have some reviews. Thanks a lot. 
> > ## Testing: > - fastdebug on unmatched board without support for RVV > - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - no tests selected as expected > - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java - no tests selected as expected > - fastdebug with -XX:+UseRVV on QEMU > - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - The C2 graph generated by the test is as expected > - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java -The C2 graph generated by the test is as expected > - fastdebug with -XX:-UseRVV on QEMU > - - test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java - no tests selected as expected > - - test/hotspot/jtreg/compiler/vectorization/TestAutoVecIntMinMax.java - no tests selected as expected This pull request has now been integrated. Changeset: 33d955ad Author: Gui Cao Committer: Julian Waters URL: https://git.openjdk.org/jdk/commit/33d955ad6e46eecd947e958ce295f6a6c348b2a6 Stats: 5 lines in 3 files changed: 3 ins; 0 del; 2 mod 8298345: Fix another two C2 IR matching tests for RISC-V Reviewed-by: fyang, dzhang, fjiang ------------- PR: https://git.openjdk.org/jdk/pull/11577 From stsypanov at openjdk.org Fri Dec 9 12:54:50 2022 From: stsypanov at openjdk.org (Sergey Tsypanov) Date: Fri, 9 Dec 2022 12:54:50 GMT Subject: Integrated: 8298380: Clean up redundant array length checks in JDK code base In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 12:37:17 GMT, Sergey Tsypanov wrote: > Newer version of IntelliJ IDEA introduces new [inspection](https://youtrack.jetbrains.com/issue/IDEA-301797/IDEA-should-report-redundant-array-length-check-in-certain-cases) detecting redundant array length check in snippets like > > void iterate(T[] items) { > if (items.length == 0) { > return; > } > for (T item : items) { > //... 
> } > } > > Here > > if (items.length == 0) { > return; > } > > is redundant and can be removed as length check is performed by for-each loop. This pull request has now been integrated. Changeset: e3c6cf8e Author: Sergey Tsypanov Committer: Julian Waters URL: https://git.openjdk.org/jdk/commit/e3c6cf8eaf931d9eb46b429a5ba8d3bbded3728a Stats: 51 lines in 8 files changed: 0 ins; 14 del; 37 mod 8298380: Clean up redundant array length checks in JDK code base Reviewed-by: dholmes, amenkov, serb, vtewari ------------- PR: https://git.openjdk.org/jdk/pull/11589 From eosterlund at openjdk.org Fri Dec 9 13:45:06 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Fri, 9 Dec 2022 13:45:06 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v13] In-Reply-To: <9IaSG4BV-0sLML6pL5zyp-Sg4E0q8H-o5KKuJ9spMvY=.554a8276-0796-4dbf-875b-f75e1b2b1feb@github.com> References: <9IaSG4BV-0sLML6pL5zyp-Sg4E0q8H-o5KKuJ9spMvY=.554a8276-0796-4dbf-875b-f75e1b2b1feb@github.com> Message-ID: On Fri, 9 Dec 2022 11:21:47 GMT, Roman Kennke wrote: >> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. >> >> Testing: >> - [x] tier1 (x86_64, x86_32, aarch64) >> - [x] tier2 (x86_64, x86_32, aarch64) >> - [x] tier3 (x86_64, x86_32, aarch64) > > Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 34 additional commits since the last revision: > > - Merge remote-tracking branch 'upstream/master' into JDK-8297036 > - Fix copyrights > - Merge remote-tracking branch 'upstream/master' into JDK-8297036 > - PPC fixes > - Update copyright notices > - More renames. Duh > - Rename C2CodeStub::size() -> max_size() > - Relax size-check in C2CodeStubList::emit() > - More RISCV fixes > - PPC fix > - ... and 24 more: https://git.openjdk.org/jdk/compare/ca09693d...a91b7045 Marked as reviewed by eosterlund (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11188 From xlinzheng at openjdk.org Fri Dec 9 14:31:04 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Fri, 9 Dec 2022 14:31:04 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v13] In-Reply-To: <9IaSG4BV-0sLML6pL5zyp-Sg4E0q8H-o5KKuJ9spMvY=.554a8276-0796-4dbf-875b-f75e1b2b1feb@github.com> References: <9IaSG4BV-0sLML6pL5zyp-Sg4E0q8H-o5KKuJ9spMvY=.554a8276-0796-4dbf-875b-f75e1b2b1feb@github.com> Message-ID: On Fri, 9 Dec 2022 11:21:47 GMT, Roman Kennke wrote: >> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. >> >> Testing: >> - [x] tier1 (x86_64, x86_32, aarch64) >> - [x] tier2 (x86_64, x86_32, aarch64) >> - [x] tier3 (x86_64, x86_32, aarch64) > > Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 34 additional commits since the last revision: > > - Merge remote-tracking branch 'upstream/master' into JDK-8297036 > - Fix copyrights > - Merge remote-tracking branch 'upstream/master' into JDK-8297036 > - PPC fixes > - Update copyright notices > - More renames. Duh > - Rename C2CodeStub::size() -> max_size() > - Relax size-check in C2CodeStubList::emit() > - More RISCV fixes > - PPC fix > - ... and 24 more: https://git.openjdk.org/jdk/compare/41b2e937...a91b7045 Hotspot tier1~4 results and other tests with RISC-V fastdebug build on my board look great. (Mainly testing the `riscv-11188-2.txt` diff.) ------------- PR: https://git.openjdk.org/jdk/pull/11188 From fyang at openjdk.org Fri Dec 9 14:36:14 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 9 Dec 2022 14:36:14 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v13] In-Reply-To: <9IaSG4BV-0sLML6pL5zyp-Sg4E0q8H-o5KKuJ9spMvY=.554a8276-0796-4dbf-875b-f75e1b2b1feb@github.com> References: <9IaSG4BV-0sLML6pL5zyp-Sg4E0q8H-o5KKuJ9spMvY=.554a8276-0796-4dbf-875b-f75e1b2b1feb@github.com> Message-ID: On Fri, 9 Dec 2022 11:21:47 GMT, Roman Kennke wrote: >> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. >> >> Testing: >> - [x] tier1 (x86_64, x86_32, aarch64) >> - [x] tier2 (x86_64, x86_32, aarch64) >> - [x] tier3 (x86_64, x86_32, aarch64) > > Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 34 additional commits since the last revision: > > - Merge remote-tracking branch 'upstream/master' into JDK-8297036 > - Fix copyrights > - Merge remote-tracking branch 'upstream/master' into JDK-8297036 > - PPC fixes > - Update copyright notices > - More renames. Duh > - Rename C2CodeStub::size() -> max_size() > - Relax size-check in C2CodeStubList::emit() > - More RISCV fixes > - PPC fix > - ... and 24 more: https://git.openjdk.org/jdk/compare/e7adf27e...a91b7045 Marked as reviewed by fyang (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11188 From rriggs at openjdk.org Fri Dec 9 14:38:03 2022 From: rriggs at openjdk.org (Roger Riggs) Date: Fri, 9 Dec 2022 14:38:03 GMT Subject: RFR: 8298380: Clean up redundant array length checks in JDK code base In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 12:37:17 GMT, Sergey Tsypanov wrote: > Newer version of IntelliJ IDEA introduces new [inspection](https://youtrack.jetbrains.com/issue/IDEA-301797/IDEA-should-report-redundant-array-length-check-in-certain-cases) detecting redundant array length check in snippets like > > void iterate(T[] items) { > if (items.length == 0) { > return; > } > for (T item : items) { > //... > } > } > > Here > > if (items.length == 0) { > return; > } > > is redundant and can be removed as length check is performed by for-each loop. @stsypanov , @TheShermanTanker You jumped the gun a bit on the integration and sponsoring. There was no approval for the core-libs parts from a "R"eviewer. ------------- PR: https://git.openjdk.org/jdk/pull/11589 From smonteith at openjdk.org Fri Dec 9 14:38:21 2022 From: smonteith at openjdk.org (Stuart Monteith) Date: Fri, 9 Dec 2022 14:38:21 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v4] In-Reply-To: References: Message-ID: > The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. 
There are no scalar instructions that perform the equivalent functions on aarch64, instead the intrinsics can be implemented with vector instructions included in SVE2; expand with BDEP, compress with BEXT. > > Only the first lane of each vector will be used, two MOV instructions will move the inputs from GPRs into temporary vector registers, and another to do the reverse for the result. Autovectorization for this functionality is/will be implemented separately. > > Running on an SVE2 enabled system, I ran the following benchmarks: > > org.openjdk.bench.java.lang.Integers > org.openjdk.bench.java.lang.Longs > > The time for each operation reduced to 56% to 72% of the original run time:
>
> Benchmark               Result   error   Unit    % against non-SVE2
> Integers.expand         2.106    0.011   us/op
> Integers.expand-SVE     1.431    0.009   us/op   67.95%
> Longs.expand            2.606    0.006   us/op
> Longs.expand-SVE        1.46     0.003   us/op   56.02%
> Integers.compress       1.982    0.004   us/op
> Integers.compress-SVE   1.427    0.003   us/op   72.00%
> Longs.compress          2.501    0.002   us/op
> Longs.compress-SVE      1.441    0.003   us/op   57.62%
>
> These methods can be specifically tested with:
> `make test TEST="jtreg:compiler/intrinsics/TestBitShuffleOpers.java"`

Stuart Monteith has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: - Merge branch 'openjdk:master' into JDK-8294194 - Merge branch 'openjdk:master' into JDK-8294194 - Update src/hotspot/cpu/aarch64/aarch64.ad Correct slight formatting error. Co-authored-by: Eric Liu - 8294194: Create intrinsics compress and expand The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions.
There are no scalar instructions that perform the equivalent functions on aarch64, instead the intrinsics can be implemented with vector instructions included in SVE2; expand with BDEP, compress with BEXT. Only the first lane of each vector will be used, two MOV instructions will move the inputs from GPRs into temporary vector registers, and another to do the reverse for the result. Autovectorization for this functionality is/will be implemented separately. Running on an SVE2 enabled system, I ran the following benchmarks: org.openjdk.bench.java.lang.Integers org.openjdk.bench.java.lang.Longs The time for each operation reduced to 56% to 72% of the original run time: Benchmark Result error Unit % against non-SVE2 Integers.expand 2.106 0.011 us/op Integers.expand-SVE 1.431 0.009 us/op 67.95% Longs.expand 2.606 0.006 us/op Longs.expand-SVE 1.46 0.003 us/op 56.02% Integers.compress 1.982 0.004 us/op Integers.compress-SVE 1.427 0.003 us/op 72.00% Longs.compress 2.501 0.002 us/op Longs.compress-SVE 1.441 0.003 us/op 57.62% ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10537/files - new: https://git.openjdk.org/jdk/pull/10537/files/a7484586..dee5d0f8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10537&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10537&range=02-03 Stats: 82704 lines in 1237 files changed: 40813 ins; 34833 del; 7058 mod Patch: https://git.openjdk.org/jdk/pull/10537.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10537/head:pull/10537 PR: https://git.openjdk.org/jdk/pull/10537 From stsypanov at openjdk.org Fri Dec 9 14:48:02 2022 From: stsypanov at openjdk.org (Sergey Tsypanov) Date: Fri, 9 Dec 2022 14:48:02 GMT Subject: RFR: 8298380: Clean up redundant array length checks in JDK code base In-Reply-To: References: Message-ID: On Fri, 9 Dec 2022 14:35:47 GMT, Roger Riggs wrote: >> Newer version of IntelliJ IDEA introduces new 
[inspection](https://youtrack.jetbrains.com/issue/IDEA-301797/IDEA-should-report-redundant-array-length-check-in-certain-cases) detecting redundant array length check in snippets like >> >> void iterate(T[] items) { >> if (items.length == 0) { >> return; >> } >> for (T item : items) { >> //... >> } >> } >> >> Here >> >> if (items.length == 0) { >> return; >> } >> >> is redundant and can be removed as length check is performed by for-each loop. > > @stsypanov , @TheShermanTanker You jumped the gun a bit on the integration and sponsoring. There was no approval for the core-libs parts from a "R"eviewer. @RogerRiggs changes are trivial. Should I revert any of them? ------------- PR: https://git.openjdk.org/jdk/pull/11589 From rkennke at openjdk.org Fri Dec 9 14:54:12 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 9 Dec 2022 14:54:12 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v13] In-Reply-To: <9IaSG4BV-0sLML6pL5zyp-Sg4E0q8H-o5KKuJ9spMvY=.554a8276-0796-4dbf-875b-f75e1b2b1feb@github.com> References: <9IaSG4BV-0sLML6pL5zyp-Sg4E0q8H-o5KKuJ9spMvY=.554a8276-0796-4dbf-875b-f75e1b2b1feb@github.com> Message-ID: On Fri, 9 Dec 2022 11:21:47 GMT, Roman Kennke wrote: >> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. >> >> Testing: >> - [x] tier1 (x86_64, x86_32, aarch64) >> - [x] tier2 (x86_64, x86_32, aarch64) >> - [x] tier3 (x86_64, x86_32, aarch64) > > Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 34 additional commits since the last revision: > > - Merge remote-tracking branch 'upstream/master' into JDK-8297036 > - Fix copyrights > - Merge remote-tracking branch 'upstream/master' into JDK-8297036 > - PPC fixes > - Update copyright notices > - More renames. Duh > - Rename C2CodeStub::size() -> max_size() > - Relax size-check in C2CodeStubList::emit() > - More RISCV fixes > - PPC fix > - ... and 24 more: https://git.openjdk.org/jdk/compare/13da291b...a91b7045 Thanks all for your help and reviews! GHA is also green, let's ------------- PR: https://git.openjdk.org/jdk/pull/11188 From rkennke at openjdk.org Fri Dec 9 14:56:01 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 9 Dec 2022 14:56:01 GMT Subject: Integrated: 8297036: Generalize C2 stub mechanism In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 15:03:07 GMT, Roman Kennke wrote: > Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. > > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) > - [x] tier3 (x86_64, x86_32, aarch64) This pull request has now been integrated. 
Changeset: b30b464d Author: Roman Kennke URL: https://git.openjdk.org/jdk/commit/b30b464d054716bbc3d4d70633b740b227b8775d Stats: 908 lines in 21 files changed: 435 ins; 453 del; 20 mod 8297036: Generalize C2 stub mechanism Co-authored-by: Aleksey Shipilev Co-authored-by: Xiaolin Zheng Reviewed-by: eosterlund, kvn, fyang ------------- PR: https://git.openjdk.org/jdk/pull/11188 From tholenstein at openjdk.org Fri Dec 9 15:05:58 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 9 Dec 2022 15:05:58 GMT Subject: [jdk20] RFR: JDK-8289748: C2 compiled code crashes with SIGFPE with -XX:+StressLCM and -XX:+StressGCM Message-ID: <54udVksrhWVx9V1jzxmEM4fNZvZkN_jc3yxR0_8xn-w=.61f08459-3006-4b72-b59e-a3f1bec5f7af@github.com> # Problem `203 CountedLoop` is the post loop of a strip-mined inner loop. In the loop we have `198 ModI` = 1 / tripcount, where tripcount can never be zero in this loop. The tripcount `204 Phi` is further pinned with a `224 CastII`. before `IdealLoopTree::do_remove_empty_loop(...)` now replaces the tripcounter `204 Phi` with the value that the loop will have on the last iteration: `80 Phi` (exact limit) - `Const 1` (stride). Here is where the mistake happens: `222 If` is the zero trip guard of the post loop and prevents that the post loop is executed when the divisor of `198 ModI` would be zero. BUT tripcounter `204 Phi` is removed without checking if it is pinned to the IF with a CastNode. Therefore the `198 ModI` now floats above the `222 if` where it now can be a modulo zero operation. fail # Solution The Solution is to check if the tripcount is pinned with a CastII that carries a dependency. 
If yes, we create a new CastII to pin final_iv (exact_limit - stride): fix ------------- Commit messages: - remote UseG1GC from test - JDK-8289748: C2 compiled code crashes with SIGFPE with -XX:+StressLCM and -XX:+StressGCM Changes: https://git.openjdk.org/jdk20/pull/8/files Webrev: https://webrevs.openjdk.org/?repo=jdk20&pr=8&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289748 Stats: 64 lines in 2 files changed: 64 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk20/pull/8.diff Fetch: git fetch https://git.openjdk.org/jdk20 pull/8/head:pull/8 PR: https://git.openjdk.org/jdk20/pull/8 From kvn at openjdk.org Fri Dec 9 15:32:56 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 9 Dec 2022 15:32:56 GMT Subject: [jdk20] RFR: 8298353: C2 fails with assert(opaq->outcnt() == 1 && opaq->in(1) == limit) failed In-Reply-To: References: Message-ID: On Fri, 9 Dec 2022 08:43:20 GMT, Roland Westrelin wrote: > After some unrolling, when C2 runs loop opts with split if enabled > after CCP, the limit of the main loop of the counted loop (the second > loop in the test) is: limit - 3 > > That commons with the limit - 3 returned from the first loop. limit - > 3 is thus in the first loop's body but only used outside of the > loop. It has 3 uses: The return in the first loop, the > OpaqueZeroTripGuard and loop exit condition of the main loop. In the > same pass of loop opts, limit-3 is cloned out of the loop 3 times for > its 3 uses and unrolling is attempted. Because the OpaqueZeroTripGuard > and the loop exit condition now use 2 different nodes (until they > common at next igvn), the assert fires. > > The fix I propose restores the behavior before the introduction of > OpaqueZeroTripGuard which is to not sink a node if it has an Opaque1 > use. Good. ------------- Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.org/jdk20/pull/6 From roland at openjdk.org Fri Dec 9 15:32:57 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 9 Dec 2022 15:32:57 GMT Subject: [jdk20] RFR: 8298353: C2 fails with assert(opaq->outcnt() == 1 && opaq->in(1) == limit) failed In-Reply-To: References: Message-ID: On Fri, 9 Dec 2022 12:28:52 GMT, Tobias Hartmann wrote: >> After some unrolling, when C2 runs loop opts with split if enabled >> after CCP, the limit of the main loop of the counted loop (the second >> loop in the test) is: limit - 3 >> >> That commons with the limit - 3 returned from the first loop. limit - >> 3 is thus in the first loop's body but only used outside of the >> loop. It has 3 uses: The return in the first loop, the >> OpaqueZeroTripGuard and loop exit conditionof the main loop. In the >> same pass of loop opts, limit-3 is cloned out of the loop 3 times for >> its 3 uses and unrolling is attempted. Because the OpaqueZeroTripGuard >> and the loop exit condition now use 2 different nodes (until they >> common at next igvn), the assert fires. >> >> The fix I propose restores the behavior before the introduction of >> OpaqueZeroTripGuard which is to not sink a node if it has an Opaque1 >> use. > > All tests passed, ship it! :) @TobiHartmann @vnkozlov @chhagedorn thanks! ------------- PR: https://git.openjdk.org/jdk20/pull/6 From roland at openjdk.org Fri Dec 9 15:36:54 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 9 Dec 2022 15:36:54 GMT Subject: [jdk20] Integrated: 8298353: C2 fails with assert(opaq->outcnt() == 1 && opaq->in(1) == limit) failed In-Reply-To: References: Message-ID: On Fri, 9 Dec 2022 08:43:20 GMT, Roland Westrelin wrote: > After some unrolling, when C2 runs loop opts with split if enabled > after CCP, the limit of the main loop of the counted loop (the second > loop in the test) is: limit - 3 > > That commons with the limit - 3 returned from the first loop. 
limit - > 3 is thus in the first loop's body but only used outside of the > loop. It has 3 uses: The return in the first loop, the > OpaqueZeroTripGuard and loop exit conditionof the main loop. In the > same pass of loop opts, limit-3 is cloned out of the loop 3 times for > its 3 uses and unrolling is attempted. Because the OpaqueZeroTripGuard > and the loop exit condition now use 2 different nodes (until they > common at next igvn), the assert fires. > > The fix I propose restores the behavior before the introduction of > OpaqueZeroTripGuard which is to not sink a node if it has an Opaque1 > use. This pull request has now been integrated. Changeset: b7b996cb Author: Roland Westrelin URL: https://git.openjdk.org/jdk20/commit/b7b996cb9475f8191d4085a2f7f68187b6f015d5 Stats: 80 lines in 2 files changed: 79 ins; 0 del; 1 mod 8298353: C2 fails with assert(opaq->outcnt() == 1 && opaq->in(1) == limit) failed Reviewed-by: chagedorn, thartmann, kvn ------------- PR: https://git.openjdk.org/jdk20/pull/6 From chagedorn at openjdk.org Fri Dec 9 15:37:49 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 9 Dec 2022 15:37:49 GMT Subject: [jdk20] RFR: JDK-8289748: C2 compiled code crashes with SIGFPE with -XX:+StressLCM and -XX:+StressGCM In-Reply-To: <54udVksrhWVx9V1jzxmEM4fNZvZkN_jc3yxR0_8xn-w=.61f08459-3006-4b72-b59e-a3f1bec5f7af@github.com> References: <54udVksrhWVx9V1jzxmEM4fNZvZkN_jc3yxR0_8xn-w=.61f08459-3006-4b72-b59e-a3f1bec5f7af@github.com> Message-ID: <5JsnVo2fZa49sM9ONpZbY44OGnhEY2DYuUSCUUIgdYQ=.e6ca48ac-63c5-4288-92f7-97b084d78199@github.com> On Fri, 9 Dec 2022 11:12:35 GMT, Tobias Holenstein wrote: > # Problem > > `203 CountedLoop` is the post loop of a strip-mined inner loop. In the loop we have `198 ModI` = 1 / tripcount, where tripcount can never be zero in this loop. The tripcount `204 Phi` is further pinned with a `224 CastII`. 
> > before > > `IdealLoopTree::do_remove_empty_loop(...)` now replaces the tripcounter `204 Phi` with the value that the loop will have on the last iteration: `80 Phi` (exact limit) - `Const 1` (stride). Here is where the mistake happens: > `222 If` is the zero trip guard of the post loop and prevents that the post loop is executed when the divisor of `198 ModI` would be zero. BUT tripcounter `204 Phi` is removed without checking if it is pinned to the IF with a CastNode. Therefore the `198 ModI` now floats above the `222 if` where it now can be a modulo zero operation. > > fail > > # Solution > The Solution is to check if the tripcount is pinned with a CastII that carries a dependency. If Yes, we create a new CastII to pin final_iv (exact_ limit - stride) : > fix Otherwise, the fix looks good to me! src/hotspot/share/opto/loopTransform.cpp line 3694: > 3692: if (castii->is_CastII() && castii->as_CastII()->carry_dependency()) { > 3693: Node* cast = ConstraintCastNode::make(castii->in(0), exact_limit, phase->_igvn.type(exact_limit), ConstraintCastNode::UnconditionalDependency, T_INT); > 3694: phase->_igvn.register_new_node_with_optimizer(cast); We should use `register_new_node` here to also set ctrl. Suggestion: phase->register_new_node(cast, castii->in(0)); ------------- Changes requested by chagedorn (Reviewer). 
PR: https://git.openjdk.org/jdk20/pull/8 From rriggs at openjdk.org Fri Dec 9 15:54:00 2022 From: rriggs at openjdk.org (Roger Riggs) Date: Fri, 9 Dec 2022 15:54:00 GMT Subject: RFR: 8298380: Clean up redundant array length checks in JDK code base In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 12:37:17 GMT, Sergey Tsypanov wrote: > Newer version of IntelliJ IDEA introduces new [inspection](https://youtrack.jetbrains.com/issue/IDEA-301797/IDEA-should-report-redundant-array-length-check-in-certain-cases) detecting redundant array length check in snippets like > > void iterate(T[] items) { > if (items.length == 0) { > return; > } > for (T item : items) { > //... > } > } > > Here > > if (items.length == 0) { > return; > } > > is redundant and can be removed as length check is performed by for-each loop. No, just a reminder to be thorough in the process. ------------- PR: https://git.openjdk.org/jdk/pull/11589 From tsteele at openjdk.org Fri Dec 9 15:53:52 2022 From: tsteele at openjdk.org (Tyler Steele) Date: Fri, 9 Dec 2022 15:53:52 GMT Subject: [jdk20] RFR: 8298225: [AIX] Disable PPC64LE continuations on AIX In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 18:56:10 GMT, Tyler Steele wrote: > This small change adds an import to ppc.ad to allow it to find Continuations::enabled, and sets VMContinuations to false on AIX. > > Thanks to @TheRealMDoerr for suggestions to a previous version of this PR. Thanks for the reviews! --- > My recommendation for tracking down issues would be test/jdk/jdk/internal/vm/Continuation/BasicExt.java with -Xlog:continuations=trace switched on. and > Do you know what's wrong on that OS? I believe there's not much missing. I recommend trying test/jdk/java/lang/Thread/virtual/stress tests and debugging it on linux PPC64 Big Endian. Thanks for the suggestions, I think they will help :-). I have been examining the failures, but not with continuations=trace, and on AIX rather than Linux/PPC64 big-endian.
All in all, I am surprised at how well Richard's changes for PPC64-LE are working in a different ABI than they were implemented for. We'll see how much progress I can make before 19 Jan :crossed_fingers: ------------- PR: https://git.openjdk.org/jdk20/pull/4 From tsteele at openjdk.org Fri Dec 9 15:59:59 2022 From: tsteele at openjdk.org (Tyler Steele) Date: Fri, 9 Dec 2022 15:59:59 GMT Subject: [jdk20] RFR: 8298225: [AIX] Disable PPC64LE continuations on AIX [v2] In-Reply-To: <5X7YRTse1zk5H9MTfNfNtuneXbnxiqDk16JiqslSjE0=.f26fb480-1516-46a6-8b65-f4b245477bd8@github.com> References: <5X7YRTse1zk5H9MTfNfNtuneXbnxiqDk16JiqslSjE0=.f26fb480-1516-46a6-8b65-f4b245477bd8@github.com> Message-ID: On Fri, 9 Dec 2022 11:19:33 GMT, Martin Doerr wrote: >> Tyler Steele has updated the pull request incrementally with one additional commit since the last revision: >> >> Removes indenting before #include macro in ppc.ad > > src/hotspot/cpu/ppc/ppc.ad line 14378: > >> 14376: >> 14377: source %{ >> 14378: #include "runtime/continuation.hpp" > > I usually avoid spaces in front of preprocessor directives. But, I guess it's no longer problematic with recent compilers. Good note. I've made this change to keep it consistent. ------------- PR: https://git.openjdk.org/jdk20/pull/4 From tsteele at openjdk.org Fri Dec 9 15:59:58 2022 From: tsteele at openjdk.org (Tyler Steele) Date: Fri, 9 Dec 2022 15:59:58 GMT Subject: [jdk20] RFR: 8298225: [AIX] Disable PPC64LE continuations on AIX [v2] In-Reply-To: References: Message-ID: <4c4WWuIvRMXsPEe8GCL_TM8In91IJKBasaWyDecCnHc=.72884ce7-0f41-4b79-b9b7-f8f5fec9585f@github.com> > This small change adds an import to ppc.ad to allow it to find Contiuations::enabled, and sets VMContinuations to false on AIX. > > Thanks to @TheRealMDoerr for suggestions to a previous version of this PR. 
Tyler Steele has updated the pull request incrementally with one additional commit since the last revision: Removes indenting before #include macro in ppc.ad ------------- Changes: - all: https://git.openjdk.org/jdk20/pull/4/files - new: https://git.openjdk.org/jdk20/pull/4/files/74c36006..d3c9a764 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk20&pr=4&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk20&pr=4&range=00-01 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk20/pull/4.diff Fetch: git fetch https://git.openjdk.org/jdk20 pull/4/head:pull/4 PR: https://git.openjdk.org/jdk20/pull/4 From mdoerr at openjdk.org Fri Dec 9 16:02:07 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 9 Dec 2022 16:02:07 GMT Subject: [jdk20] RFR: 8298225: [AIX] Disable PPC64LE continuations on AIX [v2] In-Reply-To: <4c4WWuIvRMXsPEe8GCL_TM8In91IJKBasaWyDecCnHc=.72884ce7-0f41-4b79-b9b7-f8f5fec9585f@github.com> References: <4c4WWuIvRMXsPEe8GCL_TM8In91IJKBasaWyDecCnHc=.72884ce7-0f41-4b79-b9b7-f8f5fec9585f@github.com> Message-ID: <1bcBIbkkmPk5AL8eXpsYmCtQxXQ_Kp9bd0CQkx8oqdo=.80d1e557-3107-4303-9b5d-0e43fa52080f@github.com> On Fri, 9 Dec 2022 15:59:58 GMT, Tyler Steele wrote: >> This small change adds an import to ppc.ad to allow it to find Contiuations::enabled, and sets VMContinuations to false on AIX. >> >> Thanks to @TheRealMDoerr for suggestions to a previous version of this PR. > > Tyler Steele has updated the pull request incrementally with one additional commit since the last revision: > > Removes indenting before #include macro in ppc.ad Marked as reviewed by mdoerr (Reviewer). 
------------- PR: https://git.openjdk.org/jdk20/pull/4 From xliu at openjdk.org Fri Dec 9 16:53:38 2022 From: xliu at openjdk.org (Xin Liu) Date: Fri, 9 Dec 2022 16:53:38 GMT Subject: Integrated: 8298320: Typo in the comment block of catch_inline_exception In-Reply-To: <28BsntHiy-hfTs75vt4A6RD3g5RwcKu3EEClii24P1M=.7fab17bc-abd4-405f-b17a-016f615e6a79@github.com> References: <28BsntHiy-hfTs75vt4A6RD3g5RwcKu3EEClii24P1M=.7fab17bc-abd4-405f-b17a-016f615e6a79@github.com> Message-ID: On Thu, 8 Dec 2022 17:09:03 GMT, Xin Liu wrote: > The following comment makes reference to 'Deutsch-Shiffman'. I believe it's a typo. It should be 'Schiffman' if the author intended to cite this paper: >> Deutsch, L. Peter, and Allan M. Schiffman. "Efficient implementation of the Smalltalk-80 system." Proceedings of the 11th ACM SIGACT-SIGPLAN symposium on Principles of programming languages. 1984. > > I searched Google for 'Deutsch-Shiffman' and this paper is what it returned, which seems reasonable. > > > // Case 2: we have some handlers, with loaded exception klasses that have > // no subklasses. We do a Deutsch-Shiffman style type-check on the incoming > // exception oop and branch to the handler directly. > ... > void Parse::catch_inline_exceptions(SafePointNode* ex_map) { This pull request has now been integrated.
Changeset: 93465354 Author: Xin Liu URL: https://git.openjdk.org/jdk/commit/9346535415b158aaaa679ef8c3c147595b5206e9 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8298320: Typo in the comment block of catch_inline_exception Reviewed-by: thartmann ------------- PR: https://git.openjdk.org/jdk/pull/11598 From tsteele at openjdk.org Fri Dec 9 17:06:55 2022 From: tsteele at openjdk.org (Tyler Steele) Date: Fri, 9 Dec 2022 17:06:55 GMT Subject: [jdk20] Integrated: 8298225: [AIX] Disable PPC64LE continuations on AIX In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 18:56:10 GMT, Tyler Steele wrote: > This small change adds an import to ppc.ad to allow it to find Contiuations::enabled, and sets VMContinuations to false on AIX. > > Thanks to @TheRealMDoerr for suggestions to a previous version of this PR. This pull request has now been integrated. Changeset: a8946490 Author: Tyler Steele URL: https://git.openjdk.org/jdk20/commit/a8946490e2b362d241c61cc459dbaba93fc93ca4 Stats: 7 lines in 2 files changed: 6 ins; 0 del; 1 mod 8298225: [AIX] Disable PPC64LE continuations on AIX Reviewed-by: rrich, mdoerr ------------- PR: https://git.openjdk.org/jdk20/pull/4 From kvn at openjdk.org Fri Dec 9 18:21:10 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 9 Dec 2022 18:21:10 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v13] In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 14:29:27 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. 
>> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and have some reviews. Thank you very much. 
> > Quan Anh Mai has updated the pull request incrementally with seven additional commits since the last revision: > > - change julong to uint64_t > - uint > - various fixes > - add constexpr > - add constexpr > - add message to static_assert > - missing powerOfTwo.hpp Changes seem fine. I will start long testing to make sure everything is good. ------------- PR: https://git.openjdk.org/jdk/pull/9947 From jrose at openjdk.org Fri Dec 9 20:27:09 2022 From: jrose at openjdk.org (John R Rose) Date: Fri, 9 Dec 2022 20:27:09 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v13] In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 14:29:27 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. >> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. 
For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with seven additional commits since the last revision: > > - change julong to uint64_t > - uint > - various fixes > - add constexpr > - add constexpr > - add message to static_assert > - missing powerOfTwo.hpp I like the way this is shaping up. Thank you for doing the gtests; they provide significant protection against hard-to-track JIT bugs. I suggest hardening the gtests a little more, by actually performing the arithmetic in question in the gtest. For example, test_magic_int_divide could loop through a few thousand numbers performing signed division both ways and verifying that the answers agree. Suggestion for test numbers: critical values (min, max, zero, selected multiples of d), plus or minus 0, 1, ..., N (some small N like 3). Selected multiples of d should span the full dynamic range, so 0 +/- dK, max - dK, min + dK, maybe for about 20 K in 0,1,2,4,8,16,20,...,56,60. That's about 300 numbers. Maybe also sums of pairs of the preceding multiples (there are about 200 such pairs), offset by small N from zero, min, and max, for a total of about 3000 division operations to verify.
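A check along the lines suggested above could be sketched as follows. This is only an illustration under stated assumptions: it does not call the patch's `magic_int_divide`, but derives a constant with a simple textbook construction (`m = 31 + ceil(log2(d))`, `c = floor(2**m / d) + 1`), and the names `MagicInt`, `magic_int_sketch`, `sdiv_by_magic` and `sdiv_sketch_agrees` are hypothetical. The list of K values is a subset chosen for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper type for this sketch; not taken from the patch.
struct MagicInt {
  int64_t c;  // multiplier, at most 2**32
  int m;      // shift amount
};

// Constant for signed 32-bit division by a positive d.
MagicInt magic_int_sketch(int32_t d) {
  assert(d > 0);
  int s = 0;                             // s = ceil(log2(d))
  while ((int64_t(1) << s) < d) ++s;
  int m = 31 + s;                        // m <= 62
  return { (int64_t(1) << m) / d + 1, m };
}

// Computes x / d (truncating) as a multiply and a shift in int64.
int32_t sdiv_by_magic(int32_t x, MagicInt mg) {
  int64_t prod = int64_t(x) * mg.c;      // |prod| <= 2**63, no overflow
  // Floor division by 2**m that does not rely on arithmetic right shift.
  int64_t fl = prod >= 0 ? (prod >> mg.m) : ~((~prod) >> mg.m);
  return int32_t(x < 0 ? fl + 1 : fl);   // negative x rounds toward zero
}

// Walks critical values: 0, min and max, offset by multiples of d and by
// small deltas, and compares against the hardware division.
bool sdiv_sketch_agrees(int32_t d) {
  const int64_t lo = std::numeric_limits<int32_t>::min();
  const int64_t hi = std::numeric_limits<int32_t>::max();
  const int64_t ks[] = { 0, 1, 2, 4, 8, 16, 20, 56, 60 };
  MagicInt mg = magic_int_sketch(d);
  for (int64_t k : ks) {
    const int64_t bases[] = { k * d, -k * d, hi - k * d, lo + k * d };
    for (int64_t base : bases) {
      for (int64_t off = -3; off <= 3; ++off) {
        int64_t x = base + off;
        if (x < lo || x > hi) continue;  // skip values outside int32
        int32_t xi = int32_t(x);
        if (sdiv_by_magic(xi, mg) != xi / d) return false;
      }
    }
  }
  return true;
}
```

Because the constant is computed rather than quoted, the same loop can be run over an arbitrary set of divisors without copying magic numbers from a compiler's output.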
The gtest should also try an arbitrary set of additional d values, without the need to quote a magic constant obtained from a C++ compiler. If we try a few hundred extra d values (of varying shapes and sizes), the gtest will complete after doing only a few million divisions. ------------- PR: https://git.openjdk.org/jdk/pull/9947 From kbarrett at openjdk.org Fri Dec 9 22:11:41 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 9 Dec 2022 22:11:41 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs Message-ID: Please review this change to construction and copying of the Relocation and RelocationHolder classes, to eliminate some questionable C++ usage. The array type for RelocationHolder::_relocbuf is changed from void* to char, because using a char array for raw memory is countenanced by the standard, while not so much for an array of void*. The desired alignment is maintained via a union, since using alignas is not (yet) permitted in HotSpot code. There is also now a comment discussing the use of _relocbuf in more detail, including some areas of continued sketchiness with respect to standard conformance and reliance on implementation-dependent behavior. No longer use trivial copy and assignment for RelocationHolder, since that isn't technically correct. The Relocation in the holder is not trivially copyable, since it is polymorphic. It seemed to work in practice with the supported compilers, but we shouldn't (and don't need to) rely on it. Instead we have a new virtual function Relocation::copy_into that copies the most derived object into the holder's _relocbuf using placement new. Eliminated the implicit conversion constructor from Relocation to holder that wordwise copied (to possibly beyond the end of) the Relocation into the holder's _relocbuf. We could have implemented this more carefully with the new approach (using copy_into), but we don't actually need this conversion. The only use of it was actually a wasted copy (in assembler_x86.cpp).
Eliminated the use of placement new syntax via operator new with a holder argument to copy a Resource object into a holder. This included runtime verification that the size of the object is not greater than the size of _relocbuf; we now do corresponding verification at compile-time. This also included an incorrect attempt at a runtime check that the Relocation base class would be at the same address as the derived class being constructed; we now perform that check correctly. We also discuss in a comment the layout assumption being made (that isn't required by the standard but is provided by all supported compilers), and what to do if we encounter a compiler that behaves differently. Eliminated the idiom of making a default-constructed holder and then overwriting its held relocation with a newly constructed one, using the afore mentioned (and eliminated) operator new. Instead, RelocationHolder now has a factory function template (construct) for creating holders with a given resource type, constructed using provided arguments. (The arguments are taken as const-ref rather than using perfect forwarding, as the tools for the latter are not (yet) approved for use in HotSpot. Using const-ref is good enough in this case.) Describe and verify other assumptions being made, such as all Relocation classes being trivially destructible. Testing: mach5 tier1-5 Future work: * RelocationHolder::reloc isn't const-correct. Making it so will require adjustment of some callers. I'll follow up with an RFE to address this. * Relocation classes have many virtual function overrides that are unmarked. I'll follow up with an RFE to add "override" specifiers. Potential issue: The removal of RelocationHolder(Relocation*) might not work for some platforms. I've tested on platforms supported by Oracle (where there was only one (mistaken) use). There might be uses by other platforms. 
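The pattern described in this message (raw char buffer aligned through a union, a virtual `copy_into` using placement new, compile-time size checks, and a `construct` factory taking const-ref arguments) can be sketched in isolation as below. This is a simplified illustration, not the HotSpot code: `SketchReloc`, `SketchCallReloc`, `SketchHolder`, `emplace_copy`, `check_fits` and the buffer size are all hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <type_traits>

class SketchHolder;

// Hypothetical polymorphic base, standing in for a relocation class.
class SketchReloc {
 public:
  virtual int kind() const { return 0; }
  // Copies the most derived object into the holder's buffer.
  virtual void copy_into(SketchHolder& holder) const;
};

class SketchHolder {
 public:
  SketchHolder() { new (_buf) SketchReloc(); }
  // Copying goes through the virtual function, not a bytewise copy.
  SketchHolder(const SketchHolder& from) { from.reloc()->copy_into(*this); }
  SketchHolder& operator=(const SketchHolder& from) {
    if (this != &from) from.reloc()->copy_into(*this);
    return *this;
  }

  // Factory: arguments are taken by const-ref, no perfect forwarding.
  template <typename R, typename... Args>
  static SketchHolder construct(const Args&... args) {
    check_fits<R>();
    SketchHolder holder{NoInit{}};
    new (holder._buf) R(args...);
    return holder;
  }

  template <typename R>
  void emplace_copy(const R& from) {
    check_fits<R>();
    new (_buf) R(from);  // old object is trivially destructible
  }

  // Assumes the base subobject sits at the start of the derived object,
  // a layout the standard does not promise.
  const SketchReloc* reloc() const {
    return reinterpret_cast<const SketchReloc*>(_buf);
  }

 private:
  struct NoInit {};
  explicit SketchHolder(NoInit) {}

  template <typename R>
  static void check_fits() {
    static_assert(std::is_base_of<SketchReloc, R>::value, "not a reloc");
    static_assert(sizeof(R) <= kSize, "buffer too small");
    static_assert(alignof(R) <= alignof(void*), "buffer under-aligned");
    static_assert(std::is_trivially_destructible<R>::value, "needs dtor");
  }

  static constexpr std::size_t kSize = 4 * sizeof(void*);
  union {
    char _buf[kSize];  // raw storage for the held object
    void* _align;      // forces pointer alignment without alignas
  };
};

inline void SketchReloc::copy_into(SketchHolder& holder) const {
  holder.emplace_copy(*this);
}

class SketchCallReloc : public SketchReloc {
 public:
  explicit SketchCallReloc(int target) : _target(target) {}
  int kind() const override { return 1; }
  int target() const { return _target; }
  void copy_into(SketchHolder& holder) const override {
    holder.emplace_copy(*this);
  }
 private:
  int _target;
};
```

A holder would then be created with something like `SketchHolder::construct<SketchCallReloc>(42)`, and copies preserve the dynamic type because each class overrides `copy_into`. Reading the object back through `reinterpret_cast` is exactly the kind of implementation-dependent step the message says is documented in a comment.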
------------- Commit messages: - fix constructors and assigns Changes: https://git.openjdk.org/jdk/pull/11618/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11618&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8160404 Stats: 260 lines in 3 files changed: 147 ins; 51 del; 62 mod Patch: https://git.openjdk.org/jdk/pull/11618.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11618/head:pull/11618 PR: https://git.openjdk.org/jdk/pull/11618 From kvn at openjdk.org Fri Dec 9 22:56:56 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 9 Dec 2022 22:56:56 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs In-Reply-To: References: Message-ID: <4eaONf9gghB1Ho7wdHqz3QygXMYVckEVhBevD7xRaLg=.608654c6-f08e-4599-95df-8259742d1a35@github.com> On Fri, 9 Dec 2022 22:00:59 GMT, Kim Barrett wrote: > Please review this change to construction and copying of the Relocation and > RelocationHolder classes, to eliminate some questionable C++ usage. > > The array type for RelocationHandle::_relocbuf is changed from void* to char, > because using a char array for raw memory is countenanced by the standard, > while not so much for an array of void*. The desired alignment is maintained > via a union, since using alignas is not (yet) permitted in HotSpot code. > > There is also now a comment discussing the use of _relocbuf in more detail, > including some areas of continued sketchiness wrto standard conformance and > reliance on implementation dependent behavior. > > No longer use trivial copy and assignment for RelocationHolder, since that > isn't technically correct. The Relocation in the holder is not trivially > copyable, since it is polymorphic. It seemed to work in practice with the > supported compilers, but we shouldn't (and don't need to) rely on it. Instead > we have a new virtual function Relocation::copy_into that copies the most > derived object into the holder's _relocbuf using placement new. 
> > Eliminated the implict conversion constructor from Relocation to holder that > wordwise copied (to possibly beyond the end of) the Relocation into the > holder's _relocbuf. We could have implemented this more carefully with the > new approach (using copy_into), but we don't actually need this conversion. > The only use of it was actually a wasted copy (in assembler_x86.cpp). > > Eliminated the use of placement new syntax via operator new with a holder > argument to copy a Resource object into a holder. This included runtime > verification that the size of the object is not greater than the size of > _relocbuf; we now do corresponding verification at compile-time. This also > included an incorrect attempt at a runtime check that the Relocation base > class would be at the same address as the derived class being constructed; we > now perform that check correctly. We also discuss in a comment the layout > assumption being made (that isn't required by the standard but is provided by > all supported compilers), and what to do if we encounter a compiler that > behaves differently. > > Eliminated the idiom of making a default-constructed holder and then > overwriting its held relocation with a newly constructed one, using the afore > mentioned (and eliminated) operator new. Instead, RelocationHolder now has a > factory function template (construct) for creating holders with a given > resource type, constructed using provided arguments. (The arguments are taken > as const-ref rather than using perfect forwarding, as the tools for the latter > are not (yet) approved for use in HotSpot. Using const-ref is good enough in > this case.) > > Describe and verify other assumptions being made, such as all Relocation > classes being trivially destructible. > > Testing: > mach5 tier1-5 > > Future work: > > * RelocationHolder::reloc isn't const-correct. Making it so will require > adjustment of some callers. I'll follow up with an RFE to address this. 
> > * Relocation classes have many virtual function overrides that are unmarked. > I'll follow up with an RFE to add "override" specifiers. > > Potential issue: The removal of RelocationHolder(Relocation*) might not work > for some platforms. I've tested on platforms supported by Oracle (where there > was only one (mistaken) use). There might be uses by other platforms. Nice cleanup. Thank you, Kim. Did you test it in mach5? ------------- PR: https://git.openjdk.org/jdk/pull/11618 From kbarrett at openjdk.org Fri Dec 9 23:06:01 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 9 Dec 2022 23:06:01 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs In-Reply-To: <4eaONf9gghB1Ho7wdHqz3QygXMYVckEVhBevD7xRaLg=.608654c6-f08e-4599-95df-8259742d1a35@github.com> References: <4eaONf9gghB1Ho7wdHqz3QygXMYVckEVhBevD7xRaLg=.608654c6-f08e-4599-95df-8259742d1a35@github.com> Message-ID: <04GXgM-v2-4LXssrHg1aG5rQ7d46fVjCWSkLUZ2gRKk=.2573ba2d-d485-4f2f-a01b-70bbe2e28cf0@github.com> On Fri, 9 Dec 2022 22:54:52 GMT, Vladimir Kozlov wrote: > Did you test it in mach5? Yes, tier1-5, as mentioned in the PR description (though somewhat buried in there). ------------- PR: https://git.openjdk.org/jdk/pull/11618 From kvn at openjdk.org Fri Dec 9 23:25:55 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 9 Dec 2022 23:25:55 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs In-Reply-To: References: Message-ID: On Fri, 9 Dec 2022 22:00:59 GMT, Kim Barrett wrote: > Please review this change to construction and copying of the Relocation and > RelocationHolder classes, to eliminate some questionable C++ usage. > > The array type for RelocationHolder::_relocbuf is changed from void* to char, > because using a char array for raw memory is countenanced by the standard, > while not so much for an array of void*. The desired alignment is maintained > via a union, since using alignas is not (yet) permitted in HotSpot code.
> > There is also now a comment discussing the use of _relocbuf in more detail, > including some areas of continued sketchiness wrto standard conformance and > reliance on implementation dependent behavior. > > No longer use trivial copy and assignment for RelocationHolder, since that > isn't technically correct. The Relocation in the holder is not trivially > copyable, since it is polymorphic. It seemed to work in practice with the > supported compilers, but we shouldn't (and don't need to) rely on it. Instead > we have a new virtual function Relocation::copy_into that copies the most > derived object into the holder's _relocbuf using placement new. > > Eliminated the implict conversion constructor from Relocation to holder that > wordwise copied (to possibly beyond the end of) the Relocation into the > holder's _relocbuf. We could have implemented this more carefully with the > new approach (using copy_into), but we don't actually need this conversion. > The only use of it was actually a wasted copy (in assembler_x86.cpp). > > Eliminated the use of placement new syntax via operator new with a holder > argument to copy a Resource object into a holder. This included runtime > verification that the size of the object is not greater than the size of > _relocbuf; we now do corresponding verification at compile-time. This also > included an incorrect attempt at a runtime check that the Relocation base > class would be at the same address as the derived class being constructed; we > now perform that check correctly. We also discuss in a comment the layout > assumption being made (that isn't required by the standard but is provided by > all supported compilers), and what to do if we encounter a compiler that > behaves differently. > > Eliminated the idiom of making a default-constructed holder and then > overwriting its held relocation with a newly constructed one, using the afore > mentioned (and eliminated) operator new. 
Instead, RelocationHolder now has a > factory function template (construct) for creating holders with a given > resource type, constructed using provided arguments. (The arguments are taken > as const-ref rather than using perfect forwarding, as the tools for the latter > are not (yet) approved for use in HotSpot. Using const-ref is good enough in > this case.) > > Describe and verify other assumptions being made, such as all Relocation > classes being trivially destructible. > > Testing: > mach5 tier1-5 > > Future work: > > * RelocationHolder::reloc isn't const-correct. Making it so will require > adjustment of some callers. I'll follow up with an RFE to address this. > > * Relocation classes have many virtual function overrides that are unmarked. > I'll follow up with an RFE to add "override" specifiers. > > Potential issue: The removal of RelocationHolder(Relocation*) might not work > for some platforms. I've tested on platforms supported by Oracle (where there > was only one (mistaken) use). There might be uses by other platforms. Good. You need second review. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11618 From kvn at openjdk.org Sun Dec 11 04:23:02 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sun, 11 Dec 2022 04:23:02 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v13] In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 14:29:27 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. 
>> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and have some reviews. Thank you very much. 
> > Quan Anh Mai has updated the pull request incrementally with seven additional commits since the last revision: > > - change julong to uint64_t > - uint > - various fixes > - add constexpr > - add constexpr > - add message to static_assert > - missing powerOfTwo.hpp My tier1-7 and stress testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9947 From smonteith at openjdk.org Mon Dec 12 09:43:27 2022 From: smonteith at openjdk.org (Stuart Monteith) Date: Mon, 12 Dec 2022 09:43:27 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v5] In-Reply-To: References: Message-ID: > The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. There are no scalar instructions that perform the equivalent functions on aarch64, instead the intrinsics can be implemented with vector instructions included in SVE2; expand with BDEP, compress with BEXT. > > Only the first lane of each vector will be used, two MOV instructions will move the inputs from GPRs into temporary vector registers, and another to do the reverse for the result. Autovectorization for this functionality is/will be implemented separately. 
> > Running on an SVE2 enabled system, I ran the following benchmarks: > > org.openjdk.bench.java.lang.Integers > org.openjdk.bench.java.lang.Longs > > The time for each operation fell to between 56% and 72% of the original run time: > > > Benchmark Result error Unit % against non-SVE2 > Integers.expand 2.106 0.011 us/op > Integers.expand-SVE 1.431 0.009 us/op 67.95% > Longs.expand 2.606 0.006 us/op > Longs.expand-SVE 1.46 0.003 us/op 56.02% > Integers.compress 1.982 0.004 us/op > Integers.compress-SVE 1.427 0.003 us/op 72.00% > Longs.compress 2.501 0.002 us/op > Longs.compress-SVE 1.441 0.003 us/op 57.62% > > > These methods can be tested specifically with: > `make test TEST="jtreg:compiler/intrinsics/TestBitShuffleOpers.java"` Stuart Monteith has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Merge branch 'openjdk:master' into JDK-8294194 - Merge branch 'openjdk:master' into JDK-8294194 - Merge branch 'openjdk:master' into JDK-8294194 - Update src/hotspot/cpu/aarch64/aarch64.ad Correct slight formatting error. Co-authored-by: Eric Liu - 8294194: Create intrinsics compress and expand The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. There are no scalar instructions that perform the equivalent functions on aarch64, instead the intrinsics can be implemented with vector instructions included in SVE2; expand with BDEP, compress with BEXT. Only the first lane of each vector will be used, two MOV instructions will move the inputs from GPRs into temporary vector registers, and another to do the reverse for the result. Autovectorization for this functionality is/will be implemented separately.
Running on an SVE2 enabled system, I ran the following benchmarks: org.openjdk.bench.java.lang.Integers org.openjdk.bench.java.lang.Longs The time for each operation reduced to 56% to 72% of the original run time: Benchmark Result error Unit % against non-SVE2 Integers.expand 2.106 0.011 us/op Integers.expand-SVE 1.431 0.009 us/op 67.95% Longs.expand 2.606 0.006 us/op Longs.expand-SVE 1.46 0.003 us/op 56.02% Integers.compress 1.982 0.004 us/op Integers.compress-SVE 1.427 0.003 us/op 72.00% Longs.compress 2.501 0.002 us/op Longs.compress-SVE 1.441 0.003 us/op 57.62% ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10537/files - new: https://git.openjdk.org/jdk/pull/10537/files/dee5d0f8..01bcc4e4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10537&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10537&range=03-04 Stats: 2088 lines in 83 files changed: 1396 ins; 477 del; 215 mod Patch: https://git.openjdk.org/jdk/pull/10537.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10537/head:pull/10537 PR: https://git.openjdk.org/jdk/pull/10537 From ayang at openjdk.org Mon Dec 12 11:20:01 2022 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Mon, 12 Dec 2022 11:20:01 GMT Subject: RFR: 8297235: ZGC: assert(regs[i] != regs[j]) failed: Multiple uses of register: rax [v3] In-Reply-To: <0fGmbCvEUHokqoFR0ndPwd7yA5I_SS-MlBneCPP3LqY=.c4b67306-e5bb-4af5-8875-adba9ea9d0f1@github.com> References: <0fGmbCvEUHokqoFR0ndPwd7yA5I_SS-MlBneCPP3LqY=.c4b67306-e5bb-4af5-8875-adba9ea9d0f1@github.com> Message-ID: On Thu, 8 Dec 2022 09:01:08 GMT, Axel Boldt-Christmas wrote: >> Tests java/util/stream/test/org/openjdk/tests/java/util/* with -XX:+UseZGC -Xcomp -XX:-TieredCompilation crashes with `assert(regs[i] != regs[j]) failed: Multiple uses of register: rax`. More specifically compilation of java.util.concurrent.ForkJoinTask::awaitDone. >> >> The reason seems to be that the compare value and the memory input ends up sharing a register. 
(Uses Unsafe CAS which CAS an object reference into a field of that object, `oldval: rax` and `mem: [rax+offset]`). The Z load barrier stub dispatch implementation requires that the reference and reference address occupy distinct registers. In the loadP nodes this is established by marking all but the memory TEMP which results in no sharing. >> >> This is not possible for the CompareAndSwapP / CompareAndExchangeP nodes as the compare value is an input node. >> >> The solution proposed here is less than ideal as it makes the CAS nodes require one extra TEMP register, which in the common case is unused. This puts unnecessary extra strain on the register allocation. The problem is that there is no way currently (that I can find) to express in .ad that a memory input must not share registers with a specific other input. >> >> There is an alternative solution for this specific crash which does not use a second TEMP register (see commit: cfd5ced4e97e986fc10c5a8721b543cd3101c58a). It accomplishes this by using the same trick that the aarch64 Z CAS node uses which is to specify the memory as indirect which results in the address being LEA into a register. However from what I can see this does not guarantee that the address and the reference do not share a register (`oldval: rax` and `mem: [rax]`). So it is theoretically broken (and so is the aarch64 implementation). >> >> It is unclear to me if there is ever a way for C2 to generate a CAS which compares the address of the field with its content. >> >> I call on anyone with more knowledge about `adlc` and `C2` for feedback. And specifically I want to open up a discussion with these points: >> * Is there some other way of expressing in the .ad file that a memory input should not share some register? >> * If not, is this a worthwhile RFE? As it seems to be a pattern used at least in other places in Z.
>> * Will the indirect input ever share a register with oldval and/or are the aarch64/riscv implementations broken because of this? How about ppc? >> >> Testing: linux-x64 zgc tagged tests tier 1-7 and some specific crashing tests with `-XX:+UseZGC -Xcomp -XX:-TieredCompilation` (in: java/util/stream/, java/util/concurrent/) > > Axel Boldt-Christmas has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Remove problem listed tests > - Merge remote-tracking branch 'upstream_jdk/master' into JDK-8297235 > - indirect zXChgP as well > - indirect alternative > - JDK-8297235: ZGC: assert(regs[i] != regs[j]) failed: Multiple uses of register: rax Marked as reviewed by ayang (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11410 From stefank at openjdk.org Mon Dec 12 11:50:48 2022 From: stefank at openjdk.org (Stefan Karlsson) Date: Mon, 12 Dec 2022 11:50:48 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs In-Reply-To: References: Message-ID: <35ANsb9QEdMZquD20YNzAk6POx4LQzSuBYYcs9AjzrE=.ebd5b981-5768-4295-af63-0db86887b9fb@github.com> On Fri, 9 Dec 2022 22:00:59 GMT, Kim Barrett wrote: > Please review this change to construction and copying of the Relocation and > RelocationHolder classes, to eliminate some questionable C++ usage. > > The array type for RelocationHandle::_relocbuf is changed from void* to char, > because using a char array for raw memory is countenanced by the standard, > while not so much for an array of void*. The desired alignment is maintained > via a union, since using alignas is not (yet) permitted in HotSpot code. > > There is also now a comment discussing the use of _relocbuf in more detail, > including some areas of continued sketchiness wrto standard conformance and > reliance on implementation dependent behavior. 
> > No longer use trivial copy and assignment for RelocationHolder, since that > isn't technically correct. The Relocation in the holder is not trivially > copyable, since it is polymorphic. It seemed to work in practice with the > supported compilers, but we shouldn't (and don't need to) rely on it. Instead > we have a new virtual function Relocation::copy_into that copies the most > derived object into the holder's _relocbuf using placement new. > > Eliminated the implict conversion constructor from Relocation to holder that > wordwise copied (to possibly beyond the end of) the Relocation into the > holder's _relocbuf. We could have implemented this more carefully with the > new approach (using copy_into), but we don't actually need this conversion. > The only use of it was actually a wasted copy (in assembler_x86.cpp). > > Eliminated the use of placement new syntax via operator new with a holder > argument to copy a Resource object into a holder. This included runtime > verification that the size of the object is not greater than the size of > _relocbuf; we now do corresponding verification at compile-time. This also > included an incorrect attempt at a runtime check that the Relocation base > class would be at the same address as the derived class being constructed; we > now perform that check correctly. We also discuss in a comment the layout > assumption being made (that isn't required by the standard but is provided by > all supported compilers), and what to do if we encounter a compiler that > behaves differently. > > Eliminated the idiom of making a default-constructed holder and then > overwriting its held relocation with a newly constructed one, using the afore > mentioned (and eliminated) operator new. Instead, RelocationHolder now has a > factory function template (construct) for creating holders with a given > resource type, constructed using provided arguments. 
(The arguments are taken > as const-ref rather than using perfect forwarding, as the tools for the latter > are not (yet) approved for use in HotSpot. Using const-ref is good enough in > this case.) > > Describe and verify other assumptions being made, such as all Relocation > classes being trivially destructible. > > Testing: > mach5 tier1-5 > > Future work: > > * RelocationHolder::reloc isn't const-correct. Making it so will require > adjustment of some callers. I'll follow up with an RFE to address this. > > * Relocation classes have many virtual function overrides that are unmarked. > I'll follow up with an RFE to add "override" specifiers. > > Potential issue: The removal of RelocationHolder(Relocation*) might not work > for some platforms. I've tested on platforms supported by Oracle (where there > was only one (mistaken) use). There might be uses by other platforms. Not a review, but a comment about the style for including system includes. src/hotspot/share/code/relocInfo.cpp line 39: > 37: #include "utilities/copy.hpp" > 38: #include > 39: #include Please add an empty blank line between HotSpot includes and system includes. We don't explicitly state this, but this was the style that we generated when IncludeDB was removed. src/hotspot/share/code/relocInfo.hpp line 33: > 31: #include "utilities/globalDefinitions.hpp" > 32: #include "utilities/macros.hpp" > 33: #include Add blank line ------------- Changes requested by stefank (Reviewer). PR: https://git.openjdk.org/jdk/pull/11618 From chagedorn at openjdk.org Mon Dec 12 12:07:51 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 12 Dec 2022 12:07:51 GMT Subject: RFR: 8296389: C2: PhaseCFG::convert_NeverBranch_to_Goto must handle both orders of successors [v4] In-Reply-To: References: Message-ID: On Fri, 9 Dec 2022 11:14:29 GMT, Emanuel Peter wrote: >> The code in `PhaseCFG::convert_NeverBranch_to_Goto` looks like it is ready to have `idx == 1`, but it is not. 
>> >> We would read `succ` from `_succs[1]`. >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L626 >> >> Then overwrite `_succs[0]` with `succ`, and shorten the array. >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L635-L636 >> >> And finally attempt to read `dead` from `_succs[0]`, where the dead block used to be, but was just overwritten. >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L645 >> >> **Solution** >> Read `dead` before overwriting it. I also made it more robust by going via the projections, and not assuming that the projections and successors are ordered equally (though that is probably guaranteed by the matching traversal). >> >> **Refactoring: added class id for NeverBranch** >> I also added the class id for NeverBranch, and replaced all `Op_NeverBranch` checks with `is_NeverBranch()`. >> >> **Why did we never hit this bug before?** >> Normal case: during matching, "succ" projection is added as output of NeverBranch before the "dead" projection leading to Halt. Thus, the outputs of NeverBranch are normally [[ "succ", "dead" ]], hence `idx == 0`. >> Details: During DFS, usually we go from Halt to NeverBranch. Then via Region/Loop, take backedge, and find the "succ" edge. We already have its inputs (NeverBranch), thus we can now post-visit the live edge, and attach it to the NeverBranch first. Later, once we have processed the whole infinite loop, we post-visit out of NeverBranch to the "dead" projection edge, which we attach second. >> >> Rare case: "dead" projection is first attached to NeverBranch, and "succ" projection is added second. We have [[ "dead", "succ" ]], hence `idx == 1`. >> We have a peeled infinite loop. The NeverBranch of the peeled iteration is first visited via the "dead" projection from HaltNode. 
Since the peeled iteration has no backedge, we do not visit the "succ" projection yet, but instead attach "dead" projection to HaltNode already once we are done visiting everything above. Later, we come from the peeled loop's NeverBranch exit, to the "succ" projection of the peeled iteration's NeverBranch, and attach the "succ" projection. >> >> ![image](https://user-images.githubusercontent.com/32593061/205299027-0e8e1d46-a49c-48c6-81b4-dfe83d8236ec.png) > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/share/opto/cfgnode.hpp > > Co-authored-by: Tobias Hartmann Nice analysis and tests! The fix looks good to me, too. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11481 From epeter at openjdk.org Mon Dec 12 12:14:24 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 12 Dec 2022 12:14:24 GMT Subject: RFR: 8296389: C2: PhaseCFG::convert_NeverBranch_to_Goto must handle both orders of successors [v4] In-Reply-To: References: Message-ID: <5EjcgAah9xF186LbZAn3Aj3d2FwuklfNYxNurEwMNqE=.937ce125-5039-448b-9406-565a3f128c83@github.com> On Mon, 12 Dec 2022 12:05:21 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Update src/hotspot/share/opto/cfgnode.hpp >> >> Co-authored-by: Tobias Hartmann > > Nice analysis and tests! The fix looks good to me, too. Thanks @chhagedorn @TobiHartmann for the reviews! 
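Side note on the ordering problem quoted above: a minimal sketch, assuming a toy two-slot successor array, of why reading `dead` after overwriting `_succs[0]` only goes wrong when `idx == 1`. The names `ToyBlock`, `dead_buggy`, and `dead_fixed` are hypothetical and are not taken from block.cpp; the real code works on `Block` objects and projections.

```cpp
#include <cassert>

// Hypothetical stand-in for a block with two successors (not HotSpot's Block).
struct ToyBlock {
  int succs[2];   // successor ids; one is live ("succ"), one is dead
  int num_succs;  // logical length of the successor array
};

// Order described as buggy: overwrite slot 0, shorten, then read `dead`.
// `idx` is the slot of the live successor.
inline int dead_buggy(ToyBlock& b, int idx) {
  int succ = b.succs[idx];
  b.succs[0] = succ;       // overwrite slot 0 with the live successor
  b.num_succs = 1;         // shorten the array
  return b.succs[1 - idx]; // idx == 1: slot 0 was just overwritten
}

// Order described as the fix: read `dead` before overwriting anything.
inline int dead_fixed(ToyBlock& b, int idx) {
  int succ = b.succs[idx];
  int dead = b.succs[1 - idx];
  b.succs[0] = succ;
  b.num_succs = 1;
  return dead;
}
```

With successors `[succ, dead]` (`idx == 0`) both variants return the dead block. With `[dead, succ]` (`idx == 1`) the first variant returns the live successor instead, which matches the failure mode described in the PR text.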
------------- PR: https://git.openjdk.org/jdk/pull/11481 From epeter at openjdk.org Mon Dec 12 12:14:27 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 12 Dec 2022 12:14:27 GMT Subject: Integrated: 8296389: C2: PhaseCFG::convert_NeverBranch_to_Goto must handle both orders of successors In-Reply-To: References: Message-ID: On Fri, 2 Dec 2022 12:48:31 GMT, Emanuel Peter wrote: > The code in `PhaseCFG::convert_NeverBranch_to_Goto` looks like it is ready to have `idx == 1`, but it is not. > > We would read `succ` from `_succs[1]`. > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L626 > > Then overwrite `_succs[0]` with `succ`, and shorten the array. > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L635-L636 > > And finally attempt to read `dead` from `_succs[0]`, where the dead block used to be, but was just overwritten. > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L645 > > **Solution** > Read `dead` before overwriting it. I also made it more robust by going via the projections, and not assuming that the projections and successors are ordered equally (though that is probably guaranteed by the matching traversal). > > **Refactoring: added class id for NeverBranch** > I also added the class id for NeverBranch, and replaced all `Op_NeverBranch` checks with `is_NeverBranch()`. > > **Why did we never hit this bug before?** > Normal case: during matching, "succ" projection is added as output of NeverBranch before the "dead" projection leading to Halt. Thus, the outputs of NeverBranch are normally [[ "succ", "dead" ]], hence `idx == 0`. > Details: During DFS, usually we go from Halt to NeverBranch. Then via Region/Loop, take backedge, and find the "succ" edge. 
We already have its inputs (NeverBranch), thus we can now post-visit the live edge, and attach it to the NeverBranch first. Later, once we have processed the whole infinite loop, we post-visit out of NeverBranch to the "dead" projection edge, which we attach second. > > Rare case: "dead" projection is first attached to NeverBranch, and "succ" projection is added second. We have [[ "dead", "succ" ]], hence `idx == 1`. > We have a peeled infinite loop. The NeverBranch of the peeled iteration is first visited via the "dead" projection from HaltNode. Since the peeled iteration has no backedge, we do not visit the "succ" projection yet, but instead attach "dead" projection to HaltNode already once we are done visiting everything above. Later, we come from the peeled loop's NeverBranch exit, to the "succ" projection of the peeled iteration's NeverBranch, and attach the "succ" projection. > > ![image](https://user-images.githubusercontent.com/32593061/205299027-0e8e1d46-a49c-48c6-81b4-dfe83d8236ec.png) This pull request has now been integrated. Changeset: fabda246 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/fabda246960cfdfff13c5a87de53d97169248172 Stats: 188 lines in 11 files changed: 167 ins; 2 del; 19 mod 8296389: C2: PhaseCFG::convert_NeverBranch_to_Goto must handle both orders of successors Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/11481 From kbarrett at openjdk.org Mon Dec 12 17:56:11 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Mon, 12 Dec 2022 17:56:11 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v2] In-Reply-To: References: Message-ID: <7udKKpQ1XY438VTwGKXFuUoiRC4-PRiEQOxLbpEdZV8=.a4b60b9f-90f7-4dde-aaa9-90d3658f1ded@github.com> > Please review this change to construction and copying of the Relocation and > RelocationHolder classes, to eliminate some questionable C++ usage. 
> > The array type for RelocationHolder::_relocbuf is changed from void* to char, > because using a char array for raw memory is countenanced by the standard, > while not so much for an array of void*. The desired alignment is maintained > via a union, since using alignas is not (yet) permitted in HotSpot code. > > There is also now a comment discussing the use of _relocbuf in more detail, > including some areas of continued sketchiness with respect to standard conformance and > reliance on implementation-dependent behavior. > > No longer use trivial copy and assignment for RelocationHolder, since that > isn't technically correct. The Relocation in the holder is not trivially > copyable, since it is polymorphic. It seemed to work in practice with the > supported compilers, but we shouldn't (and don't need to) rely on it. Instead > we have a new virtual function Relocation::copy_into that copies the most > derived object into the holder's _relocbuf using placement new. > > Eliminated the implicit conversion constructor from Relocation to holder that > wordwise copied (to possibly beyond the end of) the Relocation into the > holder's _relocbuf. We could have implemented this more carefully with the > new approach (using copy_into), but we don't actually need this conversion. > The only use of it was actually a wasted copy (in assembler_x86.cpp). > > Eliminated the use of placement new syntax via operator new with a holder > argument to copy a Resource object into a holder. This included runtime > verification that the size of the object is not greater than the size of > _relocbuf; we now do corresponding verification at compile-time. This also > included an incorrect attempt at a runtime check that the Relocation base > class would be at the same address as the derived class being constructed; we > now perform that check correctly.
We also discuss in a comment the layout > assumption being made (that isn't required by the standard but is provided by > all supported compilers), and what to do if we encounter a compiler that > behaves differently. > > Eliminated the idiom of making a default-constructed holder and then > overwriting its held relocation with a newly constructed one, using the > aforementioned (and eliminated) operator new. Instead, RelocationHolder now has a > factory function template (construct) for creating holders with a given > resource type, constructed using provided arguments. (The arguments are taken > as const-ref rather than using perfect forwarding, as the tools for the latter > are not (yet) approved for use in HotSpot. Using const-ref is good enough in > this case.) > > Describe and verify other assumptions being made, such as all Relocation > classes being trivially destructible. > > Testing: > mach5 tier1-5 > > Future work: > > * RelocationHolder::reloc isn't const-correct. Making it so will require > adjustment of some callers. I'll follow up with an RFE to address this. > > * Relocation classes have many virtual function overrides that are unmarked. > I'll follow up with an RFE to add "override" specifiers. > > Potential issue: The removal of RelocationHolder(Relocation*) might not work > for some platforms. I've tested on platforms supported by Oracle (where there > was only one (mistaken) use). There might be uses by other platforms.
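For readers following the description above, here is a minimal sketch of the general technique it names: a char buffer aligned via a union, a virtual `copy_into` that uses placement new, compile-time size checks, and a factory taking its argument by const-ref. All names (`ToyReloc`, `ToyOffsetReloc`, `ToyHolder`, `emplace_copy`, `kSize`) are hypothetical; this is not the code from relocInfo.hpp. Like the original, it assumes the base class subobject sits at the start of the derived object, which the standard does not guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <type_traits>

class ToyHolder;

// Polymorphic base, so not trivially copyable; deliberately has no
// user-declared destructor so that it stays trivially destructible.
class ToyReloc {
 public:
  virtual int kind() const { return 0; }
  // Copies the most derived object into the holder's buffer.
  virtual void copy_into(ToyHolder& holder) const;
};

class ToyHolder {
  static constexpr std::size_t kSize = 4 * sizeof(void*);
  // Raw storage as char[]; the union member supplies alignment
  // without using alignas.
  union {
    char  _relocbuf[kSize];
    void* _align;
  };

 public:
  ToyHolder() { ::new (static_cast<void*>(_relocbuf)) ToyReloc(); }
  ToyHolder(const ToyHolder& other) { other.reloc()->copy_into(*this); }
  ToyHolder& operator=(const ToyHolder& other) {
    if (this != &other) other.reloc()->copy_into(*this);
    return *this;
  }

  // Assumption: the ToyReloc base lives at offset 0 of whatever was placed.
  ToyReloc* reloc() const {
    return reinterpret_cast<ToyReloc*>(const_cast<char*>(_relocbuf));
  }

  // Placement-new a copy of src over the buffer; checks happen at compile time.
  template <typename T>
  void emplace_copy(const T& src) {
    static_assert(std::is_base_of<ToyReloc, T>::value, "must derive from ToyReloc");
    static_assert(sizeof(T) <= kSize, "relocation too large for buffer");
    static_assert(std::is_trivially_destructible<T>::value, "no destructor is ever run");
    ::new (static_cast<void*>(_relocbuf)) T(src);
  }

  // Factory taking its argument by const-ref (no perfect forwarding).
  template <typename T, typename Arg>
  static ToyHolder construct(const Arg& arg) {
    ToyHolder holder;
    holder.emplace_copy(T(arg));
    return holder;
  }
};

inline void ToyReloc::copy_into(ToyHolder& holder) const {
  holder.emplace_copy(*this);
}

class ToyOffsetReloc : public ToyReloc {
  int _offset;
 public:
  explicit ToyOffsetReloc(int offset) : _offset(offset) {}
  int kind() const override { return 1; }
  int offset() const { return _offset; }
  void copy_into(ToyHolder& holder) const override { holder.emplace_copy(*this); }
};
```

A holder would be created with `ToyHolder::construct<ToyOffsetReloc>(42)`; copying it goes through the virtual `copy_into`, so the derived type and its fields survive the copy without relying on a bytewise copy of a polymorphic object.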
Kim Barrett has updated the pull request incrementally with one additional commit since the last revision: blank lines in include blocks ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11618/files - new: https://git.openjdk.org/jdk/pull/11618/files/254b094a..2b764714 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11618&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11618&range=00-01 Stats: 2 lines in 2 files changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11618.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11618/head:pull/11618 PR: https://git.openjdk.org/jdk/pull/11618 From kbarrett at openjdk.org Mon Dec 12 17:56:12 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Mon, 12 Dec 2022 17:56:12 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v2] In-Reply-To: <35ANsb9QEdMZquD20YNzAk6POx4LQzSuBYYcs9AjzrE=.ebd5b981-5768-4295-af63-0db86887b9fb@github.com> References: <35ANsb9QEdMZquD20YNzAk6POx4LQzSuBYYcs9AjzrE=.ebd5b981-5768-4295-af63-0db86887b9fb@github.com> Message-ID: On Mon, 12 Dec 2022 11:46:33 GMT, Stefan Karlsson wrote: >> Kim Barrett has updated the pull request incrementally with one additional commit since the last revision: >> >> blank lines in include blocks > > src/hotspot/share/code/relocInfo.cpp line 39: > >> 37: #include "utilities/copy.hpp" >> 38: #include >> 39: #include > > Please add an empty blank line between HotSpot includes and system includes. We don't explicitly state this, but this was the style that we generated when IncludeDB was removed. Done. > src/hotspot/share/code/relocInfo.hpp line 33: > >> 31: #include "utilities/globalDefinitions.hpp" >> 32: #include "utilities/macros.hpp" >> 33: #include > > Add blank line Done. 
------------- PR: https://git.openjdk.org/jdk/pull/11618 From epeter at openjdk.org Tue Dec 13 07:56:32 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 13 Dec 2022 07:56:32 GMT Subject: RFR: 8296318: use-def assert: special case undetected loops nested in infinite loops Message-ID: `PhaseCFG::verify` checks that the nodes scheduled into the blocks have "use-after-def". https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1376 As long as the control flow has no loops, this should always hold. We make sure to not check the assert if the block-head is a `LoopNode`, and the use `n` is a `Phi`, since we may have inputs from the backedge, which would be "use-before-def" in the block-scheduling order, but that is to be expected. https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1364 **Problem** This assumes that all loops are properly detected. If a loop was not detected, the block-head may be only a `RegionNode`, and the assert can fail (see regression test). Why are some loops not detected? During `PhaseIdealLoop::build_loop_tree`, we detect all loops, but do not always attach infinite loops with their sub-loops to the loop tree, and then the loop-head `Regions` are not converted to `LoopNodes` in `PhaseIdealLoop::beautify_loops`. This behaviour is expected, see also https://github.com/openjdk/jdk/pull/11473. Thus, the assert fires, but it should not. **Solution** Add special casing to the assert: also accept the case where the `Region` is in an infinite subgraph and the use we are looking at is a `Phi` (the def could be a value from the backedge).
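The relaxed condition described in the message above can be sketched as a small predicate. This is a toy model under stated assumptions: `ToyHead` and `toy_use_ok` are hypothetical names, positions are plain indices within one block, and the real check in `PhaseCFG::verify` operates on HotSpot nodes and blocks.

```cpp
#include <cassert>

// Hypothetical classification of a block's head node (not HotSpot types).
enum class ToyHead { Region, Loop, RegionInInfiniteSubgraph };

// Returns true when a use at position use_pos is acceptable to the toy
// verifier; def_pos is the position of its def in the same block.
inline bool toy_use_ok(int def_pos, int use_pos, bool use_is_phi, ToyHead head) {
  const bool def_before_use = def_pos < use_pos;
  // Detected loop: a Phi may legitimately take an input from the backedge.
  const bool is_loop = use_is_phi && head == ToyHead::Loop;
  // Undetected loop nested in an infinite loop: same exemption for a Phi.
  const bool undetected_loop =
      use_is_phi && head == ToyHead::RegionInInfiniteSubgraph;
  return def_before_use || is_loop || undetected_loop;
}
```

Note that the exemption only applies to a `Phi` use; any other use-before-def under a plain `Region` head is still rejected.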
------------- Commit messages: - made assert more precise - 8296318: use-def assert: special case undetected loops nested in infinite loops Changes: https://git.openjdk.org/jdk/pull/11642/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11642&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296318 Stats: 134 lines in 5 files changed: 109 ins; 23 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11642.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11642/head:pull/11642 PR: https://git.openjdk.org/jdk/pull/11642 From epeter at openjdk.org Tue Dec 13 08:02:45 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 13 Dec 2022 08:02:45 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears Message-ID: We recently removed `Opaque2` nodes in [JDK-8294540](https://bugs.openjdk.org/browse/JDK-8294540). `Opaque2` nodes prevented some optimizations during loop-opts. The original idea was to prevent the use of both the un-incremented and incremented value of a loop phi after the loop, to reduce register pressure. But `Opaque2` also had the effect that the limit of the loop would not be optimized, which meant that the iv-value (entry value of phi) in post loop would never collapse (either to constant or TOP), but always remain a range. Now that `Opaque2` is gone, it can happen that when the main-loop disappears, the limit collapses. The zero-trip guard of the post-loop would be false, but does not collapse because of the `OpaqueZeroTripGuard`. The post-loop can half-collapse, leaving an inconsistent graph below the zero-trip guard if. **Solution** Have `OpaqueZeroTripGuardMainLoop` for main loop zero-trip guard, and `OpaqueZeroTripGuardPostLoop` for post-loop zero trip guard. Let `OpaqueZeroTripGuardPostLoop` remove itself once it cannot find the main-loop above it. 
We have these opaque nodes there to prevent collapsing of the zero-trip guards as long as the limits may still change, but after the main-loop is removed, no unrolling is done anymore, so the limit of the post-loop cannot change anymore, hence it is safe to remove the opaque node there. An alternative approach was to let the main-loop remove the opaque node of the post-loop's zero-trip guard. But that does not work reliably, as the main-loop may get removed during PhaseCCP, and the main-loop is simply removed as "useless". Hence the LoopNode of the main-loop does not have a chance to detect its death during IGVN. ------------- Commit messages: - tab to whitespace - code style improvements - 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears Changes: https://git.openjdk.org/jdk20/pull/22/files Webrev: https://webrevs.openjdk.org/?repo=jdk20&pr=22&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298176 Stats: 293 lines in 13 files changed: 268 ins; 5 del; 20 mod Patch: https://git.openjdk.org/jdk20/pull/22.diff Fetch: git fetch https://git.openjdk.org/jdk20 pull/22/head:pull/22 PR: https://git.openjdk.org/jdk20/pull/22 From tholenstein at openjdk.org Tue Dec 13 09:02:06 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 13 Dec 2022 09:02:06 GMT Subject: [jdk20] RFR: JDK-8289748: C2 compiled code crashes with SIGFPE with -XX:+StressLCM and -XX:+StressGCM In-Reply-To: <54udVksrhWVx9V1jzxmEM4fNZvZkN_jc3yxR0_8xn-w=.61f08459-3006-4b72-b59e-a3f1bec5f7af@github.com> References: <54udVksrhWVx9V1jzxmEM4fNZvZkN_jc3yxR0_8xn-w=.61f08459-3006-4b72-b59e-a3f1bec5f7af@github.com> Message-ID: On Fri, 9 Dec 2022 11:12:35 GMT, Tobias Holenstein wrote: > # Problem > > `203 CountedLoop` is the post loop of a strip-mined inner loop. In the loop we have `198 ModI` = 1 / tripcount, where tripcount can never be zero in this loop. The tripcount `204 Phi` is further pinned with a `224 CastII`. 
> > [image: before] > > `IdealLoopTree::do_remove_empty_loop(...)` now replaces the tripcounter `204 Phi` with the value that it will have on the last iteration: `80 Phi` (exact limit) - `Const 1` (stride). Here is where the mistake happens: > `222 If` is the zero trip guard of the post loop and prevents the post loop from being executed when the divisor of `198 ModI` would be zero. BUT tripcounter `204 Phi` is removed without checking if it is pinned to the If with a CastNode. Therefore the `198 ModI` now floats above the `222 If`, where it can now be a modulo zero operation. > > [image: fail] > > # Solution > The solution is to check if the tripcount is pinned with a CastII that carries a dependency. If yes, we create a new CastII to pin final_iv (exact_limit - stride): > [image: fix] Closing and re-targeting for JDK21. Found further issues during testing, e.g. the attached test case fails with additional flags `-XX:+StressGCM -XX:+StressIGVN -XX:+StressCCP -XX:StressSeed=301091046`. It seems like we not only falsely remove `CastII` nodes in `IdealLoopTree::do_remove_empty_loop(...)` but also in other places. This takes more time to investigate and fix properly. It is too risky to fix for JDK20 and it is not a regression introduced in JDK20. ------------- PR: https://git.openjdk.org/jdk20/pull/8 From tholenstein at openjdk.org Tue Dec 13 09:02:06 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 13 Dec 2022 09:02:06 GMT Subject: [jdk20] Withdrawn: JDK-8289748: C2 compiled code crashes with SIGFPE with -XX:+StressLCM and -XX:+StressGCM In-Reply-To: <54udVksrhWVx9V1jzxmEM4fNZvZkN_jc3yxR0_8xn-w=.61f08459-3006-4b72-b59e-a3f1bec5f7af@github.com> References: <54udVksrhWVx9V1jzxmEM4fNZvZkN_jc3yxR0_8xn-w=.61f08459-3006-4b72-b59e-a3f1bec5f7af@github.com> Message-ID: On Fri, 9 Dec 2022 11:12:35 GMT, Tobias Holenstein wrote: > # Problem > > `203 CountedLoop` is the post loop of a strip-mined inner loop.
In the loop we have `198 ModI` = 1 / tripcount, where tripcount can never be zero in this loop. The tripcount `204 Phi` is further pinned with a `224 CastII`. > > [image: before] > > `IdealLoopTree::do_remove_empty_loop(...)` now replaces the tripcounter `204 Phi` with the value that it will have on the last iteration: `80 Phi` (exact limit) - `Const 1` (stride). Here is where the mistake happens: > `222 If` is the zero trip guard of the post loop and prevents the post loop from being executed when the divisor of `198 ModI` would be zero. BUT tripcounter `204 Phi` is removed without checking if it is pinned to the If with a CastNode. Therefore the `198 ModI` now floats above the `222 If`, where it can now be a modulo zero operation. > > [image: fail] > > # Solution > The solution is to check if the tripcount is pinned with a CastII that carries a dependency. If yes, we create a new CastII to pin final_iv (exact_limit - stride): > [image: fix] This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk20/pull/8 From rcastanedalo at openjdk.org Tue Dec 13 09:03:43 2022 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 13 Dec 2022 09:03:43 GMT Subject: RFR: 8297235: ZGC: assert(regs[i] != regs[j]) failed: Multiple uses of register: rax [v3] In-Reply-To: <0fGmbCvEUHokqoFR0ndPwd7yA5I_SS-MlBneCPP3LqY=.c4b67306-e5bb-4af5-8875-adba9ea9d0f1@github.com> References: <0fGmbCvEUHokqoFR0ndPwd7yA5I_SS-MlBneCPP3LqY=.c4b67306-e5bb-4af5-8875-adba9ea9d0f1@github.com> Message-ID: On Thu, 8 Dec 2022 09:01:08 GMT, Axel Boldt-Christmas wrote: >> Tests java/util/stream/test/org/openjdk/tests/java/util/* with -XX:+UseZGC -Xcomp -XX:-TieredCompilation crash with `assert(regs[i] != regs[j]) failed: Multiple uses of register: rax`. More specifically, the crash occurs during compilation of java.util.concurrent.ForkJoinTask::awaitDone. >> >> The reason seems to be that the compare value and the memory input end up sharing a register.
(It uses an Unsafe CAS which CASes an object reference into a field of that object: `oldval: rax` and `mem: [rax+offset]`.) The Z load barrier stub dispatch implementation requires that the reference and reference address occupy distinct registers. In the loadP nodes this is established by marking all but the memory TEMP which results in no sharing. >> >> This is not possible for the CompareAndSwapP / CompareAndExchangeP nodes as the compare value is an input node. >> >> The solution proposed here is less than ideal as it makes the CAS nodes require one extra TEMP register, which in the common case is unused. This puts unnecessary extra strain on the register allocation. The problem is that there is no way currently (that I can find) to express in .ad that a memory input must not share registers with a specific other input. >> >> There is an alternative solution for this specific crash which does not use a second TEMP register (see commit: cfd5ced4e97e986fc10c5a8721b543cd3101c58a). It accomplishes this by using the same trick that the aarch64 Z CAS node uses, which is to specify the memory as indirect, which results in the address being LEA'd into a register. However, from what I can see this does not guarantee that the address and the reference do not share a register (`oldval: rax` and `mem: [rax]`). So it is theoretically broken (and so is the aarch64 implementation). >> >> It is unclear to me if there is ever a way for C2 to generate a CAS which compares the address of the field with its content. >> >> I call on anyone with more knowledge about `adlc` and `C2` for feedback. And specifically I want to open up a discussion with these points: >> * Is there some other way of expressing in the .ad file that a memory input should not share some register? >> * If not, is this a worthwhile RFE? As it seems to be a pattern used at least in other places in Z.
>> * Will the indirect input ever share a register with oldval and/or are the aarch64/riscv implementations broken because of this? How about ppc? >> >> Testing: linux-x64 zgc tagged tests tier 1-7 and some specific crashing tests with `-XX:+UseZGC -Xcomp -XX:-TieredCompilation` (in: java/util/stream/, java/util/concurrent/) > > Axel Boldt-Christmas has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Remove problem listed tests > - Merge remote-tracking branch 'upstream_jdk/master' into JDK-8297235 > - indirect zXChgP as well > - indirect alternative > - JDK-8297235: ZGC: assert(regs[i] != regs[j]) failed: Multiple uses of register: rax I agree with @fisk's analysis above: the proposed solution should be safe as long as the first CompareAndSwap operand (`mem`) has a non-zero offset (which should at least be the case in the failing Unsafe-based patterns): the register allocator will then treat `mem` and `oldval` as two distinct, interfering values and assign them different registers. I attached a [minimal reproducer](https://bugs.openjdk.org/secure/attachment/102007/Reproducer.java) to the JBS issue, feel free to include it in this PR as a test case if you think it adds value. I do not think there is a general way to express the constraint you want in .ad files, but I am not an expert in this area, maybe someone at Intel could comment on this (@sviswa7, @jatin-bhateja?). I also do not have a feeling for what would be the benefit vs. cost of implementing such construct. An alternative approach could be to enforce the constraint at the C2 IR level, by adding some kind of pseudo-node redefining the input to the first `CompareAndSwap` operand so that it always interferes with `oldval`. 
Regarding the impact on other architectures, it seems they all follow the solution proposed here, so they should be as safe as in this case, that is, as long as C2 does not generate a CAS comparing the address of the field with its content. I cannot think how C2 could generate such pattern - which of course is not a guarantee that it will never do it ;). ------------- Marked as reviewed by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/11410 From chagedorn at openjdk.org Tue Dec 13 10:01:05 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 13 Dec 2022 10:01:05 GMT Subject: RFR: 8296318: use-def assert: special case undetected loops nested in infinite loops In-Reply-To: References: Message-ID: On Tue, 13 Dec 2022 07:49:47 GMT, Emanuel Peter wrote: > `PhaseCFG::verify` checks that the nodes scheduled into the blocks have "use-after-def". > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1376 > > As long as the control flow has no loops, this should always hold. > We make sure to not check the assert if the block-head is a `LoopNode`, and the use `n` is a `Phi`, since we may have inputs from the backedge, which would be "use-before-def" in the block-scheduling order, but that is to be expected. > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1364 > > **Problem** > This assumes that all loops are properly detected, if a loop was not detected the block-head may only be a `RegionNode`, and the assert can fail (see regression test). Why are not all loops detected? > During `PhaseIdealLoop::build_loop_tree`, we detect all loops, but not always attach infinite loops with their sub-loops to the loop tree, and then the loop-head `Regions` are not converted to `LoopNodes` in `PhaseIdealLoop::beautify_loops`. > This behaviour is expected, see also https://github.com/openjdk/jdk/pull/11473. 
> Thus, the assert fires, but it should not. > > **Solution** > Add special casing to assert: also accept if the `Region` is in an infinite subgraph, and we are looking at a `Phi` as use (the def could be values from the backedge). That looks reasonable! src/hotspot/share/opto/block.cpp line 1383: > 1381: assert(block->find_node(def) < j || > 1382: is_loop || > 1383: (block->head()->as_Region()->is_in_infinite_subgraph() && n->is_Phi()), I suggest to swap the order of `n->is_Phi()` and `block->head()->as_Region()->is_in_infinite_subgraph()`. But this should be a rare case anyways, so it does not really matter that much. src/hotspot/share/opto/cfgnode.cpp line 409: > 407: // (no path to root except through false NeverBranch exit) > 408: // worklist is directly used for the traversal > 409: bool RegionNode::are_all_in_infinite_subgraph(Unique_Node_List &worklist) { Suggestion: bool RegionNode::are_all_nodes_in_infinite_subgraph(Unique_Node_List& worklist) { ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11642 From epeter at openjdk.org Tue Dec 13 10:29:00 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 13 Dec 2022 10:29:00 GMT Subject: RFR: 8296318: use-def assert: special case undetected loops nested in infinite loops In-Reply-To: References: Message-ID: On Tue, 13 Dec 2022 09:56:26 GMT, Christian Hagedorn wrote: >> `PhaseCFG::verify` checks that the nodes scheduled into the blocks have "use-after-def". >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1376 >> >> As long as the control flow has no loops, this should always hold. >> We make sure to not check the assert if the block-head is a `LoopNode`, and the use `n` is a `Phi`, since we may have inputs from the backedge, which would be "use-before-def" in the block-scheduling order, but that is to be expected. 
>> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1364 >> >> **Problem** >> This assumes that all loops are properly detected, if a loop was not detected the block-head may only be a `RegionNode`, and the assert can fail (see regression test). Why are not all loops detected? >> During `PhaseIdealLoop::build_loop_tree`, we detect all loops, but not always attach infinite loops with their sub-loops to the loop tree, and then the loop-head `Regions` are not converted to `LoopNodes` in `PhaseIdealLoop::beautify_loops`. >> This behaviour is expected, see also https://github.com/openjdk/jdk/pull/11473. >> Thus, the assert fires, but it should not. >> >> **Solution** >> Add special casing to assert: also accept if the `Region` is in an infinite subgraph, and we are looking at a `Phi` as use (the def could be values from the backedge). > > src/hotspot/share/opto/block.cpp line 1383: > >> 1381: assert(block->find_node(def) < j || >> 1382: is_loop || >> 1383: (block->head()->as_Region()->is_in_infinite_subgraph() && n->is_Phi()), > > I suggest to swap the order of `n->is_Phi()` and `block->head()->as_Region()->is_in_infinite_subgraph()`. But this should be a rare case anyways, so it does not really matter that much. ? > src/hotspot/share/opto/cfgnode.cpp line 409: > >> 407: // (no path to root except through false NeverBranch exit) >> 408: // worklist is directly used for the traversal >> 409: bool RegionNode::are_all_in_infinite_subgraph(Unique_Node_List &worklist) { > > Suggestion: > > bool RegionNode::are_all_nodes_in_infinite_subgraph(Unique_Node_List& worklist) { ? 
------------- PR: https://git.openjdk.org/jdk/pull/11642 From epeter at openjdk.org Tue Dec 13 10:32:48 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 13 Dec 2022 10:32:48 GMT Subject: RFR: 8296318: use-def assert: special case undetected loops nested in infinite loops [v2] In-Reply-To: References: Message-ID: > `PhaseCFG::verify` checks that the nodes scheduled into the blocks have "use-after-def". > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1376 > > As long as the control flow has no loops, this should always hold. > We make sure to not check the assert if the block-head is a `LoopNode`, and the use `n` is a `Phi`, since we may have inputs from the backedge, which would be "use-before-def" in the block-scheduling order, but that is to be expected. > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1364 > > **Problem** > This assumes that all loops are properly detected, if a loop was not detected the block-head may only be a `RegionNode`, and the assert can fail (see regression test). Why are not all loops detected? > During `PhaseIdealLoop::build_loop_tree`, we detect all loops, but not always attach infinite loops with their sub-loops to the loop tree, and then the loop-head `Regions` are not converted to `LoopNodes` in `PhaseIdealLoop::beautify_loops`. > This behaviour is expected, see also https://github.com/openjdk/jdk/pull/11473. > Thus, the assert fires, but it should not. > > **Solution** > Add special casing to assert: also accept if the `Region` is in an infinite subgraph, and we are looking at a `Phi` as use (the def could be values from the backedge). 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: review suggestions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11642/files - new: https://git.openjdk.org/jdk/pull/11642/files/72a2f0b1..abb1416c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11642&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11642&range=00-01 Stats: 5 lines in 4 files changed: 0 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/11642.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11642/head:pull/11642 PR: https://git.openjdk.org/jdk/pull/11642 From stefank at openjdk.org Tue Dec 13 12:41:43 2022 From: stefank at openjdk.org (Stefan Karlsson) Date: Tue, 13 Dec 2022 12:41:43 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v2] In-Reply-To: References: <35ANsb9QEdMZquD20YNzAk6POx4LQzSuBYYcs9AjzrE=.ebd5b981-5768-4295-af63-0db86887b9fb@github.com> Message-ID: On Mon, 12 Dec 2022 17:51:50 GMT, Kim Barrett wrote: >> src/hotspot/share/code/relocInfo.cpp line 39: >> >>> 37: #include "utilities/copy.hpp" >>> 38: #include >>> 39: #include >> >> Please add an empty blank line between HotSpot includes and system includes. We don't explicitly state this, but this was the style that we generated when IncludeDB was removed. > > Done. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/11618 From eastigeevich at openjdk.org Tue Dec 13 13:48:41 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 13 Dec 2022 13:48:41 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v16] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 16:38:01 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). 
Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 17 commits: > > - minor api refactoring: start_scope and roll_back instead of position and set_position > - buffer() returns const array > - cleanup, rename > - warning fix > - add test for buffer grow > - adding jtreg test for CompressedSparseDataReadStream impl > - align java impl to cpp impl > - rewrite the SparseDataWriteStream not to use _curr_byte > - introduce and call flush() excplicitly, add the gtest > - minor renaming. adding encoding examples table > - ... and 7 more: https://git.openjdk.org/jdk/compare/1e468320...e9269942 Hotspot/share/code looks good to me. Just a few minor changes. src/hotspot/share/code/compressedStream.hpp line 121: > 119: > 120: public: > 121: CompressedSparseData(int position = 0) { As it is a one-argument constructor, let it be `explicit CompressedSparseData`. src/hotspot/share/code/compressedStream.hpp line 201: > 199: // Start grouped data. Return a byte offset position in the stream where grouped data begins > 200: int start_scope() { > 201: align(); // a side effect! I think we don't need the comment here. src/hotspot/share/code/compressedStream.hpp line 209: > 207: _position = pos; > 208: _bit_position = 0; > 209: assert(_position < _size, "set_position is only used for rollback"); Should we change the assert and move it to the beginning of the function? assert(pos <= _position, "new position must be a rollback of the current position"); ------------- Changes requested by eastigeevich (Committer).
PR: https://git.openjdk.org/jdk/pull/10025 From roland at openjdk.org Tue Dec 13 14:56:14 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 13 Dec 2022 14:56:14 GMT Subject: [jdk20] RFR: 8298520: C2: assert(found_opaque == res) failed: wrong pattern Message-ID: The assert fires because CountedLoopNode::is_canonical_loop_entry() finds an Opaque1 where it expects an OpaqueZeroTripGuard. That Opaque1 guards the CountedLoopEnd of the pre loop: is_canonical_loop_entry() should have stopped at the zero trip guard but walked past it. The reason for that is the call for skip_predicates(): when CountedLoopNode::skip_predicates_from_entry() encounters an If node that has lost a projection, it assumes it's a predicate and moves to the next one. In this case, the zero trip guard only has one projection (the one that connects it through predicates to the loop head), because igvn is in the process of updating a dead part of the graph. To fix this, I propose that: 1- CountedLoopNode::is_canonical_loop_entry() fails when after predicates it's at a CountedLoopEnd. 2- CountedLoopNode::skip_predicates_from_entry() stops at a zero trip guard (identified by a condition that uses a OpaqueZeroTripGuard) Either 1- or 2- are good enough to fix this particular graph but I propose doing both as making this logic more robust feels safer. I don't provide a test case as this seems to only happen when igvn processes a dying subgraph in a very specific order. 
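As a rough illustration of proposal 2-, the following sketch models the walk over the Ifs above a loop entry. It is a toy under stated assumptions, not the HotSpot code: `ToyIf`, `looks_like_predicate_old`, `looks_like_predicate_new` and `skip_predicates` are hypothetical names, and "lost a projection" is reduced to a projection count of 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an If node on the way up from the loop entry.
struct ToyIf {
  bool is_predicate;           // really a predicate
  int  projections;            // 2 normally, 1 while igvn is removing a dead branch
  bool uses_zero_trip_opaque;  // condition goes through an OpaqueZeroTripGuard
};

// Old heuristic: an If that has lost a projection is assumed to be a predicate.
inline bool looks_like_predicate_old(const ToyIf& n) {
  return n.is_predicate || n.projections == 1;
}

// Proposed rule: never treat a zero trip guard as a predicate.
inline bool looks_like_predicate_new(const ToyIf& n) {
  if (n.uses_zero_trip_opaque) return false;
  return looks_like_predicate_old(n);
}

// chain[0] is closest to the loop head. Returns the index of the first If
// that is not skipped, or chain.size() if everything was skipped.
template <typename IsPredicate>
std::size_t skip_predicates(const std::vector<ToyIf>& chain, IsPredicate is_predicate) {
  std::size_t i = 0;
  while (i < chain.size() && is_predicate(chain[i])) ++i;
  return i;
}
```

For a chain made of one predicate, a zero trip guard that has temporarily lost a projection, and an unrelated If, the old rule walks past the guard and the new rule stops at it.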
------------- Commit messages: - fix Changes: https://git.openjdk.org/jdk20/pull/24/files Webrev: https://webrevs.openjdk.org/?repo=jdk20&pr=24&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298520 Stats: 21 lines in 2 files changed: 19 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk20/pull/24.diff Fetch: git fetch https://git.openjdk.org/jdk20 pull/24/head:pull/24 PR: https://git.openjdk.org/jdk20/pull/24 From roland at openjdk.org Tue Dec 13 15:12:54 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 13 Dec 2022 15:12:54 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: References: Message-ID: <7uNA2GAnnKm_0z7ELnYiKYAgeVMAkE2FKYYbbw7Ouj8=.8c0d4894-78eb-42f1-80d9-228e5791526e@github.com> On Tue, 13 Dec 2022 07:08:59 GMT, Emanuel Peter wrote: > We recently removed `Opaque2` nodes in [JDK-8294540](https://bugs.openjdk.org/browse/JDK-8294540). `Opaque2` nodes prevented some optimizations during loop-opts. The original idea was to prevent the use of both the un-incremented and incremented value of a loop phi after the loop, to reduce register pressure. But `Opaque2` also had the effect that the limit of the loop would not be optimized, which meant that the iv-value (entry value of phi) in post loop would never collapse (either to constant or TOP), but always remain a range. Now that `Opaque2` is gone, it can happen that when the main-loop disappears, the limit collapses. The zero-trip guard of the post-loop would be false, but does not collapse because of the `OpaqueZeroTripGuard`. The post-loop can half-collapse, leaving an inconsistent graph below the zero-trip guard if. > > **Solution** > Have `OpaqueZeroTripGuardMainLoop` for main loop zero-trip guard, and `OpaqueZeroTripGuardPostLoop` for post-loop zero trip guard. Let `OpaqueZeroTripGuardPostLoop` remove itself once it cannot find the main-loop above it. 
We have these opaque nodes there to prevent collapsing of the zero-trip guards as long as the limits may still change, but after the main-loop is removed, no unrolling is done anymore, so the limit of the post-loop cannot change anymore, hence it is safe to remove the opaque node there. > > An alternative approach was to let the main-loop remove the opaque node of the post-loop's zero-trip guard. But that does not work reliably, as the main-loop may get removed during PhaseCCP, and the main-loop is simply removed as "useless". Hence the LoopNode of the main-loop does not have a chance to detect its death during IGVN. OpaqueZeroTripGuardPostLoopNode::Identity() only runs if it's enqueued in the igvn workqueue, that is, in the general case, if its input changes. I'm not sure it's guaranteed that when the main loop loses its backedge, for instance, the OpaqueZeroTripGuardPostLoopNode ends up being processed. The two feel like separate events. So I wonder if that fix is robust enough. How does the main loop disappear during CCP? Is it still there but its backedge is removed? Or is the entire loop removed? If that's the case, is that because of predicates? ------------- PR: https://git.openjdk.org/jdk20/pull/22 From roland at openjdk.org Tue Dec 13 15:16:57 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 13 Dec 2022 15:16:57 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: References: Message-ID: On Tue, 13 Dec 2022 07:08:59 GMT, Emanuel Peter wrote: > We recently removed `Opaque2` nodes in [JDK-8294540](https://bugs.openjdk.org/browse/JDK-8294540). `Opaque2` nodes prevented some optimizations during loop-opts. The original idea was to prevent the use of both the un-incremented and incremented value of a loop phi after the loop, to reduce register pressure.
But `Opaque2` also had the effect that the limit of the loop would not be optimized, which meant that the iv-value (entry value of phi) in post loop would never collapse (either to constant or TOP), but always remain a range. Now that `Opaque2` is gone, it can happen that when the main-loop disappears, the limit collapses. The zero-trip guard of the post-loop would be false, but does not collapse because of the `OpaqueZeroTripGuard`. The post-loop can half-collapse, leaving an inconsistent graph below the zero-trip guard if. > > **Solution** > Have `OpaqueZeroTripGuardMainLoop` for main loop zero-trip guard, and `OpaqueZeroTripGuardPostLoop` for post-loop zero trip guard. Let `OpaqueZeroTripGuardPostLoop` remove itself once it cannot find the main-loop above it. We have these opaque nodes there to prevent collapsing of the zero-trip guards as long as the limits may still change, but after the main-loop is removed, no unrolling is done anymore, so the limit of the post-loop cannot change anymore, hence it is safe to remove the opaque node there. > > An alternative approach was to let the main-loop remove the opaque node of the post-loop's zero-trip guard. But that does not work reliably, as the main-loop may get removed during PhaseCCP, and the main-loop is simply removed as "useless". Hence the LoopNode of the main-loop does not have a chance to detect its death during IGVN. It also feels like it's a problem that CCP calls Compile::disconnect_useless_nodes() and modifies the graph without giving igvn a chance to act on some graph change. Maybe that part should be revisited.
------------- PR: https://git.openjdk.org/jdk20/pull/22 From aboldtch at openjdk.org Tue Dec 13 15:45:56 2022 From: aboldtch at openjdk.org (Axel Boldt-Christmas) Date: Tue, 13 Dec 2022 15:45:56 GMT Subject: Integrated: 8297235: ZGC: assert(regs[i] != regs[j]) failed: Multiple uses of register: rax In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 09:50:11 GMT, Axel Boldt-Christmas wrote: > Tests java/util/stream/test/org/openjdk/tests/java/util/* with -XX:+UseZGC -Xcomp -XX:-TieredCompilation crash with `assert(regs[i] != regs[j]) failed: Multiple uses of register: rax`. More specifically, the compilation of java.util.concurrent.ForkJoinTask::awaitDone. > > The reason seems to be that the compare value and the memory input end up sharing a register. (Uses Unsafe CAS which CASes an object reference into a field of that object, `oldval: rax` and `mem: [rax+offset]`). The Z load barrier stub dispatch implementation requires that the reference and reference address occupy distinct registers. In the loadP nodes this is established by marking all but the memory TEMP which results in no sharing. > > This is not possible for the CompareAndSwapP / CompareAndExchangeP nodes as the compare value is an input node. > > The solution proposed here is less than ideal as it makes the CAS nodes require one extra TEMP register, which in the common case is unused. This puts unnecessary extra strain on the register allocation. The problem is that there is no way currently (that I can find) to express in .ad that a memory input must not share registers with a specific other input. > > There is an alternative solution for this specific crash which does not use a second TEMP register (see commit: cfd5ced4e97e986fc10c5a8721b543cd3101c58a). It accomplishes this by using the same trick that the aarch64 Z CAS node uses which is to specify the memory as indirect which results in the address being LEA into a register.
However from what I can see this does not guarantee that the address and the reference do not share a register (`oldval: rax` and `mem: [rax]`). So it is theoretically broken, (and so is the aarch64 implementation). > > It is unclear to me if there is ever a way for C2 to generate a CAS which compares the address of the field with its content. > > I call on anyone with more knowledge about `adlc` and `C2` for feedback. And specifically I want to open up a discussion with these points: > * Is there some other way of expressing in the .ad file that a memory input should not share some register? > * If not, is this a worthwhile RFE? As it seems to be a pattern used at least in other places in Z. > * Will the indirect input ever share a register with oldval and/or are the aarch64/riscv implementations broken because of this? How about ppc? > > Testing: linux-x64 zgc tagged tests tier 1-7 and some specific crashing tests with `-XX:+UseZGC -Xcomp -XX:-TieredCompilation` (in: java/util/stream/, java/util/concurrent/) This pull request has now been integrated. Changeset: 042b7062 Author: Axel Boldt-Christmas URL: https://git.openjdk.org/jdk/commit/042b7062f19b313f31b228bd96d2a74cc1165ab9 Stats: 121 lines in 2 files changed: 21 ins; 90 del; 10 mod 8297235: ZGC: assert(regs[i] != regs[j]) failed: Multiple uses of register: rax Reviewed-by: eosterlund, ayang, rcastanedalo ------------- PR: https://git.openjdk.org/jdk/pull/11410 From thartmann at openjdk.org Tue Dec 13 16:34:01 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 13 Dec 2022 16:34:01 GMT Subject: [jdk20] RFR: 8298520: C2: assert(found_opaque == res) failed: wrong pattern In-Reply-To: References: Message-ID: On Tue, 13 Dec 2022 14:49:31 GMT, Roland Westrelin wrote: > The assert fires because CountedLoopNode::is_canonical_loop_entry() > finds an Opaque1 where it expects an OpaqueZeroTripGuard.
That Opaque1 > guards the CountedLoopEnd of the pre loop: is_canonical_loop_entry() > should have stopped at the zero trip guard but walked past it. The > reason for that is the call for skip_predicates(): when > CountedLoopNode::skip_predicates_from_entry() encounters an If node > that has lost a projection, it assumes it's a predicate and moves to > the next one. In this case, the zero trip guard only has one > projection (the one that connects it through predicates to the loop > head), because igvn is in the process of updating a dead part of the > graph. > > To fix this, I propose that: > > 1- CountedLoopNode::is_canonical_loop_entry() fails when after > predicates it's at a CountedLoopEnd. > > 2- CountedLoopNode::skip_predicates_from_entry() stops at a zero trip > guard (identified by a condition that uses a OpaqueZeroTripGuard) > > Either 1- or 2- are good enough to fix this particular graph but I > propose doing both as making this logic more robust feels safer. > > I don't provide a test case as this seems to only happen when igvn > processes a dying subgraph in a very specific order. Looks reasonable to me. I submitted testing and will report back once it passed. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk20/pull/24 From chagedorn at openjdk.org Tue Dec 13 16:34:01 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 13 Dec 2022 16:34:01 GMT Subject: [jdk20] RFR: 8298520: C2: assert(found_opaque == res) failed: wrong pattern In-Reply-To: References: Message-ID: On Tue, 13 Dec 2022 14:49:31 GMT, Roland Westrelin wrote: > The assert fires because CountedLoopNode::is_canonical_loop_entry() > finds an Opaque1 where it expects an OpaqueZeroTripGuard. That Opaque1 > guards the CountedLoopEnd of the pre loop: is_canonical_loop_entry() > should have stopped at the zero trip guard but walked past it. 
The > reason for that is the call for skip_predicates(): when > CountedLoopNode::skip_predicates_from_entry() encounters an If node > that has lost a projection, it assumes it's a predicate and moves to > the next one. In this case, the zero trip guard only has one > projection (the one that connects it through predicates to the loop > head), because igvn is in the process of updating a dead part of the > graph. > > To fix this, I propose that: > > 1- CountedLoopNode::is_canonical_loop_entry() fails when after > predicates it's at a CountedLoopEnd. > > 2- CountedLoopNode::skip_predicates_from_entry() stops at a zero trip > guard (identified by a condition that uses a OpaqueZeroTripGuard) > > Either 1- or 2- are good enough to fix this particular graph but I > propose doing both as making this logic more robust feels safer. > > I don't provide a test case as this seems to only happen when igvn > processes a dying subgraph in a very specific order. Fix 2) seems right as the name of the method suggests to only skip actual predicates. But I agree that it makes sense to add fix 1) as well to make it more robust. `skip_predicates_from_entry()` has become quite hard to read but it should be replaced in the redesign of the skeleton predicates anyways. Looks good to me! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk20/pull/24 From epeter at openjdk.org Tue Dec 13 18:03:09 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 13 Dec 2022 18:03:09 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: References: Message-ID: On Tue, 13 Dec 2022 15:14:56 GMT, Roland Westrelin wrote: >> We recently removed `Opaque2` nodes in [JDK-8294540](https://bugs.openjdk.org/browse/JDK-8294540). `Opaque2` nodes prevented some optimizations during loop-opts. 
The original idea was to prevent the use of both the un-incremented and incremented value of a loop phi after the loop, to reduce register pressure. But `Opaque2` also had the effect that the limit of the loop would not be optimized, which meant that the iv-value (entry value of phi) in post loop would never collapse (either to constant or TOP), but always remain a range. Now that `Opaque2` is gone, it can happen that when the main-loop disappears, the limit collapses. The zero-trip guard of the post-loop would be false, but does not collapse because of the `OpaqueZeroTripGuard`. The post-loop can half-collapse, leaving an inconsistent graph below the zero-trip guard if. >> >> **Solution** >> Have `OpaqueZeroTripGuardMainLoop` for main loop zero-trip guard, and `OpaqueZeroTripGuardPostLoop` for post-loop zero trip guard. Let `OpaqueZeroTripGuardPostLoop` remove itself once it cannot find the main-loop above it. We have these opaque nodes there to prevent collapsing of the zero-trip guards as long as the limits may still change, but after the main-loop is removed, no unrolling is done anymore, so the limit of the post-loop cannot change anymore, hence it is safe to remove the opaque node there. >> >> An alternative approach was to let the main-loop remove the opaque node of the post-loop's zero-trip guard. But that does not work reliably, as the main-loop may get removed during PhaseCCP, and the main-loop is simply removed as "useless". Hence the LoopNode of the main-loop does not have a chance to detect its death during IGVN. > > It also feels like it's a problem that CCP calls Compile::disconnect_useless_nodes() and modifies the graph without giving igvn a chance to act on some graph change. Maybe that part should be revisited. @rwestrel Thanks for the questions. Only doing the check and removal in `OpaqueZeroTripGuardPostLoopNode::Identity()` would be unsafe if the inputs to that opaque node do not change after the main-loop disappears. Can that ever happen?
I'm currently not sure. I will do some research on this and come back to you. ------------- PR: https://git.openjdk.org/jdk20/pull/22 From kvn at openjdk.org Tue Dec 13 19:25:19 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 13 Dec 2022 19:25:19 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: References: Message-ID: <2V7ma_fjnRUKET3GAoLgWAUStOprLKR4xdFVstXqffw=.7874f2a2-f125-4616-b580-bba3b5d036b1@github.com> On Tue, 13 Dec 2022 07:08:59 GMT, Emanuel Peter wrote: > We recently removed `Opaque2` nodes in [JDK-8294540](https://bugs.openjdk.org/browse/JDK-8294540). `Opaque2` nodes prevented some optimizations during loop-opts. The original idea was to prevent the use of both the un-incremented and incremented value of a loop phi after the loop, to reduce register pressure. But `Opaque2` also had the effect that the limit of the loop would not be optimized, which meant that the iv-value (entry value of phi) in post loop would never collapse (either to constant or TOP), but always remain a range. Now that `Opaque2` is gone, it can happen that when the main-loop disappears, the limit collapses. The zero-trip guard of the post-loop would be false, but does not collapse because of the `OpaqueZeroTripGuard`. The post-loop can half-collapse, leaving an inconsistent graph below the zero-trip guard if. > > **Solution** > Have `OpaqueZeroTripGuardMainLoop` for main loop zero-trip guard, and `OpaqueZeroTripGuardPostLoop` for post-loop zero trip guard. Let `OpaqueZeroTripGuardPostLoop` remove itself once it cannot find the main-loop above it. We have these opaque nodes there to prevent collapsing of the zero-trip guards as long as the limits may still change, but after the main-loop is removed, no unrolling is done anymore, so the limit of the post-loop cannot change anymore, hence it is safe to remove the opaque node there. 
> > An alternative approach was to let the main-loop remove the opaque node of the post-loop's zero-trip guard. But that does not work reliably, as the main-loop may get removed during PhaseCCP, and the main-loop is simply removed as "useless". Hence the LoopNode of the main-loop does not have a chance to detect its death during IGVN. "The zero-trip guard of the post-loop would be false, but does not collapse because of the OpaqueZeroTripGuard. " Can we just look only for this condition (guard is false) instead of looking through graph to see if main loop disappeared? ------------- PR: https://git.openjdk.org/jdk20/pull/22 From kvn at openjdk.org Tue Dec 13 19:31:54 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 13 Dec 2022 19:31:54 GMT Subject: RFR: 8296318: use-def assert: special case undetected loops nested in infinite loops [v2] In-Reply-To: References: Message-ID: On Tue, 13 Dec 2022 10:32:48 GMT, Emanuel Peter wrote: >> `PhaseCFG::verify` checks that the nodes scheduled into the blocks have "use-after-def". >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1376 >> >> As long as the control flow has no loops, this should always hold. >> We make sure to not check the assert if the block-head is a `LoopNode`, and the use `n` is a `Phi`, since we may have inputs from the backedge, which would be "use-before-def" in the block-scheduling order, but that is to be expected. >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1364 >> >> **Problem** >> This assumes that all loops are properly detected, if a loop was not detected the block-head may only be a `RegionNode`, and the assert can fail (see regression test). Why are not all loops detected? 
>> During `PhaseIdealLoop::build_loop_tree`, we detect all loops, but not always attach infinite loops with their sub-loops to the loop tree, and then the loop-head `Regions` are not converted to `LoopNodes` in `PhaseIdealLoop::beautify_loops`. >> This behaviour is expected, see also https://github.com/openjdk/jdk/pull/11473. >> Thus, the assert fires, but it should not. >> >> **Solution** >> Add special casing to assert: also accept if the `Region` is in an infinite subgraph, and we are looking at a `Phi` as use (the def could be values from the backedge). > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > review suggestions Please, merge latest JDK so that GHA builds pass. ------------- PR: https://git.openjdk.org/jdk/pull/11642 From kvn at openjdk.org Tue Dec 13 19:35:01 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 13 Dec 2022 19:35:01 GMT Subject: RFR: 8296318: use-def assert: special case undetected loops nested in infinite loops [v2] In-Reply-To: References: Message-ID: <2S9xqX10oHFHCeyfNG-944L-4ghIU3xUf9adHyORge8=.28bdc434-dc18-4a18-a173-9fdc6a53a90f@github.com> On Tue, 13 Dec 2022 10:32:48 GMT, Emanuel Peter wrote: >> `PhaseCFG::verify` checks that the nodes scheduled into the blocks have "use-after-def". >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1376 >> >> As long as the control flow has no loops, this should always hold. >> We make sure to not check the assert if the block-head is a `LoopNode`, and the use `n` is a `Phi`, since we may have inputs from the backedge, which would be "use-before-def" in the block-scheduling order, but that is to be expected. 
>> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1364 >> >> **Problem** >> This assumes that all loops are properly detected, if a loop was not detected the block-head may only be a `RegionNode`, and the assert can fail (see regression test). Why are not all loops detected? >> During `PhaseIdealLoop::build_loop_tree`, we detect all loops, but not always attach infinite loops with their sub-loops to the loop tree, and then the loop-head `Regions` are not converted to `LoopNodes` in `PhaseIdealLoop::beautify_loops`. >> This behaviour is expected, see also https://github.com/openjdk/jdk/pull/11473. >> Thus, the assert fires, but it should not. >> >> **Solution** >> Add special casing to assert: also accept if the `Region` is in an infinite subgraph, and we are looking at a `Phi` as use (the def could be values from the backedge). > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > review suggestions src/hotspot/share/opto/cfgnode.cpp line 399: > 397: // Is this region in an infinite subgraph? > 398: // (no path to root except through false NeverBranch exit) > 399: bool RegionNode::is_in_infinite_subgraph() { It is used only in assert(). Should it be under `#ifdef ASSERT` ? 
------------- PR: https://git.openjdk.org/jdk/pull/11642 From jrose at openjdk.org Wed Dec 14 03:39:55 2022 From: jrose at openjdk.org (John R Rose) Date: Wed, 14 Dec 2022 03:39:55 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v2] In-Reply-To: <7udKKpQ1XY438VTwGKXFuUoiRC4-PRiEQOxLbpEdZV8=.a4b60b9f-90f7-4dde-aaa9-90d3658f1ded@github.com> References: <7udKKpQ1XY438VTwGKXFuUoiRC4-PRiEQOxLbpEdZV8=.a4b60b9f-90f7-4dde-aaa9-90d3658f1ded@github.com> Message-ID: On Mon, 12 Dec 2022 17:56:11 GMT, Kim Barrett wrote: >> Please review this change to construction and copying of the Relocation and >> RelocationHolder classes, to eliminate some questionable C++ usage. >> >> The array type for RelocationHandle::_relocbuf is changed from void* to char, >> because using a char array for raw memory is countenanced by the standard, >> while not so much for an array of void*. The desired alignment is maintained >> via a union, since using alignas is not (yet) permitted in HotSpot code. >> >> There is also now a comment discussing the use of _relocbuf in more detail, >> including some areas of continued sketchiness with respect to standard conformance and >> reliance on implementation dependent behavior. >> >> No longer use trivial copy and assignment for RelocationHolder, since that >> isn't technically correct. The Relocation in the holder is not trivially >> copyable, since it is polymorphic. It seemed to work in practice with the >> supported compilers, but we shouldn't (and don't need to) rely on it. Instead >> we have a new virtual function Relocation::copy_into that copies the most >> derived object into the holder's _relocbuf using placement new. >> >> Eliminated the implicit conversion constructor from Relocation to holder that >> wordwise copied (to possibly beyond the end of) the Relocation into the >> holder's _relocbuf. We could have implemented this more carefully with the >> new approach (using copy_into), but we don't actually need this conversion.
>> The only use of it was actually a wasted copy (in assembler_x86.cpp). >> >> Eliminated the use of placement new syntax via operator new with a holder >> argument to copy a Resource object into a holder. This included runtime >> verification that the size of the object is not greater than the size of >> _relocbuf; we now do corresponding verification at compile-time. This also >> included an incorrect attempt at a runtime check that the Relocation base >> class would be at the same address as the derived class being constructed; we >> now perform that check correctly. We also discuss in a comment the layout >> assumption being made (that isn't required by the standard but is provided by >> all supported compilers), and what to do if we encounter a compiler that >> behaves differently. >> >> Eliminated the idiom of making a default-constructed holder and then >> overwriting its held relocation with a newly constructed one, using the >> aforementioned (and eliminated) operator new. Instead, RelocationHolder now has a >> factory function template (construct) for creating holders with a given >> resource type, constructed using provided arguments. (The arguments are taken >> as const-ref rather than using perfect forwarding, as the tools for the latter >> are not (yet) approved for use in HotSpot. Using const-ref is good enough in >> this case.) >> >> Describe and verify other assumptions being made, such as all Relocation >> classes being trivially destructible. >> >> Testing: >> mach5 tier1-5 >> >> Future work: >> >> * RelocationHolder::reloc isn't const-correct. Making it so will require >> adjustment of some callers. I'll follow up with an RFE to address this. >> >> * Relocation classes have many virtual function overrides that are unmarked. >> I'll follow up with an RFE to add "override" specifiers. >> >> Potential issue: The removal of RelocationHolder(Relocation*) might not work >> for some platforms.
I've tested on platforms supported by Oracle (where there >> was only one (mistaken) use). There might be uses by other platforms. > > Kim Barrett has updated the pull request incrementally with one additional commit since the last revision: > > blank lines in include blocks I just saw the query from Dean Long in the bug comments. Here is some background information. This "new relocations" work, a long time ago, was an attempt to add some object oriented structure to the access to relocations. At the time, I didn't trust the resource allocation mechanism, but I wanted a way to decode "real C++ objects" from a relocation stream. (Being "real" means supporting virtual methods and a strongly typed API that includes such methods.) The requirement was to be able to iterate quickly over a compressed stream of reloc info and expose the relevant parts as temporary objects to the client. (This is still a good design pattern; nowadays we have the more successful field stream API, and perhaps there is more to come.) In order to make iteration over reloc info fast, I decided to allow those temporary objects to unpack themselves (from the compressed reloc info data) into a buffer on the stack rather than on the heap. The theory is that as you iterate over the stream, you are using just one cache line to access each record, and you get the benefit of a "real" object API. I think sometimes this is called the design pattern of "flyweight objects". At the time, back in the day, I knew that I was coloring outside the lines of C++, but I knew (or hoped) that C++ compilers were not smart enough to detect my transgression beyond the language. Those days are gone now. I am grateful for the current cleanup. I hope that it can preserve the basic idea of a "flyweight object", which is an object whose storage is inside a fixed cache line associated with a stream, and whose lifetime ends when the stream is advanced, to the next flyweight object in line.
I hope this because I think that compressed streams are here to stay in the JVM, and yet it is desirable to present the contents of such streams as (temporarily unpacked flyweight) C++ objects, real objects. Here's another peek beyond the horizon, according to me (and this is just me now): I think that the best representation for JVM metadata is (often, maybe even always) something that is stream-oriented and position independent and pointer-free. For example, when we boot up a JVM image, if that image has JVM metadata that was "cooked" before startup, we need to quickly warm up that data so it works in the current JVM process. In such a situation, today's standard C++ data, which is rich in C pointers (which are machine addresses in full machine words that name positions in virtual memory mappings) has a cost; you need to either make sure that the existing pointer values (as seen during CDS generation or Leyden condensation) are still valid, or else you need to edit all such pointer values so they point to the right place in the new address space. (This is what relocation records do in C++. Not the same as this PR.) This means that today's standard C++ data is actually expensive to build ahead of time, before bootstrap. It would be better if the JVM just accessed blocks of data (probably compressed) which use offsets or other non-pointer means to tie themselves together, like today's reloc info or field info streams. If you buy all that, then there is a permanent place for "flyweight objects" in the JVM, because when you unpack some sort of position-independent data (pointer free data) you probably want to present it to the client as a "real C++ object". But in many cases, it is OK if that object has a very short lifetime, and correspondingly constrained allocation, to the workflow of the stream abstraction that is stepping through the pointer-free data.
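For readers who have not seen the pattern, a minimal sketch of such a flyweight stream might look like the following. Everything here is hypothetical and only illustrates the idea: `Record`, `Holder`, `SmallRecord`, `PairRecord`, `RecordStream` and the two-tag byte encoding are invented for this example and are not the Relocation classes or the reloc info format. The sketch also uses `alignas` and keeps a pointer to the constructed object, which is a simplification compared to what is discussed for HotSpot in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <type_traits>

class Holder;

// Hypothetical polymorphic record, standing in for a Relocation-like base.
struct Record {
  virtual int value() const = 0;
  // Copies the most derived object into h's buffer.
  virtual void copy_into(Holder& h) const = 0;
};

// Fixed-size, suitably aligned storage for one record at a time.
class Holder {
  static constexpr std::size_t kBufSize = 4 * sizeof(void*);
  alignas(std::max_align_t) unsigned char _buf[kBufSize];
  const Record* _rec = nullptr;  // points into _buf once a record is emplaced

public:
  Holder() = default;
  Holder(const Holder& other) { if (other._rec != nullptr) other._rec->copy_into(*this); }
  Holder& operator=(const Holder& other) {
    if (this != &other) {
      _rec = nullptr;
      if (other._rec != nullptr) other._rec->copy_into(*this);
    }
    return *this;
  }

  // Constructs a T in the buffer. A previous record is simply overwritten,
  // which is only acceptable because records are trivially destructible.
  template <typename T, typename... Args>
  void emplace(const Args&... args) {
    static_assert(std::is_base_of<Record, T>::value, "T must derive from Record");
    static_assert(sizeof(T) <= kBufSize, "record does not fit in the holder");
    static_assert(alignof(T) <= alignof(std::max_align_t), "record is over-aligned");
    static_assert(std::is_trivially_destructible<T>::value, "no destructor is ever run");
    _rec = ::new (static_cast<void*>(_buf)) T(args...);
  }

  const Record* get() const { return _rec; }
};

struct SmallRecord final : Record {
  int _v;
  explicit SmallRecord(int v) : _v(v) {}
  int value() const override { return _v; }
  void copy_into(Holder& h) const override { h.emplace<SmallRecord>(*this); }
};

struct PairRecord final : Record {
  int _a, _b;
  PairRecord(int a, int b) : _a(a), _b(b) {}
  int value() const override { return _a + _b; }
  void copy_into(Holder& h) const override { h.emplace<PairRecord>(*this); }
};

// Toy pointer-free encoding: tag 1 + one byte -> SmallRecord,
// tag 2 + two bytes -> PairRecord.
class RecordStream {
  const std::uint8_t* _p;
  const std::uint8_t* _end;
  Holder _current;  // the one buffer every flyweight is unpacked into

public:
  RecordStream(const std::uint8_t* data, std::size_t len) : _p(data), _end(data + len) {}
  const Holder& current() const { return _current; }

  // Unpacks the next record; the returned pointer is valid until the next call.
  const Record* next() {
    if (_p == _end) return nullptr;
    std::uint8_t tag = *_p++;
    if (tag == 1 && _end - _p >= 1) {
      _current.emplace<SmallRecord>(int(_p[0]));
      _p += 1;
    } else if (tag == 2 && _end - _p >= 2) {
      _current.emplace<PairRecord>(int(_p[0]), int(_p[1]));
      _p += 2;
    } else {
      _p = _end;  // malformed or truncated input: stop iterating
      return nullptr;
    }
    return _current.get();
  }
};
```

A client that needs a record to outlive the next call to `next()` copies the holder, which goes through the virtual `copy_into` rather than a bytewise copy; that is the same division of labour the PR description gives for `Relocation::copy_into`.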
And that is another reason I'm grateful that Kim has read the C++ manual and figured out a way to do the flyweight objects of reloc info within the C++ rules. ------------- PR: https://git.openjdk.org/jdk/pull/11618 From epeter at openjdk.org Wed Dec 14 05:55:04 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 14 Dec 2022 05:55:04 GMT Subject: RFR: 8296318: use-def assert: special case undetected loops nested in infinite loops [v2] In-Reply-To: <2S9xqX10oHFHCeyfNG-944L-4ghIU3xUf9adHyORge8=.28bdc434-dc18-4a18-a173-9fdc6a53a90f@github.com> References: <2S9xqX10oHFHCeyfNG-944L-4ghIU3xUf9adHyORge8=.28bdc434-dc18-4a18-a173-9fdc6a53a90f@github.com> Message-ID: On Tue, 13 Dec 2022 19:32:32 GMT, Vladimir Kozlov wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> review suggestions > > src/hotspot/share/opto/cfgnode.cpp line 399: > >> 397: // Is this region in an infinite subgraph? >> 398: // (no path to root except through false NeverBranch exit) >> 399: bool RegionNode::is_in_infinite_subgraph() { > > It is used only in assert(). Should it be under `#ifdef ASSERT` ? ? ------------- PR: https://git.openjdk.org/jdk/pull/11642 From epeter at openjdk.org Wed Dec 14 05:59:38 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 14 Dec 2022 05:59:38 GMT Subject: RFR: 8296318: use-def assert: special case undetected loops nested in infinite loops [v3] In-Reply-To: References: Message-ID: > `PhaseCFG::verify` checks that the nodes scheduled into the blocks have "use-after-def". > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1376 > > As long as the control flow has no loops, this should always hold. 
> We make sure to not check the assert if the block-head is a `LoopNode`, and the use `n` is a `Phi`, since we may have inputs from the backedge, which would be "use-before-def" in the block-scheduling order, but that is to be expected. > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1364 > > **Problem** > This assumes that all loops are properly detected, if a loop was not detected the block-head may only be a `RegionNode`, and the assert can fail (see regression test). Why are not all loops detected? > During `PhaseIdealLoop::build_loop_tree`, we detect all loops, but not always attach infinite loops with their sub-loops to the loop tree, and then the loop-head `Regions` are not converted to `LoopNodes` in `PhaseIdealLoop::beautify_loops`. > This behaviour is expected, see also https://github.com/openjdk/jdk/pull/11473. > Thus, the assert fires, but it should not. > > **Solution** > Add special casing to assert: also accept if the `Region` is in an infinite subgraph, and we are looking at a `Phi` as use (the def could be values from the backedge). Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains five additional commits since the last revision: - infinite_subgraph check made ASSERT only - Merge branch 'master' into JDK-8296318 - review suggestions - made assert more precise - 8296318: use-def assert: special case undetected loops nested in infinite loops ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11642/files - new: https://git.openjdk.org/jdk/pull/11642/files/abb1416c..6bc867b5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11642&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11642&range=01-02 Stats: 2274 lines in 70 files changed: 688 ins; 1403 del; 183 mod Patch: https://git.openjdk.org/jdk/pull/11642.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11642/head:pull/11642 PR: https://git.openjdk.org/jdk/pull/11642 From epeter at openjdk.org Wed Dec 14 06:05:10 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 14 Dec 2022 06:05:10 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: <2V7ma_fjnRUKET3GAoLgWAUStOprLKR4xdFVstXqffw=.7874f2a2-f125-4616-b580-bba3b5d036b1@github.com> References: <2V7ma_fjnRUKET3GAoLgWAUStOprLKR4xdFVstXqffw=.7874f2a2-f125-4616-b580-bba3b5d036b1@github.com> Message-ID: On Tue, 13 Dec 2022 19:23:08 GMT, Vladimir Kozlov wrote: >> We recently removed `Opaque2` nodes in [JDK-8294540](https://bugs.openjdk.org/browse/JDK-8294540). `Opaque2` nodes prevented some optimizations during loop-opts. The original idea was to prevent the use of both the un-incremented and incremented value of a loop phi after the loop, to reduce register pressure. But `Opaque2` also had the effect that the limit of the loop would not be optimized, which meant that the iv-value (entry value of phi) in post loop would never collapse (either to constant or TOP), but always remain a range. Now that `Opaque2` is gone, it can happen that when the main-loop disappears, the limit collapses. 
The zero-trip guard of the post-loop would be false, but does not collapse because of the `OpaqueZeroTripGuard`. The post-loop can half-collapse, leaving an inconsistent graph below the zero-trip guard if. >> >> **Solution** >> Have `OpaqueZeroTripGuardMainLoop` for main loop zero-trip guard, and `OpaqueZeroTripGuardPostLoop` for post-loop zero trip guard. Let `OpaqueZeroTripGuardPostLoop` remove itself once it cannot find the main-loop above it. We have these opaque nodes there to prevent collapsing of the zero-trip guards as long as the limits may still change, but after the main-loop is removed, no unrolling is done anymore, so the limit of the post-loop cannot change anymore, hence it is safe to remove the opaque node there. >> >> An alternative approach was to let the main-loop remove the opaque node of the post-loop's zero-trip guard. But that does not work reliably, as the main-loop may get removed during PhaseCCP, and the main-loop is simply removed as "useless". Hence the LoopNode of the main-loop does not have a chance to detect its death during IGVN. > > "The zero-trip guard of the post-loop would be false, but does not collapse because of the OpaqueZeroTripGuard. " > Can we just look only for this condition (guard is false) instead of looking through graph to see if main loop disappeared? @vnkozlov The problem is that the zero-trip guard can be false for a while, because the post loop has no work. But then unrolling changes how much work the main loop does, and can give the post-loop work to do after all. That is the reason we have this opaque node in the first place. Without that opaque node, the condition could be false, we would remove the post loop. But then unrolling wants to give the post-loop some work, but it has already been optimized away. 
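The point about unrolling can be illustrated with a hand-written model of the pre/main/post split. This is a sketch under assumptions, not what C2 emits: the class `PrePostLoopDemo`, the single-iteration pre-loop and the explicit `unroll` parameter are invented for the example. It shows that whether the post-loop guard is taken depends on the unroll factor of the main loop, so a guard that looks false at one point can become true after further unrolling.

```java
final class PrePostLoopDemo {

    /** Returns {sum, iterations executed by the post-loop} for a loop over [0, a.length). */
    static int[] sumWithUnroll(int[] a, int unroll) {
        int n = a.length;
        int sum = 0;
        int i = 0;

        // Pre-loop: modelled here as at most one iteration (assumption for the sketch).
        if (i < n) { sum += a[i]; i++; }

        // Main-loop zero-trip guard, then the main loop stepping by 'unroll'.
        if (i + unroll <= n) {
            for (; i + unroll <= n; i += unroll) {
                for (int k = 0; k < unroll; k++) sum += a[i + k];   // unrolled body
            }
        }

        // Post-loop zero-trip guard: only taken if the main loop left work behind.
        int postIters = 0;
        if (i < n) {
            for (; i < n; i++) { sum += a[i]; postIters++; }
        }
        return new int[] {sum, postIters};
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(sumWithUnroll(a, 1)[1]);   // post-loop iterations without unrolling
        System.out.println(sumWithUnroll(a, 4)[1]);   // post-loop iterations with unroll factor 4
    }
}
```

For ten elements the post-loop does nothing with an unroll factor of 1 but runs one iteration with a factor of 4, while the sum stays the same. Removing the post-loop while its guard merely looked false would therefore lose that iteration.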
------------- PR: https://git.openjdk.org/jdk20/pull/22 From epeter at openjdk.org Wed Dec 14 06:21:47 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 14 Dec 2022 06:21:47 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears [v2] In-Reply-To: References: Message-ID: > We recently removed `Opaque2` nodes in [JDK-8294540](https://bugs.openjdk.org/browse/JDK-8294540). `Opaque2` nodes prevented some optimizations during loop-opts. The original idea was to prevent the use of both the un-incremented and incremented value of a loop phi after the loop, to reduce register pressure. But `Opaque2` also had the effect that the limit of the loop would not be optimized, which meant that the iv-value (entry value of phi) in post loop would never collapse (either to constant or TOP), but always remain a range. Now that `Opaque2` is gone, it can happen that when the main-loop disappears, the limit collapses. The zero-trip guard of the post-loop would be false, but does not collapse because of the `OpaqueZeroTripGuard`. The post-loop can half-collapse, leaving an inconsistent graph below the zero-trip guard if. > > **Solution** > Have `OpaqueZeroTripGuardMainLoop` for main loop zero-trip guard, and `OpaqueZeroTripGuardPostLoop` for post-loop zero trip guard. Let `OpaqueZeroTripGuardPostLoop` remove itself once it cannot find the main-loop above it. We have these opaque nodes there to prevent collapsing of the zero-trip guards as long as the limits may still change, but after the main-loop is removed, no unrolling is done anymore, so the limit of the post-loop cannot change anymore, hence it is safe to remove the opaque node there. > > An alternative approach was to let the main-loop remove the opaque node of the post-loop's zero-trip guard. But that does not work reliably, as the main-loop may get removed during PhaseCCP, and the main-loop is simply removed as "useless". 
Hence the LoopNode of the main-loop does not have a chance to detect its death during IGVN. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: remove swap files, how did I ever commit them? ------------- Changes: - all: https://git.openjdk.org/jdk20/pull/22/files - new: https://git.openjdk.org/jdk20/pull/22/files/1dc52234..b83c995e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk20&pr=22&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk20&pr=22&range=00-01 Stats: 0 lines in 2 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk20/pull/22.diff Fetch: git fetch https://git.openjdk.org/jdk20 pull/22/head:pull/22 PR: https://git.openjdk.org/jdk20/pull/22 From bulasevich at openjdk.org Wed Dec 14 06:22:48 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 14 Dec 2022 06:22:48 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v17] In-Reply-To: References: Message-ID: > The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. 
> > Testing: jtreg hotspot&jdk, Renaissance benchmarks Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: a few minor changes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10025/files - new: https://git.openjdk.org/jdk/pull/10025/files/e9269942..3b9f84e0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=16 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=15-16 Stats: 5 lines in 2 files changed: 1 ins; 1 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/10025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10025/head:pull/10025 PR: https://git.openjdk.org/jdk/pull/10025 From fgao at openjdk.org Wed Dec 14 07:24:16 2022 From: fgao at openjdk.org (Fei Gao) Date: Wed, 14 Dec 2022 07:24:16 GMT Subject: RFR: 8298244: AArch64: Optimize vector implementation of AddReduction for floating point Message-ID: The patch optimizes floating-point AddReduction for Vector API on NEON via faddp instructions [1]. Take AddReductionVF with 128-bit as an example. Here is the assembly code before the patch: fadd s18, s17, s16 mov v19.s[0], v16.s[1] fadd s18, s18, s19 mov v19.s[0], v16.s[2] fadd s18, s18, s19 mov v19.s[0], v16.s[3] fadd s18, s18, s19 Here is the assembly code after the patch: faddp v19.4s, v16.4s, v16.4s faddp s18, v19.2s fadd s18, s18, s17 As we can see, the patch adds all vector elements via faddp instructions and then adds beginning value, which is different from the old code, i.e., adding vector elements sequentially from beginning to end. It helps reduce four instructions for each AddReductionVF. But it may concern us that the patch will cause precision loss and generate incorrect results if superword vectorizes these java operations, because Java specifies a clear standard about precision for floating-point add reduction, which requires that we must add vector elements sequentially from beginning to end. 
Fortunately, we can enjoy the benefit but don't need to pay for the precision loss. Here are the reasons: 1. [JDK-8275275](https://bugs.openjdk.org/browse/JDK-8275275) disabled AddReductionVF/D for superword on NEON since no direct NEON instructions support them and, consequently, it's not profitable to auto-vectorize them. So, the vector implementation of these two vector nodes is only used by Vector API. 2. Vector API relaxes the requirement for floating-point precision of `ADD` [2]. "The result of such an operation is a function both of the input values (vector and mask) as well as the order of the scalar operations applied to combine lane values. In such cases the order is intentionally not defined." "If the platform supports a vector instruction to add or multiply all values in the vector, or if there is some other efficient machine code sequence, then the JVM has the option of generating this machine code." To sum up, Vector API allows us to add all vector elements in an arbitrary order and then add the beginning value, to generate optimal machine code. Tier 1~3 passed with no new failures on Linux AArch64 platform. 
Here is the perf data of jmh benchmark [3] for the patch: Benchmark size Mode Cnt Before After Units Double128Vector.addReduction 1024 thrpt 5 2167.146 2717.873 ops/ms Float128Vector.addReduction 1024 thrpt 5 1706.253 4890.909 ops/ms Float64Vector.addReduction 1024 thrpt 5 1907.425 2732.577 ops/ms [1] https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--scalar---Floating-point-Add-Pair-of-elements--scalar-- https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--vector---Floating-point-Add-Pairwise--vector-- [2] https://docs.oracle.com/en/java/javase/19/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#fp_assoc [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float128Vector.java#L316 https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float64Vector.java#L316 https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Double128Vector.java#L316 ------------- Commit messages: - 8298244: AArch64: Optimize vector implementation of AddReduction for floating point Changes: https://git.openjdk.org/jdk/pull/11663/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11663&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298244 Stats: 508 lines in 5 files changed: 40 ins; 16 del; 452 mod Patch: https://git.openjdk.org/jdk/pull/11663.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11663/head:pull/11663 PR: https://git.openjdk.org/jdk/pull/11663 From thartmann at openjdk.org Wed Dec 14 09:14:44 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 14 Dec 2022 09:14:44 GMT Subject: [jdk20] RFR: 8298520: C2: assert(found_opaque == res) failed: wrong pattern In-Reply-To: References: Message-ID: On 
Tue, 13 Dec 2022 14:49:31 GMT, Roland Westrelin wrote: > The assert fires because CountedLoopNode::is_canonical_loop_entry() > finds an Opaque1 where it expects an OpaqueZeroTripGuard. That Opaque1 > guards the CountedLoopEnd of the pre loop: is_canonical_loop_entry() > should have stopped at the zero trip guard but walked past it. The > reason for that is the call for skip_predicates(): when > CountedLoopNode::skip_predicates_from_entry() encounters an If node > that has lost a projection, it assumes it's a predicate and moves to > the next one. In this case, the zero trip guard only has one > projection (the one that connects it through predicates to the loop > head), because igvn is in the process of updating a dead part of the > graph. > > To fix this, I propose that: > > 1- CountedLoopNode::is_canonical_loop_entry() fails when after > predicates it's at a CountedLoopEnd. > > 2- CountedLoopNode::skip_predicates_from_entry() stops at a zero trip > guard (identified by a condition that uses a OpaqueZeroTripGuard) > > Either 1- or 2- are good enough to fix this particular graph but I > propose doing both as making this logic more robust feels safer. > > I don't provide a test case as this seems to only happen when igvn > processes a dying subgraph in a very specific order. All tests passed. ------------- PR: https://git.openjdk.org/jdk20/pull/24 From bulasevich at openjdk.org Wed Dec 14 09:57:14 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 14 Dec 2022 09:57:14 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v16] In-Reply-To: References: Message-ID: On Tue, 13 Dec 2022 13:23:56 GMT, Evgeny Astigeevich wrote: >> Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 17 additional commits since the last revision: >> >> - minor api refactoring: start_scope and roll_back instead of position and set_position >> - buffer() returns const array >> - cleanup, rename >> - warning fix >> - add test for buffer grow >> - adding jtreg test for CompressedSparseDataReadStream impl >> - align java impl to cpp impl >> - rewrite the SparseDataWriteStream not to use _curr_byte >> - introduce and call flush() excplicitly, add the gtest >> - minor renaming. adding encoding examples table >> - ... and 7 more: https://git.openjdk.org/jdk/compare/f4a674b2...e9269942 > > src/hotspot/share/code/compressedStream.hpp line 121: > >> 119: >> 120: public: >> 121: CompressedSparseData(int position = 0) { > > As it is one-argument constructor, let it be `explicit CompressedSparseData`. ok > src/hotspot/share/code/compressedStream.hpp line 201: > >> 199: // Start grouped data. Return a byte offset position in the stream where grouped data begins >> 200: int start_scope() { >> 201: align(); // a side effect! > > I think we don't need the comment here. sure > src/hotspot/share/code/compressedStream.hpp line 209: > >> 207: _position = pos; >> 208: _bit_position = 0; >> 209: assert(_position < _size, "set_position is only used for rollback"); > > Should we change the assert and move it to the beginning of the function? > > assert(pos <= _position, "new position must be rollback the current position" right. thank you ------------- PR: https://git.openjdk.org/jdk/pull/10025 From roland at openjdk.org Wed Dec 14 10:02:55 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 14 Dec 2022 10:02:55 GMT Subject: [jdk20] RFR: 8298520: C2: assert(found_opaque == res) failed: wrong pattern In-Reply-To: References: Message-ID: On Wed, 14 Dec 2022 09:11:06 GMT, Tobias Hartmann wrote: >> The assert fires because CountedLoopNode::is_canonical_loop_entry() >> finds an Opaque1 where it expects an OpaqueZeroTripGuard. 
That Opaque1 >> guards the CountedLoopEnd of the pre loop: is_canonical_loop_entry() >> should have stopped at the zero trip guard but walked past it. The >> reason for that is the call for skip_predicates(): when >> CountedLoopNode::skip_predicates_from_entry() encounters an If node >> that has lost a projection, it assumes it's a predicate and moves to >> the next one. In this case, the zero trip guard only has one >> projection (the one that connects it through predicates to the loop >> head), because igvn is in the process of updating a dead part of the >> graph. >> >> To fix this, I propose that: >> >> 1- CountedLoopNode::is_canonical_loop_entry() fails when after >> predicates it's at a CountedLoopEnd. >> >> 2- CountedLoopNode::skip_predicates_from_entry() stops at a zero trip >> guard (identified by a condition that uses a OpaqueZeroTripGuard) >> >> Either 1- or 2- are good enough to fix this particular graph but I >> propose doing both as making this logic more robust feels safer. >> >> I don't provide a test case as this seems to only happen when igvn >> processes a dying subgraph in a very specific order. > > All tests passed. @TobiHartmann @chhagedorn thanks for the reviews and testing. ------------- PR: https://git.openjdk.org/jdk20/pull/24 From roland at openjdk.org Wed Dec 14 10:03:06 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 14 Dec 2022 10:03:06 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization Message-ID: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> This PR re-does 6312651 (Compiler should only use verified interface types for optimization) with a couple fixes I had pushed afterward (8297556 and 8297343) and fixes for some other issues. The trickiest one is a fix for 8297345 (C2: SIGSEGV in PhaseIdealLoop::push_pinned_nodes_thru_region) for which I added a test case. 
The crash occurs because a (If (Bool (CmpP (LoadKlass ..)))) only has a single projection. It lost the other projection because of a CheckCastPP that becomes top. Initially the pattern is, in pseudo code,: if (obj.klass == some_class) { obj = CheckCastPP#1(obj); } obj itself is a CheckCastPP that's pinned at a dominating if. That dominating if goes through split through phi. The LoadKlass for the pseudo code above also has control set to the dominating if being transformed. This result in: if (phi1 == some_class) { obj = CheckCastPP#1(phi2); } with phi1 = (Phi (LoadKlass obj) (LoadKlass obj)) and phi2 = (Phi obj obj) with obj = (CheckCastPP#2 obj') PhiNode::Ideal() transforms phi2 into a new CheckCastPP: (CheckCastPP#3 obj' obj') with control set to the region right above the if in the pseudo code above. There happens to be another CheckCastPP at the same control which casts obj' to a narrower type. So the new CheckCastPP#3 is replaced by that one (because of ConstraintCastNode::dominating_cast())and pseudo code becomes: if (phi1 == some_class) { obj = CheckCastPP#1(CheckCastPP#4(obj')); } and then: if (phi1 == some_class) { obj = top; } because the types of the 2 CheckCastPPs conflict. That would be ok if: phi1 == some_class would constant fold. It would if the test was: if (CheckCastPP#4(obj').klass == some_klass) { but because of split if, the (CmpP (LoadKlass ..)) and the CheckCastPP#1 ended up with 2 different object inputs that then were transformed differently. The fix I propose is to have split if clone the entire: (Bool (CmpP (LoadKlass (AddP ..)))) down the same way (Bool (CmpP ..)) is cloned down. After split if, the pseudo code becomes: if (phi.klass == some_class) { obj = CheckCastPP#1(phi); } The bug can't occur because the CheckCastPP and (CmpP (LoadKlass ..)) operate on the same phi input. The change in split_if.cpp implements that. The other fixes are: - arraycopynode.cpp: a crash happens because dest_offset and src_offset are the same. 
The call to transform that results in src_scale, causes src_offset (and thus dest_offset) to become dead. The fix is to add a hook node to preserve dest_offset. This is unrelated to 6312651 but it triggers with that change for some reason. - castnode.cpp: I removed CheckCastPPNode::Identity(), a piece of code that the change in the handling of interfaces make obsolete and that I missed in the PR for 6312651. - castnode.cpp: the change in CheckCastPPNode::Value() fixes a rare assert when during CCP, Value() is called with an input raw constant ptr. - type.cpp: a _klass = NULL field in arrays used to indicate only top or bottom but I changed that so _klass is only guaranteed non null for basic type arrays. The fix in type.cpp updates a piece of code that I didn't adapt to the new meaning of _klass = NULL. - the other changes are due to StressReflectiveCode. With 6312651, a CheckCastPP can fold to top if it sees a type for its input that conflicts with its own type. That wasn't the case before. So if a type check fails, a CheckCastPP will fold to top and the control flow branch it's in must die. That doesn't always happen with StressReflectiveCode: the CheckCastPP folds but not the control flow path. With ExpandSubTypeCheckAtParseTime on, that's because of a code path in LoadNode::Value() that's disabled with StressReflectiveCode. With ExpandSubTypeCheckAtParseTime off, it's because Compile::static_subtype_check() is always pessimistic with StressReflectiveCode but it's used by SubTypeCheckNode::sub() to find when a node can constant fold. 
------------- Commit messages: - more fixes - Revert "8297934: [BACKOUT] Compiler should only use verified interface types for optimization" Changes: https://git.openjdk.org/jdk/pull/11666/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11666&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297933 Stats: 2245 lines in 29 files changed: 1252 ins; 623 del; 370 mod Patch: https://git.openjdk.org/jdk/pull/11666.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11666/head:pull/11666 PR: https://git.openjdk.org/jdk/pull/11666 From roland at openjdk.org Wed Dec 14 10:06:48 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 14 Dec 2022 10:06:48 GMT Subject: [jdk20] Integrated: 8298520: C2: assert(found_opaque == res) failed: wrong pattern In-Reply-To: References: Message-ID: <7wCnsfrPbIpsvOU7iwxfjyeQ3_7QsP-osOCeFL6z490=.ac69cbea-d744-45fb-8664-7e8d53885731@github.com> On Tue, 13 Dec 2022 14:49:31 GMT, Roland Westrelin wrote: > The assert fires because CountedLoopNode::is_canonical_loop_entry() > finds an Opaque1 where it expects an OpaqueZeroTripGuard. That Opaque1 > guards the CountedLoopEnd of the pre loop: is_canonical_loop_entry() > should have stopped at the zero trip guard but walked past it. The > reason for that is the call for skip_predicates(): when > CountedLoopNode::skip_predicates_from_entry() encounters an If node > that has lost a projection, it assumes it's a predicate and moves to > the next one. In this case, the zero trip guard only has one > projection (the one that connects it through predicates to the loop > head), because igvn is in the process of updating a dead part of the > graph. > > To fix this, I propose that: > > 1- CountedLoopNode::is_canonical_loop_entry() fails when after > predicates it's at a CountedLoopEnd. 
> > 2- CountedLoopNode::skip_predicates_from_entry() stops at a zero trip > guard (identified by a condition that uses a OpaqueZeroTripGuard) > > Either 1- or 2- are good enough to fix this particular graph but I > propose doing both as making this logic more robust feels safer. > > I don't provide a test case as this seems to only happen when igvn > processes a dying subgraph in a very specific order. This pull request has now been integrated. Changeset: 27d49711 Author: Roland Westrelin URL: https://git.openjdk.org/jdk20/commit/27d4971182ab1cbe7e6bc40cd22c1c70661a3ab2 Stats: 21 lines in 2 files changed: 19 ins; 0 del; 2 mod 8298520: C2: assert(found_opaque == res) failed: wrong pattern Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk20/pull/24 From epeter at openjdk.org Wed Dec 14 10:14:56 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 14 Dec 2022 10:14:56 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears [v2] In-Reply-To: References: Message-ID: On Wed, 14 Dec 2022 06:21:47 GMT, Emanuel Peter wrote: >> We recently removed `Opaque2` nodes in [JDK-8294540](https://bugs.openjdk.org/browse/JDK-8294540). `Opaque2` nodes prevented some optimizations during loop-opts. The original idea was to prevent the use of both the un-incremented and incremented value of a loop phi after the loop, to reduce register pressure. But `Opaque2` also had the effect that the limit of the loop would not be optimized, which meant that the iv-value (entry value of phi) in post loop would never collapse (either to constant or TOP), but always remain a range. Now that `Opaque2` is gone, it can happen that when the main-loop disappears, the limit collapses. The zero-trip guard of the post-loop would be false, but does not collapse because of the `OpaqueZeroTripGuard`. The post-loop can half-collapse, leaving an inconsistent graph below the zero-trip guard if. 
>> >> **Solution** >> Have `OpaqueZeroTripGuardMainLoop` for main loop zero-trip guard, and `OpaqueZeroTripGuardPostLoop` for post-loop zero trip guard. Let `OpaqueZeroTripGuardPostLoop` remove itself once it cannot find the main-loop above it. We have these opaque nodes there to prevent collapsing of the zero-trip guards as long as the limits may still change, but after the main-loop is removed, no unrolling is done anymore, so the limit of the post-loop cannot change anymore, hence it is safe to remove the opaque node there. >> >> An alternative approach was to let the main-loop remove the opaque node of the post-loop's zero-trip guard. But that does not work reliably, as the main-loop may get removed during PhaseCCP, and the main-loop is simply removed as "useless". Hence the LoopNode of the main-loop does not have a chance to detect its death during IGVN. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > remove swap files, how did I ever commit them? This is an example of what happens with `test_001`:
Long loop -> loop-nest
inner loop of loop nest is PeelMainPost-ed
main loop is empty_loop -> removed
CCP after loop-opts:
detect that backedge of outer loop-nest is never taken. Hence, initial value to PeelMainPost suddenly changes to constant.
constant propagates through peeled iteration,
constant determines that zero-trip guard of main loop always goes to main loop
constant folds through empty_loop (previously main loop), exit value is now also constant.
constant arrives at zero-trip guard for post-loop, and also as the initial value for post loop.
Post-loop internals detect that that constant is outside trip-count phi range -> all becomes TOP.
But Opaque of zero trip guard protects the decay of the zero-trip guard.
End of CCP after loop-opts
Eventually, we have the if of the post-zero-trip guard with only one projection out (post loop projection has decayed).
------------- PR: https://git.openjdk.org/jdk20/pull/22 From aph at openjdk.org Wed Dec 14 10:42:53 2022 From: aph at openjdk.org (Andrew Haley) Date: Wed, 14 Dec 2022 10:42:53 GMT Subject: RFR: 8298244: AArch64: Optimize vector implementation of AddReduction for floating point In-Reply-To: References: Message-ID: On Wed, 14 Dec 2022 07:04:29 GMT, Fei Gao wrote: > The patch optimizes floating-point AddReduction for Vector API on NEON via faddp instructions [1]. > > Take AddReductionVF with 128-bit as an example. > > Here is the assembly code before the patch: > > fadd s18, s17, s16 > mov v19.s[0], v16.s[1] > fadd s18, s18, s19 > mov v19.s[0], v16.s[2] > fadd s18, s18, s19 > mov v19.s[0], v16.s[3] > fadd s18, s18, s19 > > > Here is the assembly code after the patch: > > faddp v19.4s, v16.4s, v16.4s > faddp s18, v19.2s > fadd s18, s18, s17 > > > As we can see, the patch adds all vector elements via faddp instructions and then adds beginning value, which is different from the old code, i.e., adding vector elements sequentially from beginning to end. It helps reduce four instructions for each AddReductionVF. > > But it may concern us that the patch will cause precision loss and generate incorrect results if superword vectorizes these java operations, because Java specifies a clear standard about precision for floating-point add reduction, which requires that we must add vector elements sequentially from beginning to end. Fortunately, we can enjoy the benefit but don't need to pay for the precision loss. Here are the reasons: > > 1. [JDK-8275275](https://bugs.openjdk.org/browse/JDK-8275275) disabled AddReductionVF/D for superword on NEON since no direct NEON instructions support them and, consequently, it's not profitable to auto-vectorize them. So, the vector implementation of these two vector nodes is only used by Vector API. > > 2. Vector API relaxes the requirement for floating-point precision of `ADD` [2]. 
"The result of such an operation is a function both of the input values (vector and mask) as well as the order of the scalar operations applied to combine lane values. In such cases the order is intentionally not defined." "If the platform supports a vector instruction to add or multiply all values in the vector, or if there is some other efficient machine code sequence, then the JVM has the option of generating this machine code." To sum up, Vector API allows us to add all vector elements in an arbitrary order and then add the beginning value, to generate optimal machine code. > > Tier 1~3 passed with no new failures on Linux AArch64 platform. > > Here is the perf data of jmh benchmark [3] for the patch: > > Benchmark size Mode Cnt Before After Units > Double128Vector.addReduction 1024 thrpt 5 2167.146 2717.873 ops/ms > Float128Vector.addReduction 1024 thrpt 5 1706.253 4890.909 ops/ms > Float64Vector.addReduction 1024 thrpt 5 1907.425 2732.577 ops/ms > > [1] https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--scalar---Floating-point-Add-Pair-of-elements--scalar-- > https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--vector---Floating-point-Add-Pairwise--vector-- > [2] https://docs.oracle.com/en/java/javase/19/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#fp_assoc > [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float128Vector.java#L316 > https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float64Vector.java#L316 > https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Double128Vector.java#L316 src/hotspot/cpu/aarch64/aarch64_vector.ad line 2923: > 2921: // reduction addD > 2922: // 
Specially, the current vector implementation of Op_AddReductionVD works for > 2923: // Vector API only because of the non-sequential order of element addition. Suggestion: // Floating-point addition is not associative, so we cannot auto-vectorize // floating-point reduce-add. AddReductionVD is only generated by explicit // vector operations. src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 129: > 127: // Specially, the current vector implementation of Op_AddReductionVD/F works for > 128: // Vector API only. If re-enabling them for superword, precision loss will happen > 129: // because current generated code does not add elements sequentially from beginning to end. Suggestion: // The vector implementation of Op_AddReductionVD/F is for the Vector API only. // It is not suitable for auto-vectorization because it does not add the elements // in the same order as sequential code, and FP addition is non-associative. src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 1815: > 1813: // reduction addF > 1814: // Specially, the current vector implementation of Op_AddReductionVF works for > 1815: // Vector API only because of the non-sequential order of element addition. Suggestion: // Floating-point addition is not associative, so we cannot auto-vectorize // floating-point reduce-add. AddReductionVD is only generated by explicit // vector operations. src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 1860: > 1858: // reduction addD > 1859: // Specially, the current vector implementation of Op_AddReductionVD works for > 1860: // Vector API only because of the non-sequential order of element addition. Same here.
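To make the non-associativity point concrete, here is a small scalar sketch. It is not code from the patch; the class and method names are hypothetical. `sequential` adds the lanes left to right starting from the initial value, as scalar Java code would, while `pairwise4` mirrors the order of the two `faddp` steps followed by the final `fadd` with the start value.

```java
final class ReductionOrderSketch {
    // Strict left-to-right order, starting from the initial value.
    static float sequential(float init, float[] v) {
        float s = init;
        for (float x : v) {
            s += x;
        }
        return s;
    }

    // Pairwise order for four lanes: (v0 + v1) + (v2 + v3), then + init.
    static float pairwise4(float init, float[] v) {
        float lo = v[0] + v[1];
        float hi = v[2] + v[3];
        return (lo + hi) + init;
    }

    public static void main(String[] args) {
        // 1e8f is exactly representable, but neighbouring floats are 8 apart,
        // so adding 1.0f to it is lost to rounding.
        float[] v = {1e8f, 1.0f, -1e8f, 1.0f};
        System.out.println(sequential(0.0f, v)); // prints 1.0
        System.out.println(pairwise4(0.0f, v));  // prints 0.0
    }
}
```

Both results are valid outcomes for a Vector API `ADD` reduction, since the order is intentionally left undefined there, but only the first one matches what a scalar Java loop must produce.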
------------- PR: https://git.openjdk.org/jdk/pull/11663 From duke at openjdk.org Wed Dec 14 10:51:30 2022 From: duke at openjdk.org (Matthijs Bijman) Date: Wed, 14 Dec 2022 10:51:30 GMT Subject: RFR: 8297791: update _max_classes in NodeClasses Message-ID: Updates _max_classes to reflect the newly introduced LShift node from JDK-8259609 ------------- Commit messages: - 8297791: update _max_classes in NodeClasses Changes: https://git.openjdk.org/jdk/pull/11669/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11669&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297791 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11669.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11669/head:pull/11669 PR: https://git.openjdk.org/jdk/pull/11669 From thartmann at openjdk.org Wed Dec 14 11:24:56 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 14 Dec 2022 11:24:56 GMT Subject: RFR: 8297791: update _max_classes in node type system In-Reply-To: References: Message-ID: On Wed, 14 Dec 2022 10:43:24 GMT, Matthijs Bijman wrote: > Updates _max_classes to reflect the newly introduced LShift node from JDK-8259609 Looks good and trivial. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11669 From epeter at openjdk.org Wed Dec 14 11:29:37 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 14 Dec 2022 11:29:37 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: <2V7ma_fjnRUKET3GAoLgWAUStOprLKR4xdFVstXqffw=.7874f2a2-f125-4616-b580-bba3b5d036b1@github.com> References: <2V7ma_fjnRUKET3GAoLgWAUStOprLKR4xdFVstXqffw=.7874f2a2-f125-4616-b580-bba3b5d036b1@github.com> Message-ID: <_3qNM9nS9h_MWW9QZBPpoOfTupOFuhkjiNjq2MPo_1w=.66454e17-9622-4510-afdb-ce92f667f4cd@github.com> On Tue, 13 Dec 2022 19:23:08 GMT, Vladimir Kozlov wrote: >> We recently removed `Opaque2` nodes in [JDK-8294540](https://bugs.openjdk.org/browse/JDK-8294540). `Opaque2` nodes prevented some optimizations during loop-opts. The original idea was to prevent the use of both the un-incremented and incremented value of a loop phi after the loop, to reduce register pressure. But `Opaque2` also had the effect that the limit of the loop would not be optimized, which meant that the iv-value (entry value of phi) in post loop would never collapse (either to constant or TOP), but always remain a range. Now that `Opaque2` is gone, it can happen that when the main-loop disappears, the limit collapses. The zero-trip guard of the post-loop would be false, but does not collapse because of the `OpaqueZeroTripGuard`. The post-loop can half-collapse, leaving an inconsistent graph below the zero-trip guard if. >> >> **Solution** >> Have `OpaqueZeroTripGuardMainLoop` for main loop zero-trip guard, and `OpaqueZeroTripGuardPostLoop` for post-loop zero trip guard. Let `OpaqueZeroTripGuardPostLoop` remove itself once it cannot find the main-loop above it. 
We have these opaque nodes there to prevent collapsing of the zero-trip guards as long as the limits may still change, but after the main-loop is removed, no unrolling is done anymore, so the limit of the post-loop cannot change anymore, hence it is safe to remove the opaque node there. >> >> An alternative approach was to let the main-loop remove the opaque node of the post-loop's zero-trip guard. But that does not work reliably, as the main-loop may get removed during PhaseCCP, and the main-loop is simply removed as "useless". Hence the LoopNode of the main-loop does not have a chance to detect its death during IGVN. > > "The zero-trip guard of the post-loop would be false, but does not collapse because of the OpaqueZeroTripGuard. " > Can we just look only for this condition (guard is false) instead of looking through graph to see if main loop disappeared? @vnkozlov @rwestrel I am now very unsure about this fix myself. On the one hand, Christian and I do not understand why the OpaqueZeroTripGuardPostLoopNode would be there in the first place: This opaque and the iv of the post-loop take their input from the Phi that merges main-loop exit-value and pre-loop exit value (if zero-trip-guard of main decides to not enter main). 1. If this Phi ever decays to a constant: We can never change iv for post-loop. 2. If this Phi ever is a range, but a range outside the trip-count of the post-loop, then the post-loop phi will detect this, and replace itself with top, and deconstruct the loop from the inside. But the OpaqueZeroTripGuardPostLoopNode still guards the zero-trip-guard from decaying. We get an inconsistent graph. So it seems to me we either guard both the iv and the zero-trip-guard or nothing. So far we don't really understand why that opaque is there; it has been there for 22 years. Completely removing it could be a solution, but it is unclear if that is safe. Maybe it does something we are not aware of.
After re-analyzing, I cannot find a case where the main LoopNode is removed during CCP, it all seems to be during IGVN. So I can try out the fix Roland suggested: when the main LoopNode detects its removal, let it find (graph-walk to) the zero-trip-guard of the post-loop, and remove the opaque node there. I'll need to see how easy that is, given that different things could have decayed during IGVN already, so the graph-walk could be a bit tricky. ------------- PR: https://git.openjdk.org/jdk20/pull/22 From duke at openjdk.org Wed Dec 14 11:35:01 2022 From: duke at openjdk.org (Matthijs Bijman) Date: Wed, 14 Dec 2022 11:35:01 GMT Subject: Integrated: 8297791: update _max_classes in node type system In-Reply-To: References: Message-ID: On Wed, 14 Dec 2022 10:43:24 GMT, Matthijs Bijman wrote: > Updates _max_classes to reflect the newly introduced LShift node from JDK-8259609 This pull request has now been integrated. Changeset: d32d6c02 Author: Matthijs Bijman Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/d32d6c028de4aed8d1f1ef70734d43f056a0ff34 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8297791: update _max_classes in node type system Reviewed-by: thartmann ------------- PR: https://git.openjdk.org/jdk/pull/11669 From thartmann at openjdk.org Wed Dec 14 12:46:36 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 14 Dec 2022 12:46:36 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization In-Reply-To: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: On Wed, 14 Dec 2022 09:55:27 GMT, Roland Westrelin wrote: > This PR re-does 6312651 (Compiler should only use verified interface > types for optimization) with a couple fixes I had pushed afterward > (8297556 and 8297343) and fixes for some other issues. 
> > The trickiest one is a fix for 8297345 (C2: SIGSEGV in > PhaseIdealLoop::push_pinned_nodes_thru_region) for which I added a > test case. The crash occurs because a (If (Bool (CmpP (LoadKlass ..)))) > only has a single projection. It lost the other projection because of > a CheckCastPP that becomes top. Initially the pattern is, in pseudo > code,: > > if (obj.klass == some_class) { > obj = CheckCastPP#1(obj); > } > > obj itself is a CheckCastPP that's pinned at a dominating if. That > dominating if goes through split through phi. The LoadKlass for the > pseudo code above also has control set to the dominating if being > transformed. This result in: > > if (phi1 == some_class) { > obj = CheckCastPP#1(phi2); > } > > with phi1 = (Phi (LoadKlass obj) (LoadKlass obj)) and phi2 = (Phi obj obj) > with obj = (CheckCastPP#2 obj') > > PhiNode::Ideal() transforms phi2 into a new CheckCastPP: > (CheckCastPP#3 obj' obj') with control set to the region right above > the if in the pseudo code above. There happens to be another > CheckCastPP at the same control which casts obj' to a narrower > type. So the new CheckCastPP#3 is replaced by that one (because of > ConstraintCastNode::dominating_cast())and pseudo code becomes: > > if (phi1 == some_class) { > obj = CheckCastPP#1(CheckCastPP#4(obj')); > } > > and then: > > if (phi1 == some_class) { > obj = top; > } > > because the types of the 2 CheckCastPPs conflict. That would be ok if: > > phi1 == some_class > > would constant fold. It would if the test was: > > if (CheckCastPP#4(obj').klass == some_klass) { > > but because of split if, the (CmpP (LoadKlass ..)) and the > CheckCastPP#1 ended up with 2 different object inputs that then were > transformed differently. The fix I propose is to have split if clone the entire: > > (Bool (CmpP (LoadKlass (AddP ..)))) > > down the same way (Bool (CmpP ..)) is cloned down. 
After split if, the > pseudo code becomes: > > if (phi.klass == some_class) { > obj = CheckCastPP#1(phi); > } > > The bug can't occur because the CheckCastPP and (CmpP (LoadKlass ..)) > operate on the same phi input. The change in split_if.cpp implements > that. > > The other fixes are: > > - arraycopynode.cpp: a crash happens because dest_offset and > src_offset are the same. The call to transform that results in > src_scale, causes src_offset (and thus dest_offset) to become > dead. The fix is to add a hook node to preserve dest_offset. This is > unrelated to 6312651 but it triggers with that change for some > reason. > > - castnode.cpp: I removed CheckCastPPNode::Identity(), a piece of code > that the change in the handling of interfaces make obsolete and that > I missed in the PR for 6312651. > > - castnode.cpp: the change in CheckCastPPNode::Value() fixes a rare > assert when during CCP, Value() is called with an input raw constant > ptr. > > - type.cpp: a _klass = NULL field in arrays used to indicate only top > or bottom but I changed that so _klass is only guaranteed non null > for basic type arrays. The fix in type.cpp updates a piece of code > that I didn't adapt to the new meaning of _klass = NULL. > > - the other changes are due to StressReflectiveCode. With 6312651, a > CheckCastPP can fold to top if it sees a type for its input that > conflicts with its own type. That wasn't the case before. So if a > type check fails, a CheckCastPP will fold to top and the control > flow branch it's in must die. That doesn't always happen with > StressReflectiveCode: the CheckCastPP folds but not the control flow > path. With ExpandSubTypeCheckAtParseTime on, that's because of a > code path in LoadNode::Value() that's disabled with > StressReflectiveCode. 
With ExpandSubTypeCheckAtParseTime off, it's > because Compile::static_subtype_check() is always pessimistic with > StressReflectiveCode but it's used by SubTypeCheckNode::sub() to > find when a node can constant fold. Changes requested by thartmann (Reviewer). test/hotspot/jtreg/compiler/types/TestCheckCastPPBecomesTOP.java line 31: > 29: * @run main/othervm -XX:-TieredCompilation -XX:-UseOnStackReplacement -XX:-BackgroundCompilation > 30: * -XX:CompileOnly=TestCheckCastPPBecomesTOP::test1 -XX:LoopMaxUnroll=0 > 31: * -XX:CompileCommand=dontinline,TestCheckCastPPBecomesTOP::notInlined -XX:+UseParallelGC TestCheckCastPPBecomesTOP This test fails when executed with a different GC (you need to remove the explicit setting or add an `@requires`): Error occurred during initialization of VM Multiple garbage collectors selected (I did some more testing on top of what I already did last week - that's why it didn't show up earlier) ------------- PR: https://git.openjdk.org/jdk/pull/11666 From roland at openjdk.org Wed Dec 14 13:44:56 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 14 Dec 2022 13:44:56 GMT Subject: RFR: 8297582: C2: very slow compilation due to type system verification code Message-ID: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> The problem here is that for arrays, verification code computes a number of meet operations that grows exponentially with the number of dimensions while the number of unique meet operations that need to be computed is a linear function of the number of dimensions: // With verification code, the meet of A and B causes the computation of: // 1- meet(A, B) // 2- meet(B, A) // 3- meet(dual(meet(A, B)), dual(A)) // 4- meet(dual(meet(A, B)), dual(B)) // 5- meet(dual(A), dual(B)) // 6- meet(dual(B), dual(A)) // 7- meet(dual(meet(dual(A), dual(B))), A) // 8- meet(dual(meet(dual(A), dual(B))), B) // // In addition the meet of A[] and B[] requires the computation of the meet of A 
and B.
//
// The meet of A[] and B[] triggers the computation of:
//  1- meet(A[], B[])
//   1.1- meet(A, B)
//   1.2- meet(B, A)
//   1.3- meet(dual(meet(A, B)), dual(A))
//   1.4- meet(dual(meet(A, B)), dual(B))
//   1.5- meet(dual(A), dual(B))
//   1.6- meet(dual(B), dual(A))
//   1.7- meet(dual(meet(dual(A), dual(B))), A)
//   1.8- meet(dual(meet(dual(A), dual(B))), B)
//  2- meet(B[], A[])
//   2.1- meet(B, A) = 1.2
//   2.2- meet(A, B) = 1.1
//   2.3- meet(dual(meet(B, A)), dual(B)) = 1.4
//   2.4- meet(dual(meet(B, A)), dual(A)) = 1.3
//   2.5- meet(dual(B), dual(A)) = 1.6
//   2.6- meet(dual(A), dual(B)) = 1.5
//   2.7- meet(dual(meet(dual(B), dual(A))), B) = 1.8
//   2.8- meet(dual(meet(dual(B), dual(A))), A) = 1.7
// etc.

There are a lot of redundant computations being performed. The fix I propose is simply to cache the result of meet computations. So when the type system code is called to compute, for instance, the meet of A[][] and B[][], the cache starts empty. Then as the meet computations proceed, the cache is filled with meet result for meet of A[] and B[], meet of A and B etc. Once the type system code returns with the result for A[][] and B[][], the cache is cleared.

With this, the test case I added goes from "never seems to finish" to "completes in no time".
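The effect of the cache can be seen in a toy model. This is a sketch under simplifying assumptions, not HotSpot code: types are plain strings such as `A[][]`, the only verification step modelled is the symmetry check (meet(A, B) against meet(B, A)), duals are left out, and all names are hypothetical. Even so, the number of meets actually computed grows exponentially with the number of dimensions without a cache and linearly with one.

```java
import java.util.HashMap;
import java.util.Map;

final class MeetCacheSketch {
    static long computed;              // meets actually computed (cache misses)
    static Map<String, String> cache;  // null means caching is disabled

    static String stripDim(String t) {
        return t.substring(0, t.length() - 2);
    }

    // Unverified meet. For arrays it needs the (verified) meet of the elements.
    static String rawMeet(String a, String b) {
        String key = a + "|" + b;
        if (cache != null) {
            String hit = cache.get(key);
            if (hit != null) {
                return hit;
            }
        }
        computed++;
        String r;
        if (a.endsWith("[]") && b.endsWith("[]")) {
            r = verifiedMeet(stripDim(a), stripDim(b)) + "[]";
        } else {
            r = a.equals(b) ? a : "Object";
        }
        if (cache != null) {
            cache.put(key, r);
        }
        return r;
    }

    // Meet plus a symmetry check, standing in for the verification code.
    static String verifiedMeet(String a, String b) {
        String r = rawMeet(a, b);
        String s = rawMeet(b, a);
        if (!r.equals(s)) {
            throw new AssertionError("meet is not symmetric");
        }
        return r;
    }

    static String arrayOf(String elem, int dims) {
        StringBuilder sb = new StringBuilder(elem);
        for (int i = 0; i < dims; i++) {
            sb.append("[]");
        }
        return sb.toString();
    }

    // The cache starts empty and is dropped once the outermost meet returns.
    static long countMeets(int dims, boolean useCache) {
        computed = 0;
        cache = useCache ? new HashMap<String, String>() : null;
        verifiedMeet(arrayOf("A", dims), arrayOf("B", dims));
        cache = null;
        return computed;
    }

    public static void main(String[] args) {
        for (int d = 0; d <= 10; d += 5) {
            System.out.println(d + " dims: uncached=" + countMeets(d, false)
                + " cached=" + countMeets(d, true));
        }
    }
}
```

In this model the uncached count is 2^(d+2) - 2 for d dimensions, while the cached count is 2(d+1). With the full set of eight verification meets per level the uncached growth is steeper still.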
------------- Depends on: https://git.openjdk.org/jdk/pull/11666 Commit messages: - fix Changes: https://git.openjdk.org/jdk/pull/11673/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11673&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297582 Stats: 211 lines in 5 files changed: 188 ins; 13 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/11673.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11673/head:pull/11673 PR: https://git.openjdk.org/jdk/pull/11673 From roland at openjdk.org Wed Dec 14 13:50:47 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 14 Dec 2022 13:50:47 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v2] In-Reply-To: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: > This PR re-does 6312651 (Compiler should only use verified interface > types for optimization) with a couple fixes I had pushed afterward > (8297556 and 8297343) and fixes for some other issues. > > The trickiest one is a fix for 8297345 (C2: SIGSEGV in > PhaseIdealLoop::push_pinned_nodes_thru_region) for which I added a > test case. The crash occurs because a (If (Bool (CmpP (LoadKlass ..)))) > only has a single projection. It lost the other projection because of > a CheckCastPP that becomes top. Initially the pattern is, in pseudo > code,: > > if (obj.klass == some_class) { > obj = CheckCastPP#1(obj); > } > > obj itself is a CheckCastPP that's pinned at a dominating if. That > dominating if goes through split through phi. The LoadKlass for the > pseudo code above also has control set to the dominating if being > transformed. 
This result in: > > if (phi1 == some_class) { > obj = CheckCastPP#1(phi2); > } > > with phi1 = (Phi (LoadKlass obj) (LoadKlass obj)) and phi2 = (Phi obj obj) > with obj = (CheckCastPP#2 obj') > > PhiNode::Ideal() transforms phi2 into a new CheckCastPP: > (CheckCastPP#3 obj' obj') with control set to the region right above > the if in the pseudo code above. There happens to be another > CheckCastPP at the same control which casts obj' to a narrower > type. So the new CheckCastPP#3 is replaced by that one (because of > ConstraintCastNode::dominating_cast())and pseudo code becomes: > > if (phi1 == some_class) { > obj = CheckCastPP#1(CheckCastPP#4(obj')); > } > > and then: > > if (phi1 == some_class) { > obj = top; > } > > because the types of the 2 CheckCastPPs conflict. That would be ok if: > > phi1 == some_class > > would constant fold. It would if the test was: > > if (CheckCastPP#4(obj').klass == some_klass) { > > but because of split if, the (CmpP (LoadKlass ..)) and the > CheckCastPP#1 ended up with 2 different object inputs that then were > transformed differently. The fix I propose is to have split if clone the entire: > > (Bool (CmpP (LoadKlass (AddP ..)))) > > down the same way (Bool (CmpP ..)) is cloned down. After split if, the > pseudo code becomes: > > if (phi.klass == some_class) { > obj = CheckCastPP#1(phi); > } > > The bug can't occur because the CheckCastPP and (CmpP (LoadKlass ..)) > operate on the same phi input. The change in split_if.cpp implements > that. > > The other fixes are: > > - arraycopynode.cpp: a crash happens because dest_offset and > src_offset are the same. The call to transform that results in > src_scale, causes src_offset (and thus dest_offset) to become > dead. The fix is to add a hook node to preserve dest_offset. This is > unrelated to 6312651 but it triggers with that change for some > reason. 
> > - castnode.cpp: I removed CheckCastPPNode::Identity(), a piece of code > that the change in the handling of interfaces make obsolete and that > I missed in the PR for 6312651. > > - castnode.cpp: the change in CheckCastPPNode::Value() fixes a rare > assert when during CCP, Value() is called with an input raw constant > ptr. > > - type.cpp: a _klass = NULL field in arrays used to indicate only top > or bottom but I changed that so _klass is only guaranteed non null > for basic type arrays. The fix in type.cpp updates a piece of code > that I didn't adapt to the new meaning of _klass = NULL. > > - the other changes are due to StressReflectiveCode. With 6312651, a > CheckCastPP can fold to top if it sees a type for its input that > conflicts with its own type. That wasn't the case before. So if a > type check fails, a CheckCastPP will fold to top and the control > flow branch it's in must die. That doesn't always happen with > StressReflectiveCode: the CheckCastPP folds but not the control flow > path. With ExpandSubTypeCheckAtParseTime on, that's because of a > code path in LoadNode::Value() that's disabled with > StressReflectiveCode. With ExpandSubTypeCheckAtParseTime off, it's > because Compile::static_subtype_check() is always pessimistic with > StressReflectiveCode but it's used by SubTypeCheckNode::sub() to > find when a node can constant fold. 
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: test fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11666/files - new: https://git.openjdk.org/jdk/pull/11666/files/05bc4b97..4b86eb58 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11666&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11666&range=00-01 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11666.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11666/head:pull/11666 PR: https://git.openjdk.org/jdk/pull/11666 From roland at openjdk.org Wed Dec 14 13:50:48 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 14 Dec 2022 13:50:48 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: On Wed, 14 Dec 2022 12:43:01 GMT, Tobias Hartmann wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> test fix > > test/hotspot/jtreg/compiler/types/TestCheckCastPPBecomesTOP.java line 31: > >> 29: * @run main/othervm -XX:-TieredCompilation -XX:-UseOnStackReplacement -XX:-BackgroundCompilation >> 30: * -XX:CompileOnly=TestCheckCastPPBecomesTOP::test1 -XX:LoopMaxUnroll=0 >> 31: * -XX:CompileCommand=dontinline,TestCheckCastPPBecomesTOP::notInlined -XX:+UseParallelGC TestCheckCastPPBecomesTOP > > This test fails when executed with a different GC (you need to remove the explicit setting or add an `@requires`): > > Error occurred during initialization of VM > Multiple garbage collectors selected > > > (I did some more testing on top of what I already did last week - that's why it didn't show up earlier) I added a `@requires` ------------- PR: https://git.openjdk.org/jdk/pull/11666 From roland at openjdk.org Wed Dec 14 
13:56:49 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 14 Dec 2022 13:56:49 GMT Subject: RFR: 8297582: C2: very slow compilation due to type system verification code [v2] In-Reply-To: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> References: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> Message-ID: > The problem here is that for arrays, verification code computes a > number of meet operations that grows exponentially with the number of > dimensions while the number of unique meet operations that need to be > computed is a linear function of the number of dimensions: > > > // With verification code, the meet of A and B causes the computation of: > // 1- meet(A, B) > // 2- meet(B, A) > // 3- meet(dual(meet(A, B)), dual(A)) > // 4- meet(dual(meet(A, B)), dual(B)) > // 5- meet(dual(A), dual(B)) > // 6- meet(dual(B), dual(A)) > // 7- meet(dual(meet(dual(A), dual(B))), A) > // 8- meet(dual(meet(dual(A), dual(B))), B) > // > // In addition the meet of A[] and B[] requires the computation of the meet of A and B. > // > // The meet of A[] and B[] triggers the computation of: > // 1- meet(A[], B[]) > // 1.1- meet(A, B) > // 1.2- meet(B, A) > // 1.3- meet(dual(meet(A, B)), dual(A)) > // 1.4- meet(dual(meet(A, B)), dual(B)) > // 1.5- meet(dual(A), dual(B)) > // 1.6- meet(dual(B), dual(A)) > // 1.7- meet(dual(meet(dual(A), dual(B))), A) > // 1.8- meet(dual(meet(dual(A), dual(B))), B) > // 2- meet(B[], A[]) > // 2.1- meet(B, A) = 1.2 > // 2.2- meet(A, B) = 1.1 > // 2.3- meet(dual(meet(B, A)), dual(B)) = 1.4 > // 2.4- meet(dual(meet(B, A)), dual(A)) = 1.3 > // 2.5- meet(dual(B), dual(A)) = 1.6 > // 2.6- meet(dual(A), dual(B)) = 1.5 > // 2.7- meet(dual(meet(dual(B), dual(A))), B) = 1.8 > // 2.8- meet(dual(meet(dual(B), dual(A))), A) = 1.7 > // etc. > > > > There are a lot of redundant computations being performed.
The fix I > propose is simply to cache the result of meet computations. So when > the type system code is called to compute, for instance, the meet of > A[][] and B[][], the cache starts empty. Then as the meet computations > proceed, the cache is filled with meet result for meet of A[] and B[], > meet of A and B etc. Once the type system code returns with the result > for A[][] and B[][], the cache is cleared. > > With this, the test case I added goes from "never seems to finish" > to "completes in no time". Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: build fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11673/files - new: https://git.openjdk.org/jdk/pull/11673/files/9a8633db..83529981 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11673&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11673&range=00-01 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11673.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11673/head:pull/11673 PR: https://git.openjdk.org/jdk/pull/11673 From eastigeevich at openjdk.org Wed Dec 14 14:10:51 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 14 Dec 2022 14:10:51 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v17] In-Reply-To: References: Message-ID: On Wed, 14 Dec 2022 06:22:48 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%.
>> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > a few minor changes src/hotspot/share/code/compressedStream.cpp line 119: > 117: bool CompressedSparseDataReadStream::read_zero() { > 118: if (_buffer[_position] & (1 << (7 - _bit_position))) { > 119: return 0; // not a zero data As the return type is `bool`, let's use `false` and `true` instead of `0` and `1`. src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CompressedSparseDataReadStream.java line 34: > 32: } > 33: > 34: int bit_pos = 0; Should we use the Java naming convention here? Also, should we follow the Java Code Style? I see the indent style is C++: 2 spaces. src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CompressedSparseDataReadStream.java line 36: > 34: int bit_pos = 0; > 35: > 36: protected short buffer(int position) { Private? src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CompressedSparseDataReadStream.java line 40: > 38: } > 39: > 40: public byte readByteImpl() { Should it be private? src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CompressedSparseDataReadStream.java line 55: > 53: byte b = readByteImpl(); > 54: int result = b & 0x3f; > 55: for (int i = 0; (0 != ((i == 0) ? (b & 0x40) : (b & 0x80))); i++) { It is difficult to read this. Could we rewrite it into:

    int result = b & 0x3f;
    if ((b & 0x40) != 0) {
      b = readByteImpl();
      result |= (b & 0x7f) << 6;
      for (int i = 1; (b & 0x80) != 0; ++i) {
        b = readByteImpl();
        result |= ((b & 0x7f) << (6 + 7 * i));
      }
    }

BTW, we can simplify the loop in the `CompressedSparseDataReadStream::read_int()` in the same way. src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CompressedSparseDataReadStream.java line 62: > 60: } > 61: > 62: boolean readZero() { Private?
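The rewritten loop suggested above can be tried in isolation. The sketch below is an assumption-laden stand-in, not the SA class: it reads from a plain byte array starting at position 0, ignores the zero-bit compression and the bit position, and uses hypothetical names. The first byte carries 6 payload bits with 0x40 as its continuation flag; every following byte carries 7 payload bits with 0x80 as its continuation flag.

```java
final class VarIntDecodeSketch {
    // Decodes one integer from the start of buf, following the loop shape
    // proposed in the review comment.
    static int readInt(byte[] buf) {
        int pos = 0;
        byte b = buf[pos++];
        int result = b & 0x3f;                 // low 6 bits of the first byte
        if ((b & 0x40) != 0) {                 // first-byte continuation flag
            b = buf[pos++];
            result |= (b & 0x7f) << 6;
            for (int i = 1; (b & 0x80) != 0; ++i) {  // later continuation flag
                b = buf[pos++];
                result |= (b & 0x7f) << (6 + 7 * i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same bytes as the "0xf000 -> 48" case in the PR's test.
        System.out.println(readInt(new byte[] {(byte) 0xf0, 0x00})); // prints 48
    }
}
```

Splitting the first byte out of the loop removes the `i == 0` special case from the loop condition, which is the readability point of the suggestion.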
test/hotspot/jtreg/serviceability/sa/TestCompressedSparseDataReadStream.java line 51: > 49: assertEquals(in.readInt(), 0); // zero bit -> 0 > 50: in.setPosition(2); > 51: assertEquals(in.readInt(), 48); // 0xf000 -> 48 Should we test `readBoolean` and `readByte`? I understand they are implemented with `readInt`. They might incorrectly convert data returned by `readInt`. ------------- PR: https://git.openjdk.org/jdk/pull/10025 From thartmann at openjdk.org Wed Dec 14 14:18:57 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 14 Dec 2022 14:18:57 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: On Wed, 14 Dec 2022 13:50:47 GMT, Roland Westrelin wrote: >> This PR re-does 6312651 (Compiler should only use verified interface >> types for optimization) with a couple fixes I had pushed afterward >> (8297556 and 8297343) and fixes for some other issues. >> >> The trickiest one is a fix for 8297345 (C2: SIGSEGV in >> PhaseIdealLoop::push_pinned_nodes_thru_region) for which I added a >> test case. The crash occurs because a (If (Bool (CmpP (LoadKlass ..)))) >> only has a single projection. It lost the other projection because of >> a CheckCastPP that becomes top. Initially the pattern is, in pseudo >> code,: >> >> if (obj.klass == some_class) { >> obj = CheckCastPP#1(obj); >> } >> >> obj itself is a CheckCastPP that's pinned at a dominating if. That >> dominating if goes through split through phi. The LoadKlass for the >> pseudo code above also has control set to the dominating if being >> transformed. 
This result in: >> >> if (phi1 == some_class) { >> obj = CheckCastPP#1(phi2); >> } >> >> with phi1 = (Phi (LoadKlass obj) (LoadKlass obj)) and phi2 = (Phi obj obj) >> with obj = (CheckCastPP#2 obj') >> >> PhiNode::Ideal() transforms phi2 into a new CheckCastPP: >> (CheckCastPP#3 obj' obj') with control set to the region right above >> the if in the pseudo code above. There happens to be another >> CheckCastPP at the same control which casts obj' to a narrower >> type. So the new CheckCastPP#3 is replaced by that one (because of >> ConstraintCastNode::dominating_cast())and pseudo code becomes: >> >> if (phi1 == some_class) { >> obj = CheckCastPP#1(CheckCastPP#4(obj')); >> } >> >> and then: >> >> if (phi1 == some_class) { >> obj = top; >> } >> >> because the types of the 2 CheckCastPPs conflict. That would be ok if: >> >> phi1 == some_class >> >> would constant fold. It would if the test was: >> >> if (CheckCastPP#4(obj').klass == some_klass) { >> >> but because of split if, the (CmpP (LoadKlass ..)) and the >> CheckCastPP#1 ended up with 2 different object inputs that then were >> transformed differently. The fix I propose is to have split if clone the entire: >> >> (Bool (CmpP (LoadKlass (AddP ..)))) >> >> down the same way (Bool (CmpP ..)) is cloned down. After split if, the >> pseudo code becomes: >> >> if (phi.klass == some_class) { >> obj = CheckCastPP#1(phi); >> } >> >> The bug can't occur because the CheckCastPP and (CmpP (LoadKlass ..)) >> operate on the same phi input. The change in split_if.cpp implements >> that. >> >> The other fixes are: >> >> - arraycopynode.cpp: a crash happens because dest_offset and >> src_offset are the same. The call to transform that results in >> src_scale, causes src_offset (and thus dest_offset) to become >> dead. The fix is to add a hook node to preserve dest_offset. This is >> unrelated to 6312651 but it triggers with that change for some >> reason. 
>> >> - castnode.cpp: I removed CheckCastPPNode::Identity(), a piece of code >> that the change in the handling of interfaces make obsolete and that >> I missed in the PR for 6312651. >> >> - castnode.cpp: the change in CheckCastPPNode::Value() fixes a rare >> assert when during CCP, Value() is called with an input raw constant >> ptr. >> >> - type.cpp: a _klass = NULL field in arrays used to indicate only top >> or bottom but I changed that so _klass is only guaranteed non null >> for basic type arrays. The fix in type.cpp updates a piece of code >> that I didn't adapt to the new meaning of _klass = NULL. >> >> - the other changes are due to StressReflectiveCode. With 6312651, a >> CheckCastPP can fold to top if it sees a type for its input that >> conflicts with its own type. That wasn't the case before. So if a >> type check fails, a CheckCastPP will fold to top and the control >> flow branch it's in must die. That doesn't always happen with >> StressReflectiveCode: the CheckCastPP folds but not the control flow >> path. With ExpandSubTypeCheckAtParseTime on, that's because of a >> code path in LoadNode::Value() that's disabled with >> StressReflectiveCode. With ExpandSubTypeCheckAtParseTime off, it's >> because Compile::static_subtype_check() is always pessimistic with >> StressReflectiveCode but it's used by SubTypeCheckNode::sub() to >> find when a node can constant fold. 
> > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > test fix test/hotspot/jtreg/compiler/types/TestCheckCastPPBecomesTOP.java line 28: > 26: * @bug 8297345 > 27: * @summary C2: SIGSEGV in PhaseIdealLoop::push_pinned_nodes_thru_region > 28: * @requires vm.gc.Parallel I think it should be: Suggestion: * @requires vm.gc == null | vm.gc.Parallel ------------- PR: https://git.openjdk.org/jdk/pull/11666 From roland at openjdk.org Wed Dec 14 15:44:57 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 14 Dec 2022 15:44:57 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: On Wed, 14 Dec 2022 14:15:04 GMT, Tobias Hartmann wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> test fix > > test/hotspot/jtreg/compiler/types/TestCheckCastPPBecomesTOP.java line 28: > >> 26: * @bug 8297345 >> 27: * @summary C2: SIGSEGV in PhaseIdealLoop::push_pinned_nodes_thru_region >> 28: * @requires vm.gc.Parallel > > I think it should be: > > Suggestion: > > * @requires vm.gc == null | vm.gc.Parallel FTR, I discussed this privately with Tobias. His concern is that the test would not be run if not passed `-XX:+UseParallelGC`. That doesn't seem to be an issue though as the test runs as expected when passed no option or `-XX:+UseParallelGC` and doesn't run with a command line option that enables some other gc. 
------------- PR: https://git.openjdk.org/jdk/pull/11666 From kvn at openjdk.org Wed Dec 14 17:05:31 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 14 Dec 2022 17:05:31 GMT Subject: RFR: 8296318: use-def assert: special case undetected loops nested in infinite loops [v3] In-Reply-To: References: Message-ID: On Wed, 14 Dec 2022 05:59:38 GMT, Emanuel Peter wrote: >> `PhaseCFG::verify` checks that the nodes scheduled into the blocks have "use-after-def". >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1376 >> >> As long as the control flow has no loops, this should always hold. >> We make sure to not check the assert if the block-head is a `LoopNode`, and the use `n` is a `Phi`, since we may have inputs from the backedge, which would be "use-before-def" in the block-scheduling order, but that is to be expected. >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1364 >> >> **Problem** >> This assumes that all loops are properly detected, if a loop was not detected the block-head may only be a `RegionNode`, and the assert can fail (see regression test). Why are not all loops detected? >> During `PhaseIdealLoop::build_loop_tree`, we detect all loops, but not always attach infinite loops with their sub-loops to the loop tree, and then the loop-head `Regions` are not converted to `LoopNodes` in `PhaseIdealLoop::beautify_loops`. >> This behaviour is expected, see also https://github.com/openjdk/jdk/pull/11473. >> Thus, the assert fires, but it should not. >> >> **Solution** >> Add special casing to assert: also accept if the `Region` is in an infinite subgraph, and we are looking at a `Phi` as use (the def could be values from the backedge). > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains five additional commits since the last revision: > > - infinite_subgraph check made ASSERT only > - Merge branch 'master' into JDK-8296318 > - review suggestions > - made assert more precise > - 8296318: use-def assert: special case undetected loops nested in infinite loops Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11642 From epeter at openjdk.org Wed Dec 14 17:26:58 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 14 Dec 2022 17:26:58 GMT Subject: RFR: 8296318: use-def assert: special case undetected loops nested in infinite loops [v3] In-Reply-To: References: Message-ID: <-7bP5D_nu-a3n8HOGUmWGMTiJ7X73X3vswclkFRPIfE=.8b5d7575-247d-4b55-be1c-65ca23191ad5@github.com> On Wed, 14 Dec 2022 17:02:37 GMT, Vladimir Kozlov wrote: >> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: >> >> - infinite_subgraph check made ASSERT only >> - Merge branch 'master' into JDK-8296318 >> - review suggestions >> - made assert more precise >> - 8296318: use-def assert: special case undetected loops nested in infinite loops > > Good. Thanks @vnkozlov @chhagedorn for the reviews and suggestions! ------------- PR: https://git.openjdk.org/jdk/pull/11642 From epeter at openjdk.org Wed Dec 14 17:28:53 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 14 Dec 2022 17:28:53 GMT Subject: Integrated: 8296318: use-def assert: special case undetected loops nested in infinite loops In-Reply-To: References: Message-ID: On Tue, 13 Dec 2022 07:49:47 GMT, Emanuel Peter wrote: > `PhaseCFG::verify` checks that the nodes scheduled into the blocks have "use-after-def". 
> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1376 > > As long as the control flow has no loops, this should always hold. > We make sure to not check the assert if the block-head is a `LoopNode`, and the use `n` is a `Phi`, since we may have inputs from the backedge, which would be "use-before-def" in the block-scheduling order, but that is to be expected. > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/block.cpp#L1364 > > **Problem** > This assumes that all loops are properly detected, if a loop was not detected the block-head may only be a `RegionNode`, and the assert can fail (see regression test). Why are not all loops detected? > During `PhaseIdealLoop::build_loop_tree`, we detect all loops, but not always attach infinite loops with their sub-loops to the loop tree, and then the loop-head `Regions` are not converted to `LoopNodes` in `PhaseIdealLoop::beautify_loops`. > This behaviour is expected, see also https://github.com/openjdk/jdk/pull/11473. > Thus, the assert fires, but it should not. > > **Solution** > Add special casing to assert: also accept if the `Region` is in an infinite subgraph, and we are looking at a `Phi` as use (the def could be values from the backedge). This pull request has now been integrated. 
Changeset: 736fcd49 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/736fcd49f7cd3aa6f226b2e088415eaf05f97ee8 Stats: 138 lines in 5 files changed: 113 ins; 23 del; 2 mod 8296318: use-def assert: special case undetected loops nested in infinite loops Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/11642 From kvn at openjdk.org Thu Dec 15 00:30:04 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 00:30:04 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: <6s26yXwhYVTz9ad47H4QO9c7miH4895vpvugZMVuUL8=.f78f83d9-e191-4047-af4a-f54c70940d78@github.com> On Wed, 14 Dec 2022 15:42:25 GMT, Roland Westrelin wrote: >> test/hotspot/jtreg/compiler/types/TestCheckCastPPBecomesTOP.java line 28: >> >>> 26: * @bug 8297345 >>> 27: * @summary C2: SIGSEGV in PhaseIdealLoop::push_pinned_nodes_thru_region >>> 28: * @requires vm.gc.Parallel >> >> I think it should be: >> >> Suggestion: >> >> * @requires vm.gc == null | vm.gc.Parallel > > FTR, I discussed this privately with Tobias. His concern is that the test would not be run if not passed `-XX:+UseParallelGC`. That doesn't seem to be an issue though as the test runs as expected when passed no option or `-XX:+UseParallelGC` and doesn't run with a command line option that enables some other gc. Which means it will run with default GC: G1. 
------------- PR: https://git.openjdk.org/jdk/pull/11666 From kvn at openjdk.org Thu Dec 15 01:09:10 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 01:09:10 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: On Wed, 14 Dec 2022 13:50:47 GMT, Roland Westrelin wrote: >> This PR re-does 6312651 (Compiler should only use verified interface >> types for optimization) with a couple fixes I had pushed afterward >> (8297556 and 8297343) and fixes for some other issues. >> >> The trickiest one is a fix for 8297345 (C2: SIGSEGV in >> PhaseIdealLoop::push_pinned_nodes_thru_region) for which I added a >> test case. The crash occurs because a (If (Bool (CmpP (LoadKlass ..)))) >> only has a single projection. It lost the other projection because of >> a CheckCastPP that becomes top. Initially the pattern is, in pseudo >> code,: >> >> if (obj.klass == some_class) { >> obj = CheckCastPP#1(obj); >> } >> >> obj itself is a CheckCastPP that's pinned at a dominating if. That >> dominating if goes through split through phi. The LoadKlass for the >> pseudo code above also has control set to the dominating if being >> transformed. This result in: >> >> if (phi1 == some_class) { >> obj = CheckCastPP#1(phi2); >> } >> >> with phi1 = (Phi (LoadKlass obj) (LoadKlass obj)) and phi2 = (Phi obj obj) >> with obj = (CheckCastPP#2 obj') >> >> PhiNode::Ideal() transforms phi2 into a new CheckCastPP: >> (CheckCastPP#3 obj' obj') with control set to the region right above >> the if in the pseudo code above. There happens to be another >> CheckCastPP at the same control which casts obj' to a narrower >> type. 
So the new CheckCastPP#3 is replaced by that one (because of >> ConstraintCastNode::dominating_cast())and pseudo code becomes: >> >> if (phi1 == some_class) { >> obj = CheckCastPP#1(CheckCastPP#4(obj')); >> } >> >> and then: >> >> if (phi1 == some_class) { >> obj = top; >> } >> >> because the types of the 2 CheckCastPPs conflict. That would be ok if: >> >> phi1 == some_class >> >> would constant fold. It would if the test was: >> >> if (CheckCastPP#4(obj').klass == some_klass) { >> >> but because of split if, the (CmpP (LoadKlass ..)) and the >> CheckCastPP#1 ended up with 2 different object inputs that then were >> transformed differently. The fix I propose is to have split if clone the entire: >> >> (Bool (CmpP (LoadKlass (AddP ..)))) >> >> down the same way (Bool (CmpP ..)) is cloned down. After split if, the >> pseudo code becomes: >> >> if (phi.klass == some_class) { >> obj = CheckCastPP#1(phi); >> } >> >> The bug can't occur because the CheckCastPP and (CmpP (LoadKlass ..)) >> operate on the same phi input. The change in split_if.cpp implements >> that. >> >> The other fixes are: >> >> - arraycopynode.cpp: a crash happens because dest_offset and >> src_offset are the same. The call to transform that results in >> src_scale, causes src_offset (and thus dest_offset) to become >> dead. The fix is to add a hook node to preserve dest_offset. This is >> unrelated to 6312651 but it triggers with that change for some >> reason. >> >> - castnode.cpp: I removed CheckCastPPNode::Identity(), a piece of code >> that the change in the handling of interfaces make obsolete and that >> I missed in the PR for 6312651. >> >> - castnode.cpp: the change in CheckCastPPNode::Value() fixes a rare >> assert when during CCP, Value() is called with an input raw constant >> ptr. >> >> - type.cpp: a _klass = NULL field in arrays used to indicate only top >> or bottom but I changed that so _klass is only guaranteed non null >> for basic type arrays. 
The fix in type.cpp updates a piece of code >> that I didn't adapt to the new meaning of _klass = NULL. >> >> - the other changes are due to StressReflectiveCode. With 6312651, a >> CheckCastPP can fold to top if it sees a type for its input that >> conflicts with its own type. That wasn't the case before. So if a >> type check fails, a CheckCastPP will fold to top and the control >> flow branch it's in must die. That doesn't always happen with >> StressReflectiveCode: the CheckCastPP folds but not the control flow >> path. With ExpandSubTypeCheckAtParseTime on, that's because of a >> code path in LoadNode::Value() that's disabled with >> StressReflectiveCode. With ExpandSubTypeCheckAtParseTime off, it's >> because Compile::static_subtype_check() is always pessimistic with >> StressReflectiveCode but it's used by SubTypeCheckNode::sub() to >> find when a node can constant fold. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > test fix @rwestrel Can you push split-if/CheckCastPP fix separately, before these changes? It is not directly related to this work. src/hotspot/share/ci/ciInstanceKlass.cpp line 745: > 743: Array* interfaces = ik->transitive_interfaces(); > 744: Arena* arena = CURRENT_ENV->arena(); > 745: int len = interfaces->length() + (is_interface() ? 1 : 0); I think you need to cache `interfaces->length()` in local variable to use in following loop to make sure it is the same value. I concern that an other thread may change it. 
------------- PR: https://git.openjdk.org/jdk/pull/11666 From kvn at openjdk.org Thu Dec 15 01:43:09 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 01:43:09 GMT Subject: RFR: 8297582: C2: very slow compilation due to type system verification code [v2] In-Reply-To: References: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> Message-ID: On Wed, 14 Dec 2022 13:56:49 GMT, Roland Westrelin wrote: >> The problem here is that for arrays, verification code computes a >> number of meet operations that grows exponentially with the number of >> dimensions while the number of unique meet operations that need to be >> computed is a linear function of the number of dimensions: >> >> >> // With verification code, the meet of A and B causes the computation of: >> // 1- meet(A, B) >> // 2- meet(B, A) >> // 3- meet(dual(meet(A, B)), dual(A)) >> // 4- meet(dual(meet(A, B)), dual(B)) >> // 5- meet(dual(A), dual(B)) >> // 6- meet(dual(B), dual(A)) >> // 7- meet(dual(meet(dual(A), dual(B))), A) >> // 8- meet(dual(meet(dual(A), dual(B))), B) >> // >> // In addition the meet of A[] and B[] requires the computation of the meet of A and B. >> // >> // The meet of A[] and B[] triggers the computation of: >> // 1- meet(A[], B[][) >> // 1.1- meet(A, B) >> // 1.2- meet(B, A) >> // 1.3- meet(dual(meet(A, B)), dual(A)) >> // 1.4- meet(dual(meet(A, B)), dual(B)) >> // 1.5- meet(dual(A), dual(B)) >> // 1.6- meet(dual(B), dual(A)) >> // 1.7- meet(dual(meet(dual(A), dual(B))), A) >> // 1.8- meet(dual(meet(dual(A), dual(B))), B) >> // 2- meet(B[], A[]) >> // 2.1- meet(B, A) = 1.2 >> // 2.2- meet(A, B) = 1.1 >> // 2.3- meet(dual(meet(B, A)), dual(B)) = 1.4 >> // 2.4- meet(dual(meet(B, A)), dual(A)) = 1.3 >> // 2.5- meet(dual(B), dual(A)) = 1.6 >> // 2.6- meet(dual(A), dual(B)) = 1.5 >> // 2.7- meet(dual(meet(dual(B), dual(A))), B) = 1.8 >> // 2.8- meet(dual(meet(dual(B), dual(A))), B) = 1.7 >> // etc. 
>> >> >> >> There are a lot of redundant computations being performed. The fix I >> propose is simply to cache the result of meet computations. So when >> the type system code is called to compute, for instance, the meet of >> A[][] and B[][], the cache starts empty. Then as the meet computations >> proceed, the cache is filled with meet result for meet of A[] and B[], >> meet of A and B etc. Once the type system code returns with the result >> for A[][] and B[][], the cache is cleared. >> >> With this, the test case I added goes from "never seem to ever finish" >> to "complete in no time". > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > build fix src/hotspot/share/opto/compile.cpp line 654: > 652: #endif > 653: #ifdef ASSERT > 654: , _type_verif_cache(comp_arena(), 2, 0, VerifyMeetResult()) I assume it is typo: `_verif_`. Please use `_verify_`. src/hotspot/share/opto/compile.cpp line 1084: > 1082: _phase_optimize_finished = false; > 1083: _exception_backedge = false; > 1084: _type_depth = 0; Please use `_type_verify_depth` to show that it is related to the cache. src/hotspot/share/opto/type.cpp line 845: > 843: } > 844: > 845: class TypeVerif { Please change to `VerifyMeetMark`. src/hotspot/share/opto/type.cpp line 918: > 916: TypeVerif verif(C); > 917: #endif > 918: Can `VerifyMeetResult` class be local to `Type` class instead of `Compile` since it is used only locally here and you reset cache each time anyway? (`Compile` is become very big with verification code from all parts of compiler). I thought you build the cache for duration of compilation then it may make sense to keep it in `Compile`. But you are resetting it after each `meet` so you don't need to have it in `Compile`.
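The caching idea from the quoted PR description can be illustrated with a toy model. This is only a sketch in Java, not the HotSpot code: types are plain strings such as `A[][]`, `verifiedMeet` stands in for a meet with a symmetry check, and `MeetCacheSketch`, `countCalls` and the string-keyed cache are hypothetical names chosen for the illustration. It assumes, as described above, that the meet of two array types needs the (verified) meet of their element types.

```java
import java.util.HashMap;
import java.util.Map;

public class MeetCacheSketch {
    static long calls;                   // number of meets actually computed
    static Map<String, String> cache;    // null means "no caching"

    // Meet plus a symmetry check, loosely modelled on the verification
    // described in the thread: meet(a, b) is also computed as meet(b, a).
    static String verifiedMeet(String a, String b) {
        String r = meet(a, b);
        String s = meet(b, a);
        if (!r.equals(s)) {
            throw new AssertionError("meet is not symmetric");
        }
        return r;
    }

    static String meet(String a, String b) {
        String key = a + "|" + b;
        if (cache != null && cache.containsKey(key)) {
            return cache.get(key);       // reuse a result computed earlier
        }
        calls++;
        String result;
        if (a.endsWith("[]") && b.endsWith("[]")) {
            // The meet of two arrays needs the verified meet of the elements.
            String elemA = a.substring(0, a.length() - 2);
            String elemB = b.substring(0, b.length() - 2);
            result = verifiedMeet(elemA, elemB) + "[]";
        } else {
            result = a.equals(b) ? a : "Object";
        }
        if (cache != null) {
            cache.put(key, result);
        }
        return result;
    }

    // Counts computed meets for A[]...[] and B[]...[] with `dims` dimensions.
    static long countCalls(int dims, boolean useCache) {
        calls = 0;
        cache = useCache ? new HashMap<>() : null;   // the cache starts empty
        verifiedMeet("A" + "[]".repeat(dims), "B" + "[]".repeat(dims));
        cache = null;                                // cleared once the top-level meet returns
        return calls;
    }

    public static void main(String[] args) {
        for (int dims = 1; dims <= 10; dims++) {
            System.out.println(dims + " dims: uncached=" + countCalls(dims, false)
                    + " cached=" + countCalls(dims, true));
        }
    }
}
```

In this model the uncached count doubles with every extra dimension, while the cached count grows by two per dimension, which is the exponential-versus-linear gap the PR describes. The real verification code performs more meets per level (the dual checks), so the actual constants differ.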
------------- PR: https://git.openjdk.org/jdk/pull/11673 From duke at openjdk.org Thu Dec 15 03:01:49 2022 From: duke at openjdk.org (SUN Guoyun) Date: Thu, 15 Dec 2022 03:01:49 GMT Subject: RFR: JDK-8298813: [C2] Converting double to float cause a loss of precision and resulting crypto.aes scores fluctuate Message-ID: Hi all, in C2, converting a double to a float causes a loss of precision:

./chaitin.cpp:221
_high_frequency_lrg = MIN2(double(OPTO_LRG_HIGH_FREQ), _cfg.get_outer_loop_frequency());
Here, _high_frequency_lrg is of type float, so the assignment may lose precision. The value is later used here:

./coalesce.cpp:379
if( lrg._maxfreq >= _phc.high_frequency_lrg() ) {
   ...
}
Here, lrg._maxfreq is of type double, so _high_frequency_lrg is converted back to double for the comparison. Because of the precision already lost in _high_frequency_lrg, the condition may evaluate to either true or false. I observed two cases when testing SPECjvm2008 crypto.aes. Case 1:

//chaitin.cpp:221
// fcvt.s.d $f0,$f0 #double->float
d = 16.994714324523816
f = 16.9947147

//coalesce.cpp:379
// fcvt.d.s $f0,$f0 #float->double
// fcmp.sle.d $fcc2,$f0,$f1
(gdb) i r fa0
fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.994714736938477}
(gdb) i r fa1
fa1 {f = 0x0, d = 0x10} {f = -7.68722312e-24, d = 16.994714324523816}
Case 2:

//chaitin.cpp:221
// fcvt.s.d $f0,$f0
d = 16.996332681816536
f = 16.9963322

//coalesce.cpp
// fcvt.d.s $f0,$f0
// fcmp.sle.d $fcc2,$f0,$f1
(gdb) i r fa0
fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.996332168579102}
(gdb) i r fa1
fa1 {f = 0x0, d = 0x10} {f = -1.73570044e-14, d = 16.996332681816536}
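
The effect visible in the two register dumps can be reproduced outside the VM. The following is a minimal sketch in Java, not HotSpot code: `throughFloat` and `reachesThreshold` are hypothetical helpers, and the sketch assumes, as the dumps suggest, that the frequency being compared is the same double that was earlier rounded into the float field.

```java
public class FloatRoundTripSketch {
    // Mimics storing a double threshold in a float field and widening it
    // back to double for the comparison.
    static double throughFloat(double d) {
        return (double) (float) d;
    }

    // Hypothetical stand-in for the check `lrg._maxfreq >= high_frequency_lrg()`.
    static boolean reachesThreshold(double maxFreq, double threshold) {
        return maxFreq >= threshold;
    }

    public static void main(String[] args) {
        // The two frequencies reported in the cases above.
        double[] samples = {16.994714324523816, 16.996332681816536};
        for (double d : samples) {
            double widened = throughFloat(d);
            System.out.println(d + " -> " + widened
                    + " | float-stored threshold reached: " + reachesThreshold(d, widened)
                    + " | double threshold reached: " + reachesThreshold(d, d));
        }
    }
}
```

For the first value the float rounds up, so the widened threshold ends up above the original and the comparison fails; for the second it rounds down and the comparison succeeds. With the threshold kept as a double, both comparisons give the same answer.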
The above two cases result in different block generation (case 2 can insert new SpillCopyNodes), and as a result the crypto.aes score fluctuates. This patch fixes the problem. Please help review it. Thanks, Sun Guoyun ------------- Commit messages: - 8298813: [C2] Converting double to float cause a loss of precision and resulting crypto.aes scores fluctuate Changes: https://git.openjdk.org/jdk/pull/11685/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11685&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298813 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11685.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11685/head:pull/11685 PR: https://git.openjdk.org/jdk/pull/11685 From kvn at openjdk.org Thu Dec 15 08:22:06 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 08:22:06 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears [v2] In-Reply-To: References: Message-ID: On Wed, 14 Dec 2022 06:21:47 GMT, Emanuel Peter wrote: >> We recently removed `Opaque2` nodes in [JDK-8294540](https://bugs.openjdk.org/browse/JDK-8294540). `Opaque2` nodes prevented some optimizations during loop-opts. The original idea was to prevent the use of both the un-incremented and incremented value of a loop phi after the loop, to reduce register pressure. But `Opaque2` also had the effect that the limit of the loop would not be optimized, which meant that the iv-value (entry value of phi) in post loop would never collapse (either to constant or TOP), but always remain a range. Now that `Opaque2` is gone, it can happen that when the main-loop disappears, the limit collapses. The zero-trip guard of the post-loop would be false, but does not collapse because of the `OpaqueZeroTripGuard`. The post-loop can half-collapse, leaving an inconsistent graph below the zero-trip guard if.
>> >> **Solution** >> Have `OpaqueZeroTripGuardMainLoop` for main loop zero-trip guard, and `OpaqueZeroTripGuardPostLoop` for post-loop zero trip guard. Let `OpaqueZeroTripGuardPostLoop` remove itself once it cannot find the main-loop above it. We have these opaque nodes there to prevent collapsing of the zero-trip guards as long as the limits may still change, but after the main-loop is removed, no unrolling is done anymore, so the limit of the post-loop cannot change anymore, hence it is safe to remove the opaque node there. >> >> An alternative approach was to let the main-loop remove the opaque node of the post-loop's zero-trip guard. But that does not work reliably, as the main-loop may get removed during PhaseCCP, and the main-loop is simply removed as "useless". Hence the LoopNode of the main-loop does not have a chance to detect its death during IGVN. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > remove swap files, how did I ever commit them? Based on comment and original changes it checks only iv from main loop exit and not Phi as you pointed: Node *incr = main_end ->incr(); ... // Step A2: Build a zero-trip guard for the post-loop. After leaving the // main-loop, the post-loop may not execute at all. We 'opaque' the incr // (the main-loop trip-counter exit value) because we will be changing // the exit value (via unrolling) so we cannot constant-fold away the zero // trip guard until all unrolling is done. Node *zer_opaq = new (2) OpaqueNode(incr); It seems incorrect based on what you said and I think you can test your suggestion to completely remove it. Saying that, the current code looks the same. Where Phi comes from? 
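For readers less familiar with the terminology, the main-loop, post-loop and zero-trip guard structure discussed above can be written out at source level. This is only an analogy sketched in Java: C2 performs the transformation on its IR, the pre-loop is omitted here, and `ZeroTripGuardSketch` and `sum` are hypothetical names. The unroll factor of 4 is an arbitrary choice for the illustration.

```java
public class ZeroTripGuardSketch {
    // Sums a[0..n) with a main loop unrolled by 4, followed by a post loop
    // that handles the remaining 0 to 3 elements.
    static long sum(int[] a) {
        int n = a.length;
        long total = 0;
        int i = 0;

        // Zero-trip guard of the main loop: skip the loop entirely if not
        // even one unrolled iteration fits.
        if (i + 4 <= n) {
            do {
                total += a[i];
                total += a[i + 1];
                total += a[i + 2];
                total += a[i + 3];
                i += 4;
            } while (i + 4 <= n);
        }

        // Zero-trip guard of the post loop: it tests the value of i that
        // leaves the main loop (or the initial value if the main loop was
        // skipped), so the post loop may not execute at all.
        if (i < n) {
            do {
                total += a[i];
                i++;
            } while (i < n);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[] {1, 2, 3, 4, 5, 6, 7}));
    }
}
```

In the compiler, the value tested by the post-loop guard is hidden behind an opaque node while unrolling can still change the main loop's exit value; the thread is about when that opaque node can safely go away.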
------------- PR: https://git.openjdk.org/jdk20/pull/22 From epeter at openjdk.org Thu Dec 15 08:25:06 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 15 Dec 2022 08:25:06 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears [v2] In-Reply-To: References: Message-ID: On Wed, 14 Dec 2022 06:21:47 GMT, Emanuel Peter wrote: >> We recently removed `Opaque2` nodes in [JDK-8294540](https://bugs.openjdk.org/browse/JDK-8294540). `Opaque2` nodes prevented some optimizations during loop-opts. The original idea was to prevent the use of both the un-incremented and incremented value of a loop phi after the loop, to reduce register pressure. But `Opaque2` also had the effect that the limit of the loop would not be optimized, which meant that the iv-value (entry value of phi) in post loop would never collapse (either to constant or TOP), but always remain a range. Now that `Opaque2` is gone, it can happen that when the main-loop disappears, the limit collapses. The zero-trip guard of the post-loop would be false, but does not collapse because of the `OpaqueZeroTripGuard`. The post-loop can half-collapse, leaving an inconsistent graph below the zero-trip guard if. >> >> **Solution** >> Have `OpaqueZeroTripGuardMainLoop` for main loop zero-trip guard, and `OpaqueZeroTripGuardPostLoop` for post-loop zero trip guard. Let `OpaqueZeroTripGuardPostLoop` remove itself once it cannot find the main-loop above it. We have these opaque nodes there to prevent collapsing of the zero-trip guards as long as the limits may still change, but after the main-loop is removed, no unrolling is done anymore, so the limit of the post-loop cannot change anymore, hence it is safe to remove the opaque node there. >> >> An alternative approach was to let the main-loop remove the opaque node of the post-loop's zero-trip guard. 
But that does not work reliably, as the main-loop may get removed during PhaseCCP, and the main-loop is simply removed as "useless". Hence the LoopNode of the main-loop does not have a chance to detect its death during IGVN. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > remove swap files, how did I ever commit them? I think you are referring to the Phi that belongs to the region just above the post loop? There we merge "main not taken" and "main exit". ------------- PR: https://git.openjdk.org/jdk20/pull/22 From roland at openjdk.org Thu Dec 15 08:32:06 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 15 Dec 2022 08:32:06 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: On Thu, 15 Dec 2022 01:05:55 GMT, Vladimir Kozlov wrote: > @rwestrel Can you push split-if/CheckCastPP fix separately, before these changes? It is not directly related to this work. The split if issue is not there without the rest of the change. Do you want it as a separate change anyway? ------------- PR: https://git.openjdk.org/jdk/pull/11666 From roland at openjdk.org Thu Dec 15 08:35:06 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 15 Dec 2022 08:35:06 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v2] In-Reply-To: <6s26yXwhYVTz9ad47H4QO9c7miH4895vpvugZMVuUL8=.f78f83d9-e191-4047-af4a-f54c70940d78@github.com> References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> <6s26yXwhYVTz9ad47H4QO9c7miH4895vpvugZMVuUL8=.f78f83d9-e191-4047-af4a-f54c70940d78@github.com> Message-ID: On Thu, 15 Dec 2022 00:27:01 GMT, Vladimir Kozlov wrote: >> FTR, I discussed this privately with Tobias. 
His concern is that the test would not be run if not passed `-XX:+UseParallelGC`. That doesn't seem to be an issue though as the test runs as expected when passed no option or `-XX:+UseParallelGC` and doesn't run with a command line option that enables some other gc. > > Yes, it is correct way when you need to specify GC. Which is the correct way? `@requires vm.gc == null | vm.gc.Parallel` or `@requires vm.gc.Parallel`? ------------- PR: https://git.openjdk.org/jdk/pull/11666 From roland at openjdk.org Thu Dec 15 08:38:06 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 15 Dec 2022 08:38:06 GMT Subject: RFR: 8297582: C2: very slow compilation due to type system verification code [v2] In-Reply-To: References: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> Message-ID: On Thu, 15 Dec 2022 01:28:27 GMT, Vladimir Kozlov wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> build fix > > src/hotspot/share/opto/type.cpp line 918: > >> 916: TypeVerif verif(C); >> 917: #endif >> 918: > > Can `VerifyMeetResult` class be local to `Type` class instead of `Compile` since it is used only locally here and you reset cache each time anyway? (`Compile` is become very big with verification code from all parts of compiler). > > I thought you build the cache for duration of compilation then it may make sense to keep it in `Compile`. But you are resetting it after each `meet` so you don't need to have it in `Compile`. What I could do is have a pointer to the cache as a field in Compile but move the new class declaration out of compile.hpp in type.[ch]pp? Does that sound ok to you? 
------------- PR: https://git.openjdk.org/jdk/pull/11673 From kvn at openjdk.org Thu Dec 15 08:43:04 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 08:43:04 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: On Thu, 15 Dec 2022 08:28:57 GMT, Roland Westrelin wrote: > > @rwestrel Can you push split-if/CheckCastPP fix separately, before these changes? It is not directly related to this work. > > The split if issue is not there without the rest of the change. Do you want it as a separate change anyway? I understand that. But it is significant change and I want to push it first and test separately. ------------- PR: https://git.openjdk.org/jdk/pull/11666 From kvn at openjdk.org Thu Dec 15 09:01:06 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 09:01:06 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> <6s26yXwhYVTz9ad47H4QO9c7miH4895vpvugZMVuUL8=.f78f83d9-e191-4047-af4a-f54c70940d78@github.com> Message-ID: On Thu, 15 Dec 2022 08:32:12 GMT, Roland Westrelin wrote: >> Yes, it is correct way when you need to specify GC. > > Which is the correct way? `@requires vm.gc == null | vm.gc.Parallel` or `@requires vm.gc.Parallel`? 
`@requires vm.gc.Parallel` [JDK-8160088](https://bugs.openjdk.org/browse/JDK-8160088) changed it from original `@requires vm.gc == "null" | vm.gc == "Parallel"` ------------- PR: https://git.openjdk.org/jdk/pull/11666 From kvn at openjdk.org Thu Dec 15 09:03:07 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 09:03:07 GMT Subject: RFR: 8297582: C2: very slow compilation due to type system verification code [v2] In-Reply-To: References: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> Message-ID: On Thu, 15 Dec 2022 08:35:29 GMT, Roland Westrelin wrote: >> src/hotspot/share/opto/type.cpp line 918: >> >>> 916: TypeVerif verif(C); >>> 917: #endif >>> 918: >> >> Can `VerifyMeetResult` class be local to `Type` class instead of `Compile` since it is used only locally here and you reset cache each time anyway? (`Compile` is become very big with verification code from all parts of compiler). >> >> I thought you build the cache for duration of compilation then it may make sense to keep it in `Compile`. But you are resetting it after each `meet` so you don't need to have it in `Compile`. > > What I could do is have a pointer to the cache as a field in Compile but move the new class declaration out of compile.hpp in type.[ch]pp? Does that sound ok to you? Yes, have only pointer to cache in Compile class is fine. ------------- PR: https://git.openjdk.org/jdk/pull/11673 From pli at openjdk.org Thu Dec 15 09:08:44 2022 From: pli at openjdk.org (Pengfei Li) Date: Thu, 15 Dec 2022 09:08:44 GMT Subject: RFR: 8298632: [TESTBUG] Add IR checks in jtreg vectorization tests Message-ID: In JDK-8183390, we introduced a new auto-vectorization testing framework in hotspot jtreg tests. The new framework provides us a much simpler way to add new test cases of auto-vectorizable loops. But the previous patch just added a check of the correctness of the result after vectorization. 
There is no check about whether the code is vectorized or not. This patch adds IR checks to verify C2's vectorization ability on test cases inside this framework. With this patch, each test method annotated with `@Test` is verified in two ways. First, it's invoked twice and the return results from the interpreter and C2 compiled code are compared. Second, the count of expected vector IR is checked by the IR framework if the test method has IR rule annotation. Ideally, we should check IR rules on all platforms. But in practice, the vectorization ability can be quite different on different platforms, or different generations of CPUs of one platform. So in this patch, we only add vectorizable checks for AArch64. Checks for other platforms (such as x86) can still be added later with more CPU feature conditions. We also add some negative rules (or in-vectorizable rules) with `@IR(failOn=...` on cases that should not be vectorized on any platform. A few more test cases are added within this patch as well. We tested the new IR rules on below kinds of CPUs. 
- AArch64 w/ 512-bit SVE - AArch64 w/ 128-bit SVE - AArch64 w/o SVE (NEON only) - x86 ------------- Commit messages: - 8298632: [TESTBUG] Add IR checks in jtreg vectorization tests Changes: https://git.openjdk.org/jdk/pull/11687/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11687&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298632 Stats: 470 lines in 23 files changed: 423 ins; 31 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/11687.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11687/head:pull/11687 PR: https://git.openjdk.org/jdk/pull/11687 From kvn at openjdk.org Thu Dec 15 09:10:08 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 09:10:08 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> <6s26yXwhYVTz9ad47H4QO9c7miH4895vpvugZMVuUL8=.f78f83d9-e191-4047-af4a-f54c70940d78@github.com> Message-ID: On Thu, 15 Dec 2022 08:57:56 GMT, Vladimir Kozlov wrote: >> Which is the correct way? `@requires vm.gc == null | vm.gc.Parallel` or `@requires vm.gc.Parallel`? > > `@requires vm.gc.Parallel` > > [JDK-8160088](https://bugs.openjdk.org/browse/JDK-8160088) changed it from original `@requires vm.gc == "null" | vm.gc == "Parallel"` Clarifying: `vm.gc.Parallel` is `true` when ParallelGC is supported and either its flag is specified on the command line, or no GC flag is specified and the GC is selected ergonomically (which means it can be overridden by a flag on the command line). That is equivalent to the original `@requires`. 
------------- PR: https://git.openjdk.org/jdk/pull/11666 From roland at openjdk.org Thu Dec 15 09:18:04 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 15 Dec 2022 09:18:04 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: On Thu, 15 Dec 2022 08:40:18 GMT, Vladimir Kozlov wrote: > I understand that. But it is significant change and I want to push it first and test separately. I filed JDK-8298848 for that. ------------- PR: https://git.openjdk.org/jdk/pull/11666 From fgao at openjdk.org Thu Dec 15 09:39:24 2022 From: fgao at openjdk.org (Fei Gao) Date: Thu, 15 Dec 2022 09:39:24 GMT Subject: RFR: 8298244: AArch64: Optimize vector implementation of AddReduction for floating point [v2] In-Reply-To: References: Message-ID: > The patch optimizes floating-point AddReduction for Vector API on NEON via faddp instructions [1]. > > Take AddReductionVF with 128-bit as an example. > > Here is the assembly code before the patch: > > fadd s18, s17, s16 > mov v19.s[0], v16.s[1] > fadd s18, s18, s19 > mov v19.s[0], v16.s[2] > fadd s18, s18, s19 > mov v19.s[0], v16.s[3] > fadd s18, s18, s19 > > > Here is the assembly code after the patch: > > faddp v19.4s, v16.4s, v16.4s > faddp s18, v19.2s > fadd s18, s18, s17 > > > As we can see, the patch adds all vector elements via faddp instructions and then adds beginning value, which is different from the old code, i.e., adding vector elements sequentially from beginning to end. It helps reduce four instructions for each AddReductionVF. 
> > But it may concern us that the patch will cause precision loss and generate incorrect results if superword vectorizes these java operations, because Java specifies a clear standard about precision for floating-point add reduction, which requires that we must add vector elements sequentially from beginning to end. Fortunately, we can enjoy the benefit but don't need to pay for the precision loss. Here are the reasons: > > 1. [JDK-8275275](https://bugs.openjdk.org/browse/JDK-8275275) disabled AddReductionVF/D for superword on NEON since no direct NEON instructions support them and, consequently, it's not profitable to auto-vectorize them. So, the vector implementation of these two vector nodes is only used by Vector API. > > 2. Vector API relaxes the requirement for floating-point precision of `ADD` [2]. "The result of such an operation is a function both of the input values (vector and mask) as well as the order of the scalar operations applied to combine lane values. In such cases the order is intentionally not defined." "If the platform supports a vector instruction to add or multiply all values in the vector, or if there is some other efficient machine code sequence, then the JVM has the option of generating this machine code." To sum up, Vector API allows us to add all vector elements in an arbitrary order and then add the beginning value, to generate optimal machine code. > > Tier 1~3 passed with no new failures on Linux AArch64 platform. 
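The change in evaluation order described above can be illustrated in scalar code. The following is a minimal C++ sketch, not the HotSpot or NEON implementation: the function names `reduce_sequential` and `reduce_pairwise` are hypothetical, and it assumes IEEE 754 binary32 arithmetic without excess precision. The first function mirrors the old fadd/mov sequence, the second the faddp-based one.

```cpp
#include <cassert>

// Sequential order, as in the old code:
// ((((init + a0) + a1) + a2) + a3)
float reduce_sequential(float init, const float a[4]) {
  float acc = init;
  for (int i = 0; i < 4; i++) {
    acc += a[i];
  }
  return acc;
}

// Pairwise order, as in the faddp-based code:
// ((a0 + a1) + (a2 + a3)) + init
float reduce_pairwise(float init, const float a[4]) {
  float lo = a[0] + a[1];   // first faddp, lanes 0 and 1
  float hi = a[2] + a[3];   // first faddp, lanes 2 and 3
  return (lo + hi) + init;  // second faddp, then the final fadd
}
```

With inputs whose magnitudes differ widely, for example {1e8f, 1.0f, -1e8f, 1.0f} and an initial value of 0, the two orders round at different points and can return different sums. That is acceptable here only because the Vector API leaves the order undefined.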
> > Here is the perf data of jmh benchmark [3] for the patch: > > Benchmark size Mode Cnt Before After Units > Double128Vector.addReduction 1024 thrpt 5 2167.146 2717.873 ops/ms > Float128Vector.addReduction 1024 thrpt 5 1706.253 4890.909 ops/ms > Float64Vector.addReduction 1024 thrpt 5 1907.425 2732.577 ops/ms > > [1] https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--scalar---Floating-point-Add-Pair-of-elements--scalar-- > https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--vector---Floating-point-Add-Pairwise--vector-- > [2] https://docs.oracle.com/en/java/javase/19/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#fp_assoc > [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float128Vector.java#L316 > https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float64Vector.java#L316 > https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Double128Vector.java#L316 Fei Gao has updated the pull request incrementally with one additional commit since the last revision: Update the comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11663/files - new: https://git.openjdk.org/jdk/pull/11663/files/1c91fc6e..87ac1745 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11663&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11663&range=00-01 Stats: 18 lines in 2 files changed: 4 ins; 0 del; 14 mod Patch: https://git.openjdk.org/jdk/pull/11663.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11663/head:pull/11663 PR: https://git.openjdk.org/jdk/pull/11663 From fgao at openjdk.org Thu Dec 15 09:43:08 2022 From: fgao at openjdk.org (Fei Gao) Date: Thu, 15 Dec 2022 09:43:08 
GMT Subject: RFR: 8298244: AArch64: Optimize vector implementation of AddReduction for floating point [v2] In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 09:39:24 GMT, Fei Gao wrote: >> The patch optimizes floating-point AddReduction for Vector API on NEON via faddp instructions [1]. >> >> Take AddReductionVF with 128-bit as an example. >> >> Here is the assembly code before the patch: >> >> fadd s18, s17, s16 >> mov v19.s[0], v16.s[1] >> fadd s18, s18, s19 >> mov v19.s[0], v16.s[2] >> fadd s18, s18, s19 >> mov v19.s[0], v16.s[3] >> fadd s18, s18, s19 >> >> >> Here is the assembly code after the patch: >> >> faddp v19.4s, v16.4s, v16.4s >> faddp s18, v19.2s >> fadd s18, s18, s17 >> >> >> As we can see, the patch adds all vector elements via faddp instructions and then adds beginning value, which is different from the old code, i.e., adding vector elements sequentially from beginning to end. It helps reduce four instructions for each AddReductionVF. >> >> But it may concern us that the patch will cause precision loss and generate incorrect results if superword vectorizes these java operations, because Java specifies a clear standard about precision for floating-point add reduction, which requires that we must add vector elements sequentially from beginning to end. Fortunately, we can enjoy the benefit but don't need to pay for the precision loss. Here are the reasons: >> >> 1. [JDK-8275275](https://bugs.openjdk.org/browse/JDK-8275275) disabled AddReductionVF/D for superword on NEON since no direct NEON instructions support them and, consequently, it's not profitable to auto-vectorize them. So, the vector implementation of these two vector nodes is only used by Vector API. >> >> 2. Vector API relaxes the requirement for floating-point precision of `ADD` [2]. "The result of such an operation is a function both of the input values (vector and mask) as well as the order of the scalar operations applied to combine lane values. 
In such cases the order is intentionally not defined." "If the platform supports a vector instruction to add or multiply all values in the vector, or if there is some other efficient machine code sequence, then the JVM has the option of generating this machine code." To sum up, Vector API allows us to add all vector elements in an arbitrary order and then add the beginning value, to generate optimal machine code. >> >> Tier 1~3 passed with no new failures on Linux AArch64 platform. >> >> Here is the perf data of jmh benchmark [3] for the patch: >> >> Benchmark size Mode Cnt Before After Units >> Double128Vector.addReduction 1024 thrpt 5 2167.146 2717.873 ops/ms >> Float128Vector.addReduction 1024 thrpt 5 1706.253 4890.909 ops/ms >> Float64Vector.addReduction 1024 thrpt 5 1907.425 2732.577 ops/ms >> >> [1] https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--scalar---Floating-point-Add-Pair-of-elements--scalar-- >> https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--vector---Floating-point-Add-Pairwise--vector-- >> [2] https://docs.oracle.com/en/java/javase/19/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#fp_assoc >> [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float128Vector.java#L316 >> https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float64Vector.java#L316 >> https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Double128Vector.java#L316 > > Fei Gao has updated the pull request incrementally with one additional commit since the last revision: > > Update the comments Thanks for your kind review and comments, @theRealAph. I addressed them in the new commit. 
Could you please help review it? Thanks for your time! ------------- PR: https://git.openjdk.org/jdk/pull/11663 From duke at openjdk.org Thu Dec 15 09:53:10 2022 From: duke at openjdk.org (SUN Guoyun) Date: Thu, 15 Dec 2022 09:53:10 GMT Subject: RFR: JDK-8298813: [C2] Converting double to float cause a loss of precision and resulting crypto.aes scores fluctuate In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 02:52:51 GMT, SUN Guoyun wrote: > Hi all, > For C2, convert double to float cause a loss of precision, > >

> ./chaitin.cpp:221
> _high_frequency_lrg = MIN2(double(OPTO_LRG_HIGH_FREQ), _cfg.get_outer_loop_frequency());
> 
> > Here, the type of _high_frequency_lrg is float, so precision may be lost. When it is used: > >

> ./coalesce.cpp:379
> if( lrg._maxfreq >= _phc.high_frequency_lrg() ) {
>    ...
> }
> 
> > Here, the type of lrg._maxfreq is double, so _high_frequency_lrg is converted back to double. Because of the precision already lost in _high_frequency_lrg, the condition here may evaluate to either true or false. > > There are two cases that I tested for SPECjvm2008 crypto.aes. > case 1: >

> //chaitin.cpp:221
> // fcvt.s.d $f0,$f0 #double->float
> d = 16.994714324523816
> f = 16.9947147
> 
> //coalesce.cpp:379
> // fcvt.d.s $f0,$f0 #float->double
> // fcmp.sle.d $fcc2,$f0,$f1
> (gdb) i r fa0
> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.994714736938477}
> (gdb) i r fa1
> fa1 {f = 0x0, d = 0x10} {f = -7.68722312e-24, d = 16.994714324523816}
> 
> > case2: >

> //chaitin.cpp:221
> // fcvt.s.d $f0,$f0
> d = 16.996332681816536
> f = 16.9963322
> 
> //coalesce.cpp
> // fcvt.d.s $f0,$f0
> // fcmp.sle.d $fcc2,$f0,$f1
> (gdb) i r fa0
> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.996332168579102}
> (gdb) i r fa1
> fa1 {f = 0x0, d = 0x10} {f = -1.73570044e-14, d = 16.996332681816536}
> 
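The rounding in the two cases above can be reproduced outside the VM. The following is a minimal C++ sketch, not HotSpot code: the names `through_float` and `threshold_reached` are hypothetical, and it assumes IEEE 754 round-to-nearest conversions. It shows that a double narrowed to float and widened again can land on either side of the original value, so a comparison against the original double can go either way.

```cpp
#include <cassert>

// Hypothetical helper: narrow a double to float and widen it back,
// mirroring the fcvt.s.d / fcvt.d.s pair shown in the two cases.
double through_float(double d) {
  volatile float f = static_cast<float>(d);  // volatile forces the narrowing store
  return static_cast<double>(f);
}

// Hypothetical stand-in for the check in coalesce.cpp: compare a
// frequency kept as double against the same value after it has been
// stored in a float field and read back.
bool threshold_reached(double freq) {
  return freq >= through_float(freq);
}
```

For the value quoted in case 1 the round trip rounds up, so the check is false. For the value in case 2 it rounds down, so the check is true.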
> > The above two cases result in different block generation (case2 can insert new SpillCopyNodes), and the resulting score on crypto.aes fluctuates. > > This is a patch to fix this problem. Please help review it. > > Thanks, > Sun Guoyun The FAIL and ERROR test items have nothing to do with this patch. ------------- PR: https://git.openjdk.org/jdk/pull/11685 From roland at openjdk.org Thu Dec 15 11:04:03 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 15 Dec 2022 11:04:03 GMT Subject: RFR: 8298848: C2: clone all of (CmpP (LoadKlass (AddP down at split if Message-ID: As suggested by Vladimir in https://github.com/openjdk/jdk/pull/11666, this extracts one of the fixes as a separate PR. The bug as described in the above PR is: The crash occurs because an `(If (Bool (CmpP (LoadKlass ..))))` only has a single projection. It lost the other projection because of a `CheckCastPP` that becomes `top`. Initially the pattern is, in pseudo code: if (obj.klass == some_class) { obj = CheckCastPP#1(obj); } `obj` itself is a `CheckCastPP` that's pinned at a dominating if. That dominating if goes through split through phi. The `LoadKlass` for the pseudo code above also has control set to the dominating if being transformed. This results in: if (phi1 == some_class) { obj = CheckCastPP#1(phi2); } with `phi1 = (Phi (LoadKlass obj) (LoadKlass obj))` and `phi2 = (Phi obj obj)` with `obj = (CheckCastPP#2 obj')` `PhiNode::Ideal()` transforms `phi2` into a new `CheckCastPP`: `(CheckCastPP#3 obj' obj')` with control set to the region right above the if in the pseudo code above. There happens to be another `CheckCastPP` at the same control which casts obj' to a narrower type. So the new `CheckCastPP#3` is replaced by that one (because of `ConstraintCastNode::dominating_cast()`) and the pseudo code becomes: if (phi1 == some_class) { obj = CheckCastPP#1(CheckCastPP#4(obj')); } and then: if (phi1 == some_class) { obj = top; } because the types of the 2 `CheckCastPP`s conflict. 
That would be ok if: `phi1 == some_class` would constant fold. It would if the test was: `if (CheckCastPP#4(obj').klass == some_klass) { ` but because of split if, the `(CmpP (LoadKlass ..))` and the `CheckCastPP#1` ended up with 2 different object inputs that then were transformed differently. The fix I propose is to have split if clone the entire: `(Bool (CmpP (LoadKlass (AddP ..))))` down the same way `(Bool (CmpP ..))` is cloned down. After split if, the pseudo code becomes: if (phi.klass == some_class) { obj = CheckCastPP#1(phi); } The bug can't occur because the `CheckCastPP` and` (CmpP (LoadKlass ..))` operate on the same phi input. The change in split_if.cpp implements that. ------------- Commit messages: - test & fix Changes: https://git.openjdk.org/jdk/pull/11689/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11689&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298848 Stats: 556 lines in 3 files changed: 442 ins; 104 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/11689.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11689/head:pull/11689 PR: https://git.openjdk.org/jdk/pull/11689 From roland at openjdk.org Thu Dec 15 11:06:11 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 15 Dec 2022 11:06:11 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v3] In-Reply-To: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: > This PR re-does 6312651 (Compiler should only use verified interface > types for optimization) with a couple fixes I had pushed afterward > (8297556 and 8297343) and fixes for some other issues. > > The trickiest one is a fix for 8297345 (C2: SIGSEGV in > PhaseIdealLoop::push_pinned_nodes_thru_region) for which I added a > test case. 
The crash occurs because a (If (Bool (CmpP (LoadKlass ..)))) > only has a single projection. It lost the other projection because of > a CheckCastPP that becomes top. Initially the pattern is, in pseudo > code,: > > if (obj.klass == some_class) { > obj = CheckCastPP#1(obj); > } > > obj itself is a CheckCastPP that's pinned at a dominating if. That > dominating if goes through split through phi. The LoadKlass for the > pseudo code above also has control set to the dominating if being > transformed. This result in: > > if (phi1 == some_class) { > obj = CheckCastPP#1(phi2); > } > > with phi1 = (Phi (LoadKlass obj) (LoadKlass obj)) and phi2 = (Phi obj obj) > with obj = (CheckCastPP#2 obj') > > PhiNode::Ideal() transforms phi2 into a new CheckCastPP: > (CheckCastPP#3 obj' obj') with control set to the region right above > the if in the pseudo code above. There happens to be another > CheckCastPP at the same control which casts obj' to a narrower > type. So the new CheckCastPP#3 is replaced by that one (because of > ConstraintCastNode::dominating_cast())and pseudo code becomes: > > if (phi1 == some_class) { > obj = CheckCastPP#1(CheckCastPP#4(obj')); > } > > and then: > > if (phi1 == some_class) { > obj = top; > } > > because the types of the 2 CheckCastPPs conflict. That would be ok if: > > phi1 == some_class > > would constant fold. It would if the test was: > > if (CheckCastPP#4(obj').klass == some_klass) { > > but because of split if, the (CmpP (LoadKlass ..)) and the > CheckCastPP#1 ended up with 2 different object inputs that then were > transformed differently. The fix I propose is to have split if clone the entire: > > (Bool (CmpP (LoadKlass (AddP ..)))) > > down the same way (Bool (CmpP ..)) is cloned down. After split if, the > pseudo code becomes: > > if (phi.klass == some_class) { > obj = CheckCastPP#1(phi); > } > > The bug can't occur because the CheckCastPP and (CmpP (LoadKlass ..)) > operate on the same phi input. 
The change in split_if.cpp implements > that. > > The other fixes are: > > - arraycopynode.cpp: a crash happens because dest_offset and > src_offset are the same. The call to transform that results in > src_scale, causes src_offset (and thus dest_offset) to become > dead. The fix is to add a hook node to preserve dest_offset. This is > unrelated to 6312651 but it triggers with that change for some > reason. > > - castnode.cpp: I removed CheckCastPPNode::Identity(), a piece of code > that the change in the handling of interfaces make obsolete and that > I missed in the PR for 6312651. > > - castnode.cpp: the change in CheckCastPPNode::Value() fixes a rare > assert when during CCP, Value() is called with an input raw constant > ptr. > > - type.cpp: a _klass = NULL field in arrays used to indicate only top > or bottom but I changed that so _klass is only guaranteed non null > for basic type arrays. The fix in type.cpp updates a piece of code > that I didn't adapt to the new meaning of _klass = NULL. > > - the other changes are due to StressReflectiveCode. With 6312651, a > CheckCastPP can fold to top if it sees a type for its input that > conflicts with its own type. That wasn't the case before. So if a > type check fails, a CheckCastPP will fold to top and the control > flow branch it's in must die. That doesn't always happen with > StressReflectiveCode: the CheckCastPP folds but not the control flow > path. With ExpandSubTypeCheckAtParseTime on, that's because of a > code path in LoadNode::Value() that's disabled with > StressReflectiveCode. With ExpandSubTypeCheckAtParseTime off, it's > because Compile::static_subtype_check() is always pessimistic with > StressReflectiveCode but it's used by SubTypeCheckNode::sub() to > find when a node can constant fold. 
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: undo split if change ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11666/files - new: https://git.openjdk.org/jdk/pull/11666/files/4b86eb58..5179a664 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11666&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11666&range=01-02 Stats: 556 lines in 3 files changed: 104 ins; 442 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/11666.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11666/head:pull/11666 PR: https://git.openjdk.org/jdk/pull/11666 From roland at openjdk.org Thu Dec 15 11:06:12 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 15 Dec 2022 11:06:12 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: On Thu, 15 Dec 2022 09:15:05 GMT, Roland Westrelin wrote: > > I understand that. But it is significant change and I want to push it first and test separately. > > I filed JDK-8298848 for that. And I removed the split if change from this PR. ------------- PR: https://git.openjdk.org/jdk/pull/11666 From roland at openjdk.org Thu Dec 15 11:15:49 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 15 Dec 2022 11:15:49 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v4] In-Reply-To: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: > This PR re-does 6312651 (Compiler should only use verified interface > types for optimization) with a couple fixes I had pushed afterward > (8297556 and 8297343) and fixes for some other issues. 
> > The trickiest one is a fix for 8297345 (C2: SIGSEGV in > PhaseIdealLoop::push_pinned_nodes_thru_region) for which I added a > test case. The crash occurs because a (If (Bool (CmpP (LoadKlass ..)))) > only has a single projection. It lost the other projection because of > a CheckCastPP that becomes top. Initially the pattern is, in pseudo > code,: > > if (obj.klass == some_class) { > obj = CheckCastPP#1(obj); > } > > obj itself is a CheckCastPP that's pinned at a dominating if. That > dominating if goes through split through phi. The LoadKlass for the > pseudo code above also has control set to the dominating if being > transformed. This result in: > > if (phi1 == some_class) { > obj = CheckCastPP#1(phi2); > } > > with phi1 = (Phi (LoadKlass obj) (LoadKlass obj)) and phi2 = (Phi obj obj) > with obj = (CheckCastPP#2 obj') > > PhiNode::Ideal() transforms phi2 into a new CheckCastPP: > (CheckCastPP#3 obj' obj') with control set to the region right above > the if in the pseudo code above. There happens to be another > CheckCastPP at the same control which casts obj' to a narrower > type. So the new CheckCastPP#3 is replaced by that one (because of > ConstraintCastNode::dominating_cast())and pseudo code becomes: > > if (phi1 == some_class) { > obj = CheckCastPP#1(CheckCastPP#4(obj')); > } > > and then: > > if (phi1 == some_class) { > obj = top; > } > > because the types of the 2 CheckCastPPs conflict. That would be ok if: > > phi1 == some_class > > would constant fold. It would if the test was: > > if (CheckCastPP#4(obj').klass == some_klass) { > > but because of split if, the (CmpP (LoadKlass ..)) and the > CheckCastPP#1 ended up with 2 different object inputs that then were > transformed differently. The fix I propose is to have split if clone the entire: > > (Bool (CmpP (LoadKlass (AddP ..)))) > > down the same way (Bool (CmpP ..)) is cloned down. 
After split if, the > pseudo code becomes: > > if (phi.klass == some_class) { > obj = CheckCastPP#1(phi); > } > > The bug can't occur because the CheckCastPP and (CmpP (LoadKlass ..)) > operate on the same phi input. The change in split_if.cpp implements > that. > > The other fixes are: > > - arraycopynode.cpp: a crash happens because dest_offset and > src_offset are the same. The call to transform that results in > src_scale, causes src_offset (and thus dest_offset) to become > dead. The fix is to add a hook node to preserve dest_offset. This is > unrelated to 6312651 but it triggers with that change for some > reason. > > - castnode.cpp: I removed CheckCastPPNode::Identity(), a piece of code > that the change in the handling of interfaces make obsolete and that > I missed in the PR for 6312651. > > - castnode.cpp: the change in CheckCastPPNode::Value() fixes a rare > assert when during CCP, Value() is called with an input raw constant > ptr. > > - type.cpp: a _klass = NULL field in arrays used to indicate only top > or bottom but I changed that so _klass is only guaranteed non null > for basic type arrays. The fix in type.cpp updates a piece of code > that I didn't adapt to the new meaning of _klass = NULL. > > - the other changes are due to StressReflectiveCode. With 6312651, a > CheckCastPP can fold to top if it sees a type for its input that > conflicts with its own type. That wasn't the case before. So if a > type check fails, a CheckCastPP will fold to top and the control > flow branch it's in must die. That doesn't always happen with > StressReflectiveCode: the CheckCastPP folds but not the control flow > path. With ExpandSubTypeCheckAtParseTime on, that's because of a > code path in LoadNode::Value() that's disabled with > StressReflectiveCode. 
With ExpandSubTypeCheckAtParseTime off, it's > because Compile::static_subtype_check() is always pessimistic with > StressReflectiveCode but it's used by SubTypeCheckNode::sub() to > find when a node can constant fold. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: extract interfaces->length() in ciInstanceKlass.cpp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11666/files - new: https://git.openjdk.org/jdk/pull/11666/files/5179a664..848fc8df Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11666&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11666&range=02-03 Stats: 4 lines in 1 file changed: 1 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/11666.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11666/head:pull/11666 PR: https://git.openjdk.org/jdk/pull/11666 From roland at openjdk.org Thu Dec 15 11:15:52 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 15 Dec 2022 11:15:52 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: On Thu, 15 Dec 2022 00:52:03 GMT, Vladimir Kozlov wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> test fix > > src/hotspot/share/ci/ciInstanceKlass.cpp line 745: > >> 743: Array* interfaces = ik->transitive_interfaces(); >> 744: Arena* arena = CURRENT_ENV->arena(); >> 745: int len = interfaces->length() + (is_interface() ? 1 : 0); > > I think you need to cache `interfaces->length()` in local variable to use in following loop to make sure it is the same value. I concern that an other thread may change it. Done in new commit. 
------------- PR: https://git.openjdk.org/jdk/pull/11666 From aph at openjdk.org Thu Dec 15 12:07:05 2022 From: aph at openjdk.org (Andrew Haley) Date: Thu, 15 Dec 2022 12:07:05 GMT Subject: RFR: 8298244: AArch64: Optimize vector implementation of AddReduction for floating point [v2] In-Reply-To: References: Message-ID: <2yYNfUMM6MTmTXzhDs3C7TfPQw4HuHCHnQBlv4P-YGY=.63f5aa05-21e7-44ff-a0c6-f6f7663238f9@github.com> On Thu, 15 Dec 2022 09:39:24 GMT, Fei Gao wrote: >> The patch optimizes floating-point AddReduction for Vector API on NEON via faddp instructions [1]. >> >> Take AddReductionVF with 128-bit as an example. >> >> Here is the assembly code before the patch: >> >> fadd s18, s17, s16 >> mov v19.s[0], v16.s[1] >> fadd s18, s18, s19 >> mov v19.s[0], v16.s[2] >> fadd s18, s18, s19 >> mov v19.s[0], v16.s[3] >> fadd s18, s18, s19 >> >> >> Here is the assembly code after the patch: >> >> faddp v19.4s, v16.4s, v16.4s >> faddp s18, v19.2s >> fadd s18, s18, s17 >> >> >> As we can see, the patch adds all vector elements via faddp instructions and then adds beginning value, which is different from the old code, i.e., adding vector elements sequentially from beginning to end. It helps reduce four instructions for each AddReductionVF. >> >> But it may concern us that the patch will cause precision loss and generate incorrect results if superword vectorizes these java operations, because Java specifies a clear standard about precision for floating-point add reduction, which requires that we must add vector elements sequentially from beginning to end. Fortunately, we can enjoy the benefit but don't need to pay for the precision loss. Here are the reasons: >> >> 1. [JDK-8275275](https://bugs.openjdk.org/browse/JDK-8275275) disabled AddReductionVF/D for superword on NEON since no direct NEON instructions support them and, consequently, it's not profitable to auto-vectorize them. So, the vector implementation of these two vector nodes is only used by Vector API. >> >> 2. 
Vector API relaxes the requirement for floating-point precision of `ADD` [2]. "The result of such an operation is a function both of the input values (vector and mask) as well as the order of the scalar operations applied to combine lane values. In such cases the order is intentionally not defined." "If the platform supports a vector instruction to add or multiply all values in the vector, or if there is some other efficient machine code sequence, then the JVM has the option of generating this machine code." To sum up, Vector API allows us to add all vector elements in an arbitrary order and then add the beginning value, to generate optimal machine code. >> >> Tier 1~3 passed with no new failures on Linux AArch64 platform. >> >> Here is the perf data of jmh benchmark [3] for the patch: >> >> Benchmark size Mode Cnt Before After Units >> Double128Vector.addReduction 1024 thrpt 5 2167.146 2717.873 ops/ms >> Float128Vector.addReduction 1024 thrpt 5 1706.253 4890.909 ops/ms >> Float64Vector.addReduction 1024 thrpt 5 1907.425 2732.577 ops/ms >> >> [1] https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--scalar---Floating-point-Add-Pair-of-elements--scalar-- >> https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--vector---Floating-point-Add-Pairwise--vector-- >> [2] https://docs.oracle.com/en/java/javase/19/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#fp_assoc >> [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float128Vector.java#L316 >> https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float64Vector.java#L316 >> https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Double128Vector.java#L316 > > 
Fei Gao has updated the pull request incrementally with one additional commit since the last revision: > > Update the comments Marked as reviewed by aph (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11663 From jvernee at openjdk.org Thu Dec 15 12:42:10 2022 From: jvernee at openjdk.org (Jorn Vernee) Date: Thu, 15 Dec 2022 12:42:10 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v2] In-Reply-To: <7udKKpQ1XY438VTwGKXFuUoiRC4-PRiEQOxLbpEdZV8=.a4b60b9f-90f7-4dde-aaa9-90d3658f1ded@github.com> References: <7udKKpQ1XY438VTwGKXFuUoiRC4-PRiEQOxLbpEdZV8=.a4b60b9f-90f7-4dde-aaa9-90d3658f1ded@github.com> Message-ID: On Mon, 12 Dec 2022 17:56:11 GMT, Kim Barrett wrote: >> Please review this change to construction and copying of the Relocation and >> RelocationHolder classes, to eliminate some questionable C++ usage. >> >> The array type for RelocationHandle::_relocbuf is changed from void* to char, >> because using a char array for raw memory is countenanced by the standard, >> while not so much for an array of void*. The desired alignment is maintained >> via a union, since using alignas is not (yet) permitted in HotSpot code. >> >> There is also now a comment discussing the use of _relocbuf in more detail, >> including some areas of continued sketchiness wrto standard conformance and >> reliance on implementation dependent behavior. >> >> No longer use trivial copy and assignment for RelocationHolder, since that >> isn't technically correct. The Relocation in the holder is not trivially >> copyable, since it is polymorphic. It seemed to work in practice with the >> supported compilers, but we shouldn't (and don't need to) rely on it. Instead >> we have a new virtual function Relocation::copy_into that copies the most >> derived object into the holder's _relocbuf using placement new. 
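The `copy_into` approach described above can be illustrated with a small, self-contained sketch. This is not the HotSpot code: `ToyReloc`, `ToyDataReloc`, `ToyHolder`, `emplace_copy` and the buffer size are hypothetical names and values chosen for illustration. It only shows the general technique of copying the most derived object into a raw buffer through a virtual function and placement new, with the size and destructor assumptions checked at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <type_traits>

class ToyHolder;

// Hypothetical stand-in for Relocation: polymorphic, but without a virtual
// destructor, so every class in the hierarchy stays trivially destructible.
class ToyReloc {
 public:
  virtual int kind() const { return 0; }
  // Copies the most derived object into the holder's buffer.
  virtual void copy_into(ToyHolder& holder) const;
};

// Hypothetical stand-in for RelocationHolder.
class ToyHolder {
 public:
  static constexpr std::size_t kBufSize = 4 * sizeof(void*);

  ToyHolder() { ::new (_buf) ToyReloc(); }
  // Copying goes through the virtual copy_into, never through a raw
  // bytewise copy of the polymorphic object.
  ToyHolder(const ToyHolder& other) { other.reloc()->copy_into(*this); }
  ToyHolder& operator=(const ToyHolder& other) {
    if (this != &other) other.reloc()->copy_into(*this);
    return *this;
  }

  // Assumes the base subobject lives at the start of the buffer, which holds
  // for single inheritance on common ABIs but is not required by the standard.
  ToyReloc* reloc() const {
    return reinterpret_cast<ToyReloc*>(const_cast<char*>(_buf));
  }

  template <typename T>
  void emplace_copy(const T& src) {
    static_assert(std::is_base_of<ToyReloc, T>::value, "not a relocation type");
    static_assert(sizeof(T) <= kBufSize, "relocation does not fit in buffer");
    static_assert(std::is_trivially_destructible<T>::value,
                  "the old object is overwritten without a destructor call");
    ::new (_buf) T(src);
  }

 private:
  // char array for raw storage; the union member only provides alignment.
  union {
    char _buf[kBufSize];
    void* _align;
  };
};

class ToyDataReloc : public ToyReloc {
 public:
  explicit ToyDataReloc(int offset) : _offset(offset) {}
  int kind() const override { return 1; }
  int offset() const { return _offset; }
  void copy_into(ToyHolder& holder) const override;

 private:
  int _offset;
};

void ToyReloc::copy_into(ToyHolder& holder) const { holder.emplace_copy(*this); }
void ToyDataReloc::copy_into(ToyHolder& holder) const { holder.emplace_copy(*this); }
```

In this sketch, copying a `ToyHolder` that contains a `ToyDataReloc` yields another holder whose `reloc()->kind()` is 1, because each class overrides `copy_into` and therefore copies itself with its own static type.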
>> >> Eliminated the implicit conversion constructor from Relocation to holder that >> wordwise copied (to possibly beyond the end of) the Relocation into the >> holder's _relocbuf. We could have implemented this more carefully with the >> new approach (using copy_into), but we don't actually need this conversion. >> The only use of it was actually a wasted copy (in assembler_x86.cpp). >> >> Eliminated the use of placement new syntax via operator new with a holder >> argument to copy a Resource object into a holder. This included runtime >> verification that the size of the object is not greater than the size of >> _relocbuf; we now do corresponding verification at compile-time. This also >> included an incorrect attempt at a runtime check that the Relocation base >> class would be at the same address as the derived class being constructed; we >> now perform that check correctly. We also discuss in a comment the layout >> assumption being made (that isn't required by the standard but is provided by >> all supported compilers), and what to do if we encounter a compiler that >> behaves differently. >> >> Eliminated the idiom of making a default-constructed holder and then >> overwriting its held relocation with a newly constructed one, using the >> aforementioned (and eliminated) operator new. Instead, RelocationHolder now has a >> factory function template (construct) for creating holders with a given >> relocation type, constructed using provided arguments. (The arguments are taken >> as const-ref rather than using perfect forwarding, as the tools for the latter >> are not (yet) approved for use in HotSpot. Using const-ref is good enough in >> this case.) >> >> Describe and verify other assumptions being made, such as all Relocation >> classes being trivially destructible. >> >> Testing: >> mach5 tier1-5 >> >> Future work: >> >> * RelocationHolder::reloc isn't const-correct. Making it so will require >> adjustment of some callers. 
I'll follow up with an RFE to address this. >> >> * Relocation classes have many virtual function overrides that are unmarked. >> I'll follow up with an RFE to add "override" specifiers. >> >> Potential issue: The removal of RelocationHolder(Relocation*) might not work >> for some platforms. I've tested on platforms supported by Oracle (where there >> was only one (mistaken) use). There might be uses by other platforms. > > Kim Barrett has updated the pull request incrementally with one additional commit since the last revision: > > blank lines in include blocks src/hotspot/share/code/relocInfo.hpp line 528: > 526: Relocation* reloc = ctor(_relocbuf); > 527: check_reloc_placement(reloc); > 528: } It seems that the current code is like this, rather than using `typename T, typename...Args` (where `T` is the relocation type), because it is not possible to specify explicit type arguments when calling a constructor template? (So you can't specify `T`). But, you could truck in the type argument on `Construct`, and have it be inferred. So, this could become: Suggestion: template<typename T> struct Construct {}; // Tag for selecting this constructor. template<typename T, typename... Args> RelocationHolder(Construct<T>, const Args&... args) { check_reloc_type<T>(); Relocation* reloc = ::new (_relocbuf) T(args...); check_reloc_placement(reloc); } Callers provide e.g. `Construct<T>()` in `construct` or `Construct<Relocation>()` in the default constructor. `construct` can just forward the rest of the args. This seems a little simpler (by avoiding the use of lambdas, and making it more locally obvious that the constructor is doing a placement new into `_relocbuf`), so maybe this is preferable? src/hotspot/share/code/relocInfo.hpp line 544: > 542: return ::new (p) T(args...); > 543: }); > 544: } This would become Suggestion: static RelocationHolder construct(const Args&... 
args) { return RelocationHolder(Construct<T>(), args...); } src/hotspot/share/code/relocInfo.hpp line 859: > 857: // We never heap allocate a Relocation, so never delete through a base pointer. > 858: // RelocationHolder depends on (and verifies) the destructor for all relocation > 859: // types is trivial, so can't be virtual. Should this be: Suggestion: // types is trivial, so can be non-virtual. ? src/hotspot/share/code/relocInfo.hpp line 892: > 890: inline RelocationHolder::RelocationHolder() : > 891: RelocationHolder(Construct(), [&] (void* p) { return ::new (p) Relocation(); }) > 892: {} And this would become Suggestion: inline RelocationHolder::RelocationHolder() : RelocationHolder(Construct<Relocation>()) {} ------------- PR: https://git.openjdk.org/jdk/pull/11618 From epeter at openjdk.org Thu Dec 15 13:09:16 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 15 Dec 2022 13:09:16 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears [v2] In-Reply-To: References: Message-ID: On Wed, 14 Dec 2022 06:21:47 GMT, Emanuel Peter wrote: >> We recently removed `Opaque2` nodes in [JDK-8294540](https://bugs.openjdk.org/browse/JDK-8294540). `Opaque2` nodes prevented some optimizations during loop-opts. The original idea was to prevent the use of both the un-incremented and incremented value of a loop phi after the loop, to reduce register pressure. But `Opaque2` also had the effect that the limit of the loop would not be optimized, which meant that the iv-value (entry value of phi) in post loop would never collapse (either to constant or TOP), but always remain a range. Now that `Opaque2` is gone, it can happen that when the main-loop disappears, the limit collapses. The zero-trip guard of the post-loop would be false, but does not collapse because of the `OpaqueZeroTripGuard`. The post-loop can half-collapse, leaving an inconsistent graph below the zero-trip guard if. 
>> >> **Solution** >> Have `OpaqueZeroTripGuardMainLoop` for main loop zero-trip guard, and `OpaqueZeroTripGuardPostLoop` for post-loop zero trip guard. Let `OpaqueZeroTripGuardPostLoop` remove itself once it cannot find the main-loop above it. We have these opaque nodes there to prevent collapsing of the zero-trip guards as long as the limits may still change, but after the main-loop is removed, no unrolling is done anymore, so the limit of the post-loop cannot change anymore, hence it is safe to remove the opaque node there. >> >> An alternative approach was to let the main-loop remove the opaque node of the post-loop's zero-trip guard. But that does not work reliably, as the main-loop may get removed during PhaseCCP, and the main-loop is simply removed as "useless". Hence the LoopNode of the main-loop does not have a chance to detect its death during IGVN. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > remove swap files, how did I ever commit them? We discussed it in the office with Christian and Tobias: The issue is that the post-loop LoopNode is removed during CCP. The expectation so far was that they only are removed in IGVN, where in RegionNode::Ideal we catch the death of LoopNodes, and remove their `OpaqueZeroTripGuard` nodes above them. But during CCP, we simply `remove useless` the LoopNode, and its opaque node stays behind. This leads to an inconsistent trip-guard if with only one projection, as the other projection would have led down to the LoopNode. We propose this point fix for JDK20: We should check if's that sit on the boundary and have one projection removed: if they have an `OpaqueZeroTripGuard` (if they are a zero-trip-guard with opaque node), then we hack the condition value, which makes the if die properly during IGVN after CCP. This is in parallel to the removal logic in `RegionNode::Ideal`. 
For JDK21 I will also file an RFE to try removing the `OpaqueZeroTripGuardPostLoop` completely (never even insert it). ------------- PR: https://git.openjdk.org/jdk20/pull/22 From roland at openjdk.org Thu Dec 15 13:21:07 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 15 Dec 2022 13:21:07 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears [v2] In-Reply-To: References: Message-ID: <_nSXuuaXs6pXXcRZoGgm6O8A1b6l8L6nGKkaSp0qcGg=.e108b4c8-0166-4928-a510-dd95bbf6df1f@github.com> On Thu, 15 Dec 2022 13:06:39 GMT, Emanuel Peter wrote: > For JDK21 I will also file an RFE to try removing the `OpaqueZeroTripGuardPostLoop` completely (never even insert it). FWIW, I don't think it's the right way to proceed. Having `OpaqueZeroTripGuardPostLoop` guarantees the zero trip guard for the post loop doesn't go away while we're not done with that loop. It's quite possible the zero trip guard without `OpaqueZeroTripGuardPostLoop` can't fold away either because C2 in its current state can't prove anything about it or because other opaque nodes hide enough type information. It's also possible it can fold away but current testing won't catch it. All sort of questions we don't have to think about as long as `OpaqueZeroTripGuardPostLoop` is here: having `OpaqueZeroTripGuardPostLoop` makes reasoning about the loops easier at the very least. That seems like a precious thing that we would give up only because there's a corner case that's bothering us today. ------------- PR: https://git.openjdk.org/jdk20/pull/22 From epeter at openjdk.org Thu Dec 15 13:30:30 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 15 Dec 2022 13:30:30 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears [v3] In-Reply-To: References: Message-ID: > **Working on new fix... Will update this later** > > We recently removed `Opaque2` nodes in [JDK-8294540](https://bugs.openjdk.org/browse/JDK-8294540). 
`Opaque2` nodes prevented some optimizations during loop-opts. The original idea was to prevent the use of both the un-incremented and incremented value of a loop phi after the loop, to reduce register pressure. But `Opaque2` also had the effect that the limit of the loop would not be optimized, which meant that the iv-value (entry value of phi) in post loop would never collapse (either to constant or TOP), but always remain a range. Now that `Opaque2` is gone, it can happen that when the main-loop disappears, the limit collapses. The zero-trip guard of the post-loop would be false, but does not collapse because of the `OpaqueZeroTripGuard`. The post-loop can half-collapse, leaving an inconsistent graph below the zero-trip guard if. > > **Solution** > Have `OpaqueZeroTripGuardMainLoop` for main loop zero-trip guard, and `OpaqueZeroTripGuardPostLoop` for post-loop zero trip guard. Let `OpaqueZeroTripGuardPostLoop` remove itself once it cannot find the main-loop above it. We have these opaque nodes there to prevent collapsing of the zero-trip guards as long as the limits may still change, but after the main-loop is removed, no unrolling is done anymore, so the limit of the post-loop cannot change anymore, hence it is safe to remove the opaque node there. > > An alternative approach was to let the main-loop remove the opaque node of the post-loop's zero-trip guard. But that does not work reliably, as the main-loop may get removed during PhaseCCP, and the main-loop is simply removed as "useless". Hence the LoopNode of the main-loop does not have a chance to detect its death during IGVN. Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: - removing old fix, will push new fix later - Merge branch 'master' into JDK-8298176 - remove swap files, how did I ever commit them? 
- tab to whitespace - code style improvements - 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears ------------- Changes: - all: https://git.openjdk.org/jdk20/pull/22/files - new: https://git.openjdk.org/jdk20/pull/22/files/b83c995e..d70b385d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk20&pr=22&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk20&pr=22&range=01-02 Stats: 282 lines in 20 files changed: 52 ins; 164 del; 66 mod Patch: https://git.openjdk.org/jdk20/pull/22.diff Fetch: git fetch https://git.openjdk.org/jdk20 pull/22/head:pull/22 PR: https://git.openjdk.org/jdk20/pull/22 From bulasevich at openjdk.org Thu Dec 15 13:51:45 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 15 Dec 2022 13:51:45 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v18] In-Reply-To: References: Message-ID: > The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. 
> > Testing: jtreg hotspot&jdk, Renaissance benchmarks Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: cleanup, rename and some testing ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10025/files - new: https://git.openjdk.org/jdk/pull/10025/files/3b9f84e0..b0323003 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=17 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=16-17 Stats: 30 lines in 3 files changed: 17 ins; 0 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/10025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10025/head:pull/10025 PR: https://git.openjdk.org/jdk/pull/10025 From roland at openjdk.org Thu Dec 15 14:02:14 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 15 Dec 2022 14:02:14 GMT Subject: RFR: 8297582: C2: very slow compilation due to type system verification code [v3] In-Reply-To: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> References: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> Message-ID: > The problem here is that for arrays, verification code computes a > number of meet operations that grows exponentially with the number of > dimensions while the number of unique meet operations that need to be > computed is a linear function of the number of dimensions: > > > // With verification code, the meet of A and B causes the computation of: > // 1- meet(A, B) > // 2- meet(B, A) > // 3- meet(dual(meet(A, B)), dual(A)) > // 4- meet(dual(meet(A, B)), dual(B)) > // 5- meet(dual(A), dual(B)) > // 6- meet(dual(B), dual(A)) > // 7- meet(dual(meet(dual(A), dual(B))), A) > // 8- meet(dual(meet(dual(A), dual(B))), B) > // > // In addition the meet of A[] and B[] requires the computation of the meet of A and B. 
> // > // The meet of A[] and B[] triggers the computation of: > // 1- meet(A[], B[]) > // 1.1- meet(A, B) > // 1.2- meet(B, A) > // 1.3- meet(dual(meet(A, B)), dual(A)) > // 1.4- meet(dual(meet(A, B)), dual(B)) > // 1.5- meet(dual(A), dual(B)) > // 1.6- meet(dual(B), dual(A)) > // 1.7- meet(dual(meet(dual(A), dual(B))), A) > // 1.8- meet(dual(meet(dual(A), dual(B))), B) > // 2- meet(B[], A[]) > // 2.1- meet(B, A) = 1.2 > // 2.2- meet(A, B) = 1.1 > // 2.3- meet(dual(meet(B, A)), dual(B)) = 1.4 > // 2.4- meet(dual(meet(B, A)), dual(A)) = 1.3 > // 2.5- meet(dual(B), dual(A)) = 1.6 > // 2.6- meet(dual(A), dual(B)) = 1.5 > // 2.7- meet(dual(meet(dual(B), dual(A))), B) = 1.8 > // 2.8- meet(dual(meet(dual(B), dual(A))), A) = 1.7 > // etc. > > > > There are a lot of redundant computations being performed. The fix I > propose is simply to cache the result of meet computations. So when > the type system code is called to compute, for instance, the meet of > A[][] and B[][], the cache starts empty. Then as the meet computations > proceed, the cache is filled with meet result for meet of A[] and B[], > meet of A and B etc. Once the type system code returns with the result > for A[][] and B[][], the cache is cleared. > > With this, the test case I added goes from "never seems to finish" > to "complete in no time". 
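The effect of caching meet results can be illustrated with a toy model. The sketch below is not the C2 type system: `toy_meet`, `toy_meet_cached`, `MeetCache` and `MeetStats` are hypothetical names, a "type" is reduced to an integer id plus a number of array dimensions, and the meet of two ids is simply their maximum. The only thing it tries to show is that, when each array meet also computes the element meet in the swapped argument order (as the verification code does), the number of computed meets grows exponentially with the number of dimensions without a cache and linearly with one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>

// Hypothetical toy model, not HotSpot code.
using MeetKey = std::tuple<int, int, int>;  // (lhs id, rhs id, dimensions)
using MeetCache = std::map<MeetKey, int>;

struct MeetStats {
  std::uint64_t computed = 0;  // meets actually computed (cache misses)
};

// Meet of lhs[]...[] and rhs[]...[] with `dims` array dimensions.
// To mimic verification, each array meet computes the element meet in both
// argument orders and checks that the results agree.
// Pass cache == nullptr to disable caching.
inline int toy_meet(int lhs, int rhs, int dims, MeetCache* cache, MeetStats& stats) {
  const MeetKey key{lhs, rhs, dims};
  if (cache != nullptr) {
    auto it = cache->find(key);
    if (it != cache->end()) {
      return it->second;  // already computed during this top-level call
    }
  }
  ++stats.computed;
  int result = std::max(lhs, rhs);
  if (dims > 0) {
    const int forward = toy_meet(lhs, rhs, dims - 1, cache, stats);
    const int swapped = toy_meet(rhs, lhs, dims - 1, cache, stats);
    assert(forward == swapped);  // symmetry check, as verification would do
    (void)swapped;
    result = forward;
  }
  if (cache != nullptr) {
    (*cache)[key] = result;
  }
  return result;
}

// Top-level entry point: the cache starts empty and is discarded on return.
inline int toy_meet_cached(int lhs, int rhs, int dims, MeetStats& stats) {
  MeetCache cache;
  return toy_meet(lhs, rhs, dims, &cache, stats);
}
```

For 10 dimensions, this toy model computes 2047 meets without the cache and 21 with it. The cache is scoped to a single top-level call, which mirrors the description above of a cache that starts empty and is cleared once the outermost meet returns.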
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: Vladimir's review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11673/files - new: https://git.openjdk.org/jdk/pull/11673/files/83529981..0794dacb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11673&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11673&range=01-02 Stats: 209 lines in 4 files changed: 92 ins; 70 del; 47 mod Patch: https://git.openjdk.org/jdk/pull/11673.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11673/head:pull/11673 PR: https://git.openjdk.org/jdk/pull/11673 From roland at openjdk.org Thu Dec 15 14:02:15 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 15 Dec 2022 14:02:15 GMT Subject: RFR: 8297582: C2: very slow compilation due to type system verification code [v2] In-Reply-To: References: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> Message-ID: On Wed, 14 Dec 2022 13:56:49 GMT, Roland Westrelin wrote: >> The problem here is that for arrays, verification code computes a >> number of meet operations that grows exponentially with the number of >> dimensions while the number of unique meet operations that need to be >> computed is a linear function of the number of dimensions: >> >> >> // With verification code, the meet of A and B causes the computation of: >> // 1- meet(A, B) >> // 2- meet(B, A) >> // 3- meet(dual(meet(A, B)), dual(A)) >> // 4- meet(dual(meet(A, B)), dual(B)) >> // 5- meet(dual(A), dual(B)) >> // 6- meet(dual(B), dual(A)) >> // 7- meet(dual(meet(dual(A), dual(B))), A) >> // 8- meet(dual(meet(dual(A), dual(B))), B) >> // >> // In addition the meet of A[] and B[] requires the computation of the meet of A and B. 
>> // >> // The meet of A[] and B[] triggers the computation of: >> // 1- meet(A[], B[]) >> // 1.1- meet(A, B) >> // 1.2- meet(B, A) >> // 1.3- meet(dual(meet(A, B)), dual(A)) >> // 1.4- meet(dual(meet(A, B)), dual(B)) >> // 1.5- meet(dual(A), dual(B)) >> // 1.6- meet(dual(B), dual(A)) >> // 1.7- meet(dual(meet(dual(A), dual(B))), A) >> // 1.8- meet(dual(meet(dual(A), dual(B))), B) >> // 2- meet(B[], A[]) >> // 2.1- meet(B, A) = 1.2 >> // 2.2- meet(A, B) = 1.1 >> // 2.3- meet(dual(meet(B, A)), dual(B)) = 1.4 >> // 2.4- meet(dual(meet(B, A)), dual(A)) = 1.3 >> // 2.5- meet(dual(B), dual(A)) = 1.6 >> // 2.6- meet(dual(A), dual(B)) = 1.5 >> // 2.7- meet(dual(meet(dual(B), dual(A))), B) = 1.8 >> // 2.8- meet(dual(meet(dual(B), dual(A))), A) = 1.7 >> // etc. >> >> >> >> There are a lot of redundant computations being performed. The fix I >> propose is simply to cache the result of meet computations. So when >> the type system code is called to compute, for instance, the meet of >> A[][] and B[][], the cache starts empty. Then as the meet computations >> proceed, the cache is filled with meet result for meet of A[] and B[], >> meet of A and B etc. Once the type system code returns with the result >> for A[][] and B[][], the cache is cleared. >> >> With this, the test case I added goes from "never seems to finish" >> to "complete in no time". > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > build fix @vnkozlov I think I addressed all your comments. 
------------- PR: https://git.openjdk.org/jdk/pull/11673 From bulasevich at openjdk.org Thu Dec 15 15:47:13 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 15 Dec 2022 15:47:13 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v17] In-Reply-To: References: Message-ID: On Wed, 14 Dec 2022 13:27:50 GMT, Evgeny Astigeevich wrote: >> Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: >> >> a few minor changes > > src/hotspot/share/code/compressedStream.cpp line 119: > >> 117: bool CompressedSparseDataReadStream::read_zero() { >> 118: if (_buffer[_position] & (1 << (7 - _bit_position))) { >> 119: return 0; // not a zero data > > As the return type is `bool`, let's use `false` and `true` instead of `0` and `1`. ok ------------- PR: https://git.openjdk.org/jdk/pull/10025 From bulasevich at openjdk.org Thu Dec 15 15:52:19 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 15 Dec 2022 15:52:19 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v17] In-Reply-To: References: Message-ID: On Wed, 14 Dec 2022 14:02:44 GMT, Evgeny Astigeevich wrote: >> Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: >> >> a few minor changes > > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CompressedSparseDataReadStream.java line 40: > >> 38: } >> 39: >> 40: public byte readByteImpl() { > > Should it be private? ok > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CompressedSparseDataReadStream.java line 55: > >> 53: byte b = readByteImpl(); >> 54: int result = b & 0x3f; >> 55: for (int i = 0; (0 != ((i == 0) ? (b & 0x40) : (b & 0x80))); i++) { > > It is difficult to read this. 
Could we rewrite it into: > > int result = b & 0x3f; > if ((b & 0x40) != 0) { > b = readByteImpl(); > result |= (b & 0x7f) << 6; > for (int i = 1; (b & 0x80) != 0; ++i) { > b = readByteImpl(); > result |= ((b & 0x7f) << (6 + 7 * i)); > } > } > > > BTW, we can simplify the loop in the `CompressedSparseDataReadStream::read_int()` in the same way. If you don't mind, I would prefer shorter version. for (int i = 0; (0 != ((i == 0) ? (b & 0x40) : (b & 0x80))); i++) { b = readByteImpl(); result |= ((b & 0x7f) << (6 + 7 * i)); } -> int result = b & 0x3f; if ((b & 0x40) != 0) { b = readByteImpl(); result |= (b & 0x7f) << 6; for (int i = 1; (b & 0x80) != 0; ++i) { b = readByteImpl(); result |= ((b & 0x7f) << (6 + 7 * i)); } } > test/hotspot/jtreg/serviceability/sa/TestCompressedSparseDataReadStream.java line 51: > >> 49: assertEquals(in.readInt(), 0); // zero bit -> 0 >> 50: in.setPosition(2); >> 51: assertEquals(in.readInt(), 48); // 0xf000 -> 48 > > Should we test `readBoolean` and `readByte`? I understand they are implemented with `readInt`. They might incorrectly convert data returned by `readInt`. ok ------------- PR: https://git.openjdk.org/jdk/pull/10025 From ecaspole at openjdk.org Thu Dec 15 16:13:48 2022 From: ecaspole at openjdk.org (Eric Caspole) Date: Thu, 15 Dec 2022 16:13:48 GMT Subject: RFR: 8298809: Clean up vm/compiler/InterfaceCalls JMH Message-ID: I removed some confusing less effective cases and modified and renamed some to cover what seem like the most useful cases with 1+ types and 1+ interfaces implemented in those types. Here is an example run:

Benchmark                        Mode  Cnt  Score   Error  Units
InterfaceCalls.test1stInt2Types  avgt   12  2.196 ± 0.022  ns/op
InterfaceCalls.test1stInt3Types  avgt   12  8.259 ± 0.045  ns/op
InterfaceCalls.test1stInt5Types  avgt   12  8.279 ± 0.024  ns/op
InterfaceCalls.test2ndInt2Types  avgt   12  2.467 ± 0.023  ns/op
InterfaceCalls.test2ndInt3Types  avgt   12  9.287 ± 0.032  ns/op
InterfaceCalls.test2ndInt5Types  avgt   12  9.343 ± 0.027  ns/op
InterfaceCalls.testMonomorphic   avgt   12  1.440 ± 0.031  ns/op

------------- Commit messages: - 8298809: Clean up vm/compiler/InterfaceCalls JMH Changes: https://git.openjdk.org/jdk/pull/11696/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11696&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298809 Stats: 174 lines in 1 file changed: 5 ins; 123 del; 46 mod Patch: https://git.openjdk.org/jdk/pull/11696.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11696/head:pull/11696 PR: https://git.openjdk.org/jdk/pull/11696 From epeter at openjdk.org Thu Dec 15 16:28:27 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 15 Dec 2022 16:28:27 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears [v4] In-Reply-To: References: Message-ID: > **Working on new fix... Will update this later** > > We recently removed `Opaque2` nodes in [JDK-8294540](https://bugs.openjdk.org/browse/JDK-8294540). `Opaque2` nodes prevented some optimizations during loop-opts. The original idea was to prevent the use of both the un-incremented and incremented value of a loop phi after the loop, to reduce register pressure. But `Opaque2` also had the effect that the limit of the loop would not be optimized, which meant that the iv-value (entry value of phi) in post loop would never collapse (either to constant or TOP), but always remain a range. Now that `Opaque2` is gone, it can happen that when the main-loop disappears, the limit collapses. The zero-trip guard of the post-loop would be false, but does not collapse because of the `OpaqueZeroTripGuard`. The post-loop can half-collapse, leaving an inconsistent graph below the zero-trip guard if. > > **Solution** > Have `OpaqueZeroTripGuardMainLoop` for main loop zero-trip guard, and `OpaqueZeroTripGuardPostLoop` for post-loop zero trip guard. Let `OpaqueZeroTripGuardPostLoop` remove itself once it cannot find the main-loop above it. 
We have these opaque nodes there to prevent collapsing of the zero-trip guards as long as the limits may still change, but after the main-loop is removed, no unrolling is done anymore, so the limit of the post-loop cannot change anymore, hence it is safe to remove the opaque node there. > > An alternative approach was to let the main-loop remove the opaque node of the post-loop's zero-trip guard. But that does not work reliably, as the main-loop may get removed during PhaseCCP, and the main-loop is simply removed as "useless". Hence the LoopNode of the main-loop does not have a chance to detect its death during IGVN. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: new fix, where we remove opaque nodes when loopnode dies during CCP ------------- Changes: - all: https://git.openjdk.org/jdk20/pull/22/files - new: https://git.openjdk.org/jdk20/pull/22/files/d70b385d..792792e9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk20&pr=22&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk20&pr=22&range=02-03 Stats: 85 lines in 7 files changed: 39 ins; 31 del; 15 mod Patch: https://git.openjdk.org/jdk20/pull/22.diff Fetch: git fetch https://git.openjdk.org/jdk20 pull/22/head:pull/22 PR: https://git.openjdk.org/jdk20/pull/22 From eastigeevich at openjdk.org Thu Dec 15 17:14:12 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 15 Dec 2022 17:14:12 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v18] In-Reply-To: References: Message-ID: <92P-qKSCmugCjtn_Kcr7GreN983K5A5cPvM_omcq5bg=.645f742f-5ed1-46db-97dc-056a87a5d2eb@github.com> On Thu, 15 Dec 2022 13:51:45 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). 
Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > cleanup, rename and some testing I have no other comments. Lgtm. ------------- Marked as reviewed by eastigeevich (Committer). PR: https://git.openjdk.org/jdk/pull/10025 From roland at openjdk.org Thu Dec 15 17:59:17 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 15 Dec 2022 17:59:17 GMT Subject: RFR: 8297724: Loop strip mining prevents some empty loops from being eliminated Message-ID: <0h64UphIm15r5MArLZeVhafwWGb6iXMsNreW9lqai3A=.8d4955e7-cfc9-4e6b-a4b4-72bd1f051c9f@github.com> When an empty loop is found, it's removed and as a consequence the outer strip mine loop and the safepoint that it contains are also removed. A counted loop is empty if it has the minimum number of nodes that a well formed counted loop contains. In some cases, the loop has extra nodes and the safepoint in the outer loop is the only node that keeps those extra nodes alive. If the safepoint was to be removed, then the counted loop would have the minimum number of nodes and be considered empty. But the safepoint can't be removed until the loop is considered empty which only happens if it has the minimum of nodes. As a result, these loops are not removed. Note that now that the loop strip mining loop nest is constructed even if UseCountedLoopSafepoints is false, there's a regression where some loops used to be removed as empty before but not anymore. The fix I propose is to extend IdealLoopTree::do_remove_empty_loop() so it handles those cases. If it encounters a loop with no flow control in the loop body but a number of nodes greater than the minimum number of nodes, it starts from the extra nodes in the loop body and follows uses until it finds a side effect, ignoring the safepoint of the outer loop. 
If it finds none, then the extra nodes can be removed and the loop is empty. This also works if the extra nodes are kept alive by the safepoints of 2 different counted loops and one can only be proven empty if the other one is as well (and the other one proven empty if the first one is) and should work even if there are more than 2 nodes involved. ------------- Commit messages: - whitespaces - test & fix Changes: https://git.openjdk.org/jdk/pull/11699/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11699&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297724 Stats: 303 lines in 3 files changed: 295 ins; 4 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/11699.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11699/head:pull/11699 PR: https://git.openjdk.org/jdk/pull/11699 From kvn at openjdk.org Thu Dec 15 18:08:08 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 18:08:08 GMT Subject: RFR: 8298632: [TESTBUG] Add IR checks in jtreg vectorization tests In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 09:01:42 GMT, Pengfei Li wrote: > In JDK-8183390, we introduced a new auto-vectorization testing framework in hotspot jtreg tests. The new framework provides us a much simpler way to add new test cases of auto-vectorizable loops. But the previous patch just added a check of the correctness of the result after vectorization. There is no check about whether the code is vectorized or not. > > This patch adds IR checks to verify C2's vectorization ability on test cases inside this framework. With this patch, each test method annotated with `@Test` is verified in two ways. First, it's invoked twice and the return results from the interpreter and C2 compiled code are compared. Second, the count of expected vector IR is checked by the IR framework if the test method has IR rule annotation. > > Ideally, we should check IR rules on all platforms. 
But in practice, the vectorization ability can be quite different on different platforms, or different generations of CPUs of one platform. So in this patch, we only add vectorizable checks for AArch64. Checks for other platforms (such as x86) can still be added later with more CPU feature conditions. We also add some negative rules (or in-vectorizable rules) with `@IR(failOn=...` on cases that should not be vectorized on any platform. > > A few more test cases are added within this patch as well. > > We tested the new IR rules on below kinds of CPUs. > - AArch64 w/ 512-bit SVE > - AArch64 w/ 128-bit SVE > - AArch64 w/o SVE (NEON only) > - x86 Good work. But I think you should add x86 IR testing too. It could be done by adding `"sse2", "true"` to `applyIfCPUFeatureOr`. Or high features (avx2, avx512) if required. You should also consider reducing time to run these tests since you added additional testing. May be lower value of `SIZE`. ------------- Changes requested by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11687 From kvn at openjdk.org Thu Dec 15 18:59:07 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 18:59:07 GMT Subject: RFR: 8298848: C2: clone all of (CmpP (LoadKlass (AddP down at split if In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 10:53:37 GMT, Roland Westrelin wrote: > As suggested by Vladimir in: > https://github.com/openjdk/jdk/pull/11666 > > Thus extract one for the fixes as a separate PR. The bug as described > in the above PR is: > > The crash occurs because a` (If (Bool (CmpP (LoadKlass ..))))` > only has a single projection. It lost the other projection because of > a `CheckCastPP` that becomes `top`. Initially the pattern is, in pseudo > code: > > > if (obj.klass == some_class) { > obj = CheckCastPP#1(obj); > } > > > `obj` itself is a `CheckCastPP` that's pinned at a dominating if. That > dominating if goes through split through phi. 
The `LoadKlass` for the > pseudo code above also has control set to the dominating if being > transformed. This results in: > > > if (phi1 == some_class) { > obj = CheckCastPP#1(phi2); > } > > > with `phi1 = (Phi (LoadKlass obj) (LoadKlass obj))` and `phi2 = (Phi obj obj)` > with `obj = (CheckCastPP#2 obj')` > > `PhiNode::Ideal()` transforms `phi2` into a new `CheckCastPP`: > `(CheckCastPP#3 obj' obj')` with control set to the region right above > the if in the pseudo code above. There happens to be another > `CheckCastPP` at the same control which casts obj' to a narrower > type. So the new `CheckCastPP#3` is replaced by that one (because of > `ConstraintCastNode::dominating_cast()`) and pseudo code becomes: > > > if (phi1 == some_class) { > obj = CheckCastPP#1(CheckCastPP#4(obj')); > } > > > and then: > > > if (phi1 == some_class) { > obj = top; > } > > > because the types of the 2 `CheckCastPP`s conflict. That would be ok if: > > `phi1 == some_class` > > would constant fold. It would if the test was: > > `if (CheckCastPP#4(obj').klass == some_class) {` > > but because of split if, the `(CmpP (LoadKlass ..))` and the > `CheckCastPP#1` ended up with 2 different object inputs that then were > transformed differently. The fix I propose is to have split if clone the entire: > > `(Bool (CmpP (LoadKlass (AddP ..))))` > > down the same way `(Bool (CmpP ..))` is cloned down. After split if, the > pseudo code becomes: > > > if (phi.klass == some_class) { > obj = CheckCastPP#1(phi); > } > > > The bug can't occur because the `CheckCastPP` and `(CmpP (LoadKlass ..))` > operate on the same phi input. The change in split_if.cpp implements > that. Looks good. I will test it.
------------- PR: https://git.openjdk.org/jdk/pull/11689 From shade at openjdk.org Thu Dec 15 19:13:06 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 15 Dec 2022 19:13:06 GMT Subject: RFR: 8298088: RISC-V: Make Address a discriminated union internally [v2] In-Reply-To: References: Message-ID: On Mon, 5 Dec 2022 13:57:12 GMT, Fei Yang wrote: >> The RISC-V port has the same issue as: https://bugs.openjdk.org/browse/JDK-8297830 >> It looks to me that the fix for the AArch64 port is a nice refactoring work. >> This fixes this issue for the RISC-V port with a similar approach. >> >> Testing: >> Tier1 tested with release build on linux-riscv64 unmatched board. >> Run non-trivial benchmark workloads with fastdebug builds. > > Fei Yang has updated the pull request incrementally with one additional commit since the last revision: > > Review AArch64 patch (https://github.com/openjdk/jdk/commit/5a5ced3a900a81fd0b0757017f4138ce97e2521e) also has modifications that rewrite `_offset` -> `offset()`, `_base` -> `base()`, `_index` -> `index()`, etc. Does RISC-V code has the relevant uses too? ------------- PR: https://git.openjdk.org/jdk/pull/11505 From kvn at openjdk.org Thu Dec 15 19:16:06 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 19:16:06 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v4] In-Reply-To: References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: On Thu, 15 Dec 2022 11:15:49 GMT, Roland Westrelin wrote: >> This PR re-does 6312651 (Compiler should only use verified interface >> types for optimization) with a couple fixes I had pushed afterward >> (8297556 and 8297343) and fixes for some other issues. >> >> The trickiest one is a fix for 8297345 (C2: SIGSEGV in >> PhaseIdealLoop::push_pinned_nodes_thru_region) for which I added a >> test case. 
The crash occurs because a (If (Bool (CmpP (LoadKlass ..)))) >> only has a single projection. It lost the other projection because of >> a CheckCastPP that becomes top. Initially the pattern is, in pseudo >> code,: >> >> if (obj.klass == some_class) { >> obj = CheckCastPP#1(obj); >> } >> >> obj itself is a CheckCastPP that's pinned at a dominating if. That >> dominating if goes through split through phi. The LoadKlass for the >> pseudo code above also has control set to the dominating if being >> transformed. This result in: >> >> if (phi1 == some_class) { >> obj = CheckCastPP#1(phi2); >> } >> >> with phi1 = (Phi (LoadKlass obj) (LoadKlass obj)) and phi2 = (Phi obj obj) >> with obj = (CheckCastPP#2 obj') >> >> PhiNode::Ideal() transforms phi2 into a new CheckCastPP: >> (CheckCastPP#3 obj' obj') with control set to the region right above >> the if in the pseudo code above. There happens to be another >> CheckCastPP at the same control which casts obj' to a narrower >> type. So the new CheckCastPP#3 is replaced by that one (because of >> ConstraintCastNode::dominating_cast())and pseudo code becomes: >> >> if (phi1 == some_class) { >> obj = CheckCastPP#1(CheckCastPP#4(obj')); >> } >> >> and then: >> >> if (phi1 == some_class) { >> obj = top; >> } >> >> because the types of the 2 CheckCastPPs conflict. That would be ok if: >> >> phi1 == some_class >> >> would constant fold. It would if the test was: >> >> if (CheckCastPP#4(obj').klass == some_klass) { >> >> but because of split if, the (CmpP (LoadKlass ..)) and the >> CheckCastPP#1 ended up with 2 different object inputs that then were >> transformed differently. The fix I propose is to have split if clone the entire: >> >> (Bool (CmpP (LoadKlass (AddP ..)))) >> >> down the same way (Bool (CmpP ..)) is cloned down. 
After split if, the >> pseudo code becomes: >> >> if (phi.klass == some_class) { >> obj = CheckCastPP#1(phi); >> } >> >> The bug can't occur because the CheckCastPP and (CmpP (LoadKlass ..)) >> operate on the same phi input. The change in split_if.cpp implements >> that. >> >> The other fixes are: >> >> - arraycopynode.cpp: a crash happens because dest_offset and >> src_offset are the same. The call to transform that results in >> src_scale, causes src_offset (and thus dest_offset) to become >> dead. The fix is to add a hook node to preserve dest_offset. This is >> unrelated to 6312651 but it triggers with that change for some >> reason. >> >> - castnode.cpp: I removed CheckCastPPNode::Identity(), a piece of code >> that the change in the handling of interfaces make obsolete and that >> I missed in the PR for 6312651. >> >> - castnode.cpp: the change in CheckCastPPNode::Value() fixes a rare >> assert when during CCP, Value() is called with an input raw constant >> ptr. >> >> - type.cpp: a _klass = NULL field in arrays used to indicate only top >> or bottom but I changed that so _klass is only guaranteed non null >> for basic type arrays. The fix in type.cpp updates a piece of code >> that I didn't adapt to the new meaning of _klass = NULL. >> >> - the other changes are due to StressReflectiveCode. With 6312651, a >> CheckCastPP can fold to top if it sees a type for its input that >> conflicts with its own type. That wasn't the case before. So if a >> type check fails, a CheckCastPP will fold to top and the control >> flow branch it's in must die. That doesn't always happen with >> StressReflectiveCode: the CheckCastPP folds but not the control flow >> path. With ExpandSubTypeCheckAtParseTime on, that's because of a >> code path in LoadNode::Value() that's disabled with >> StressReflectiveCode. 
With ExpandSubTypeCheckAtParseTime off, it's >> because Compile::static_subtype_check() is always pessimistic with >> StressReflectiveCode but it's used by SubTypeCheckNode::sub() to >> find when a node can constant fold. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > extract interfaces->length() in ciInstanceKlass.cpp Looks good to me. Lets finish it after #11689 is pushed. ------------- PR: https://git.openjdk.org/jdk/pull/11666 From kvn at openjdk.org Thu Dec 15 19:41:06 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 19:41:06 GMT Subject: RFR: 8297582: C2: very slow compilation due to type system verification code [v3] In-Reply-To: References: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> Message-ID: On Thu, 15 Dec 2022 14:02:14 GMT, Roland Westrelin wrote: >> The problem here is that for arrays, verification code computes a >> number of meet operations that grows exponentially with the number of >> dimensions while the number of unique meet operations that need to be >> computed is a linear function of the number of dimensions: >> >> >> // With verification code, the meet of A and B causes the computation of: >> // 1- meet(A, B) >> // 2- meet(B, A) >> // 3- meet(dual(meet(A, B)), dual(A)) >> // 4- meet(dual(meet(A, B)), dual(B)) >> // 5- meet(dual(A), dual(B)) >> // 6- meet(dual(B), dual(A)) >> // 7- meet(dual(meet(dual(A), dual(B))), A) >> // 8- meet(dual(meet(dual(A), dual(B))), B) >> // >> // In addition the meet of A[] and B[] requires the computation of the meet of A and B. 
>> // >> // The meet of A[] and B[] triggers the computation of: >> // 1- meet(A[], B[]) >> // 1.1- meet(A, B) >> // 1.2- meet(B, A) >> // 1.3- meet(dual(meet(A, B)), dual(A)) >> // 1.4- meet(dual(meet(A, B)), dual(B)) >> // 1.5- meet(dual(A), dual(B)) >> // 1.6- meet(dual(B), dual(A)) >> // 1.7- meet(dual(meet(dual(A), dual(B))), A) >> // 1.8- meet(dual(meet(dual(A), dual(B))), B) >> // 2- meet(B[], A[]) >> // 2.1- meet(B, A) = 1.2 >> // 2.2- meet(A, B) = 1.1 >> // 2.3- meet(dual(meet(B, A)), dual(B)) = 1.4 >> // 2.4- meet(dual(meet(B, A)), dual(A)) = 1.3 >> // 2.5- meet(dual(B), dual(A)) = 1.6 >> // 2.6- meet(dual(A), dual(B)) = 1.5 >> // 2.7- meet(dual(meet(dual(B), dual(A))), B) = 1.8 >> // 2.8- meet(dual(meet(dual(B), dual(A))), A) = 1.7 >> // etc. >> >> >> >> There are a lot of redundant computations being performed. The fix I >> propose is simply to cache the result of meet computations. So when >> the type system code is called to compute, for instance, the meet of >> A[][] and B[][], the cache starts empty. Then as the meet computations >> proceed, the cache is filled with the meet results for A[] and B[], >> A and B, etc. Once the type system code returns with the result >> for A[][] and B[][], the cache is cleared. >> >> With this, the test case I added goes from "never seem to ever finish" >> to "complete in no time". > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > Vladimir's review Nice. I will submit testing. You need a second review.
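
The caching idea described above can be illustrated with a small, self-contained Java sketch. This is a toy model only, not the HotSpot code: the class `MeetCacheSketch`, its string-based "types" and the methods `rawMeet`, `checkedMeet` and `count` are all hypothetical names made up for the illustration. The verification step is reduced to a single symmetry check, which is already enough to make the number of raw meet computations grow exponentially with the array dimension unless results are cached for the duration of one top-level query.

```java
import java.util.HashMap;
import java.util.Map;

public class MeetCacheSketch {
    static long rawMeets = 0;                                  // number of raw meet computations
    static final Map<String, String> cache = new HashMap<>();  // per-query cache of meet results

    static boolean isArray(String t) { return t.endsWith("[]"); }
    static String element(String t) { return t.substring(0, t.length() - 2); }

    // Raw meet: arrays meet element-wise, distinct non-array types meet at "Object".
    static String rawMeet(String a, String b, boolean useCache) {
        rawMeets++;
        if (a.equals(b)) return a;
        if (isArray(a) && isArray(b)) {
            return checkedMeet(element(a), element(b), useCache) + "[]";
        }
        return "Object";
    }

    // Checked meet: computes the meet in both orders and compares the results,
    // as a stand-in for the (much richer) verification in the real type system.
    static String checkedMeet(String a, String b, boolean useCache) {
        if (useCache) {
            String hit = cache.get(a + "|" + b);
            if (hit != null) return hit;
        }
        String ab = rawMeet(a, b, useCache);
        String ba = rawMeet(b, a, useCache);
        if (!ab.equals(ba)) throw new AssertionError("meet is not symmetric");
        if (useCache) {
            cache.put(a + "|" + b, ab);
            cache.put(b + "|" + a, ab);
        }
        return ab;
    }

    // Runs one top-level query: the cache starts empty and is cleared afterwards.
    static long count(String a, String b, boolean useCache) {
        rawMeets = 0;
        cache.clear();
        checkedMeet(a, b, useCache);
        cache.clear();
        return rawMeets;
    }

    public static void main(String[] args) {
        System.out.println("uncached: " + count("A[][][]", "B[][][]", false));
        System.out.println("cached:   " + count("A[][][]", "B[][][]", true));
    }
}
```

In this toy model the uncached count roughly doubles with every added dimension, while the cached count grows by a constant per dimension, which mirrors the exponential versus linear behaviour described in the PR text.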
------------- PR: https://git.openjdk.org/jdk/pull/11673 From vlivanov at openjdk.org Thu Dec 15 19:52:08 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 15 Dec 2022 19:52:08 GMT Subject: RFR: 8297582: C2: very slow compilation due to type system verification code [v3] In-Reply-To: References: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> Message-ID: <3EkwzbYdOtYyUWLWZ7stwwZXhoNikxg2j3xp7zEojoQ=.1437737f-bc8a-45e3-b907-64e546b07b9f@github.com> On Thu, 15 Dec 2022 14:02:14 GMT, Roland Westrelin wrote: >> The problem here is that for arrays, verification code computes a >> number of meet operations that grows exponentially with the number of >> dimensions while the number of unique meet operations that need to be >> computed is a linear function of the number of dimensions: >> >> >> // With verification code, the meet of A and B causes the computation of: >> // 1- meet(A, B) >> // 2- meet(B, A) >> // 3- meet(dual(meet(A, B)), dual(A)) >> // 4- meet(dual(meet(A, B)), dual(B)) >> // 5- meet(dual(A), dual(B)) >> // 6- meet(dual(B), dual(A)) >> // 7- meet(dual(meet(dual(A), dual(B))), A) >> // 8- meet(dual(meet(dual(A), dual(B))), B) >> // >> // In addition the meet of A[] and B[] requires the computation of the meet of A and B. 
>> // >> // The meet of A[] and B[] triggers the computation of: >> // 1- meet(A[], B[][) >> // 1.1- meet(A, B) >> // 1.2- meet(B, A) >> // 1.3- meet(dual(meet(A, B)), dual(A)) >> // 1.4- meet(dual(meet(A, B)), dual(B)) >> // 1.5- meet(dual(A), dual(B)) >> // 1.6- meet(dual(B), dual(A)) >> // 1.7- meet(dual(meet(dual(A), dual(B))), A) >> // 1.8- meet(dual(meet(dual(A), dual(B))), B) >> // 2- meet(B[], A[]) >> // 2.1- meet(B, A) = 1.2 >> // 2.2- meet(A, B) = 1.1 >> // 2.3- meet(dual(meet(B, A)), dual(B)) = 1.4 >> // 2.4- meet(dual(meet(B, A)), dual(A)) = 1.3 >> // 2.5- meet(dual(B), dual(A)) = 1.6 >> // 2.6- meet(dual(A), dual(B)) = 1.5 >> // 2.7- meet(dual(meet(dual(B), dual(A))), B) = 1.8 >> // 2.8- meet(dual(meet(dual(B), dual(A))), B) = 1.7 >> // etc. >> >> >> >> There are a lot of redundant computations being performed. The fix I >> propose is simply to cache the result of meet computations. So whene >> the type system code is called to compute, for instance, the meet of >> A[][] and B[][], the cache starts empty. Then as the meet computations >> proceed, the cache is filled with meet result for meet of A[] and B[], >> meet of A and B etc. Once the type system code returns with the result >> for A[][] and B[][], the cache is cleared. >> >> With this, the test case I added goes from "never seem to ever finish" >> to "complete in no time". > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > Vladimir's review Looks good. src/hotspot/share/opto/type.cpp line 989: > 987: const Type *mt = this_t->xmeet(t); > 988: #ifdef ASSERT > 989: VerifyMeetResult* type_verify = C->_type_verify; As a style nit, you could access `VerifyMeetResult*` through `VerifyMeetMark` or even encapsulate it there. ------------- Marked as reviewed by vlivanov (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11673 From kvn at openjdk.org Thu Dec 15 20:03:07 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 20:03:07 GMT Subject: RFR: 8297582: C2: very slow compilation due to type system verification code [v3] In-Reply-To: References: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> Message-ID: <4KCBD07b2KX2keobGgJJ1GtBivlKP1bb6roo6zqOLaY=.8d64f18e-2c96-41ba-912f-be0cfc6931cf@github.com> On Thu, 15 Dec 2022 14:02:14 GMT, Roland Westrelin wrote: >> The problem here is that for arrays, verification code computes a >> number of meet operations that grows exponentially with the number of >> dimensions while the number of unique meet operations that need to be >> computed is a linear function of the number of dimensions: >> >> >> // With verification code, the meet of A and B causes the computation of: >> // 1- meet(A, B) >> // 2- meet(B, A) >> // 3- meet(dual(meet(A, B)), dual(A)) >> // 4- meet(dual(meet(A, B)), dual(B)) >> // 5- meet(dual(A), dual(B)) >> // 6- meet(dual(B), dual(A)) >> // 7- meet(dual(meet(dual(A), dual(B))), A) >> // 8- meet(dual(meet(dual(A), dual(B))), B) >> // >> // In addition the meet of A[] and B[] requires the computation of the meet of A and B. 
>> // >> // The meet of A[] and B[] triggers the computation of: >> // 1- meet(A[], B[][) >> // 1.1- meet(A, B) >> // 1.2- meet(B, A) >> // 1.3- meet(dual(meet(A, B)), dual(A)) >> // 1.4- meet(dual(meet(A, B)), dual(B)) >> // 1.5- meet(dual(A), dual(B)) >> // 1.6- meet(dual(B), dual(A)) >> // 1.7- meet(dual(meet(dual(A), dual(B))), A) >> // 1.8- meet(dual(meet(dual(A), dual(B))), B) >> // 2- meet(B[], A[]) >> // 2.1- meet(B, A) = 1.2 >> // 2.2- meet(A, B) = 1.1 >> // 2.3- meet(dual(meet(B, A)), dual(B)) = 1.4 >> // 2.4- meet(dual(meet(B, A)), dual(A)) = 1.3 >> // 2.5- meet(dual(B), dual(A)) = 1.6 >> // 2.6- meet(dual(A), dual(B)) = 1.5 >> // 2.7- meet(dual(meet(dual(B), dual(A))), B) = 1.8 >> // 2.8- meet(dual(meet(dual(B), dual(A))), B) = 1.7 >> // etc. >> >> >> >> There are a lot of redundant computations being performed. The fix I >> propose is simply to cache the result of meet computations. So whene >> the type system code is called to compute, for instance, the meet of >> A[][] and B[][], the cache starts empty. Then as the meet computations >> proceed, the cache is filled with meet result for meet of A[] and B[], >> meet of A and B etc. Once the type system code returns with the result >> for A[][] and B[][], the cache is cleared. >> >> With this, the test case I added goes from "never seem to ever finish" >> to "complete in no time". > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > Vladimir's review src/hotspot/share/opto/type.cpp line 981: > 979: #ifdef ASSERT > 980: Compile* C = Compile::current(); > 981: VerifyMeetMark verif(C); I missed this. 
`verif(C)` -> `verify()` ------------- PR: https://git.openjdk.org/jdk/pull/11673 From kvn at openjdk.org Thu Dec 15 20:13:06 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 20:13:06 GMT Subject: RFR: 8297582: C2: very slow compilation due to type system verification code [v3] In-Reply-To: <3EkwzbYdOtYyUWLWZ7stwwZXhoNikxg2j3xp7zEojoQ=.1437737f-bc8a-45e3-b907-64e546b07b9f@github.com> References: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> <3EkwzbYdOtYyUWLWZ7stwwZXhoNikxg2j3xp7zEojoQ=.1437737f-bc8a-45e3-b907-64e546b07b9f@github.com> Message-ID: <7fnUt031zLLAQTVGf0S5rzB53WQfRs8XSnCiaP05wy8=.19107043-b227-4e2e-bd5f-29f6d91746d8@github.com> On Thu, 15 Dec 2022 19:48:37 GMT, Vladimir Ivanov wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> Vladimir's review > > src/hotspot/share/opto/type.cpp line 989: > >> 987: const Type *mt = this_t->xmeet(t); >> 988: #ifdef ASSERT >> 989: VerifyMeetResult* type_verify = C->_type_verify; > > As a style nit, you could access `VerifyMeetResult*` through `VerifyMeetMark` or even encapsulate it there. I assume you suggested to hide access to cache (`add()`, `meet()`) in `VerifyMeetMark`. I agree. `VerifyMeetMark` can be renamed `VerifyMeet` in such case. ------------- PR: https://git.openjdk.org/jdk/pull/11673 From kvn at openjdk.org Thu Dec 15 20:23:05 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 20:23:05 GMT Subject: RFR: 8298632: [TESTBUG] Add IR checks in jtreg vectorization tests In-Reply-To: References: Message-ID: <5Q2QZBQTIGSFNc2uls4JhLjQWaE1BRPZDD6DC_xjnTQ=.3b0a56e6-f1e3-47ca-a3d6-93faee9e8f2e@github.com> On Thu, 15 Dec 2022 18:05:35 GMT, Vladimir Kozlov wrote: > Good work. But I think you should add x86 IR testing too. It could be done by adding `"sse2", "true"` to `applyIfCPUFeatureOr`. Or high features (avx2, avx512) if required. 
I tested and almost all tests passed if I use "avx2" instead of my original "sse2". Few tests requires "avx512dq": ArrayTypeConvertTest.convertDoubleToLong() ArrayTypeConvertTest.convertFloatToLong() ArrayTypeConvertTest.convertLongToDouble() ArrayTypeConvertTest.convertLongToFloat() BasicLongOpTest.vectorAbs() needs "avx512vl" ------------- PR: https://git.openjdk.org/jdk/pull/11687 From kvn at openjdk.org Thu Dec 15 20:26:03 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 20:26:03 GMT Subject: RFR: 8298632: [TESTBUG] Add IR checks in jtreg vectorization tests In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 09:01:42 GMT, Pengfei Li wrote: > In JDK-8183390, we introduced a new auto-vectorization testing framework in hotspot jtreg tests. The new framework provides us a much simpler way to add new test cases of auto-vectorizable loops. But the previous patch just added a check of the correctness of the result after vectorization. There is no check about whether the code is vectorized or not. > > This patch adds IR checks to verify C2's vectorization ability on test cases inside this framework. With this patch, each test method annotated with `@Test` is verified in two ways. First, it's invoked twice and the return results from the interpreter and C2 compiled code are compared. Second, the count of expected vector IR is checked by the IR framework if the test method has IR rule annotation. > > Ideally, we should check IR rules on all platforms. But in practice, the vectorization ability can be quite different on different platforms, or different generations of CPUs of one platform. So in this patch, we only add vectorizable checks for AArch64. Checks for other platforms (such as x86) can still be added later with more CPU feature conditions. We also add some negative rules (or in-vectorizable rules) with `@IR(failOn=...` on cases that should not be vectorized on any platform. 
> > A few more test cases are added within this patch as well. > > We tested the new IR rules on below kinds of CPUs. > - AArch64 w/ 512-bit SVE > - AArch64 w/ 128-bit SVE > - AArch64 w/o SVE (NEON only) > - x86 Don't forget to change `applyIfCPUFeature` to `applyIfCPUFeatureOr`. And to run these tests with different AVX configurations we need to remove `vm.flagless` from `@requires`. The `-XX:UseAVX=n` flag is supported by IR testing now. ------------- PR: https://git.openjdk.org/jdk/pull/11687 From kvn at openjdk.org Thu Dec 15 20:36:04 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 20:36:04 GMT Subject: RFR: 8298809: Clean up vm/compiler/InterfaceCalls JMH In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 16:05:49 GMT, Eric Caspole wrote: > I removed some confusing less effective cases and modified and renamed some to cover what seem like the most useful cases with 1+ types and 1+ interfaces implemented in those types. Here is an example run: > > Benchmark Mode Cnt Score Error Units > InterfaceCalls.test1stInt2Types avgt 12 2.196 ± 0.022 ns/op > InterfaceCalls.test1stInt3Types avgt 12 8.259 ± 0.045 ns/op > InterfaceCalls.test1stInt5Types avgt 12 8.279 ± 0.024 ns/op > InterfaceCalls.test2ndInt2Types avgt 12 2.467 ± 0.023 ns/op > InterfaceCalls.test2ndInt3Types avgt 12 9.287 ± 0.032 ns/op > InterfaceCalls.test2ndInt5Types avgt 12 9.343 ± 0.027 ns/op > InterfaceCalls.testMonomorphic avgt 12 1.440 ± 0.031 ns/op To be consistent I think we should rename interfaces and their methods: interface FirstInterface { public int getIntFirst(); } interface SecondInterface { public int getIntSecond(); } ------------- Changes requested by kvn (Reviewer).
PR: https://git.openjdk.org/jdk/pull/11696 From kvn at openjdk.org Thu Dec 15 20:55:05 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 20:55:05 GMT Subject: RFR: 8297724: Loop strip mining prevents some empty loops from being eliminated In-Reply-To: <0h64UphIm15r5MArLZeVhafwWGb6iXMsNreW9lqai3A=.8d4955e7-cfc9-4e6b-a4b4-72bd1f051c9f@github.com> References: <0h64UphIm15r5MArLZeVhafwWGb6iXMsNreW9lqai3A=.8d4955e7-cfc9-4e6b-a4b4-72bd1f051c9f@github.com> Message-ID: On Thu, 15 Dec 2022 16:43:07 GMT, Roland Westrelin wrote: > When an empty loop is found, it's removed and as a consequence the > outer strip mine loop and the safepoint that it contains are also > removed. A counted loop is empty if it has the minimum number of nodes > that a well formed counted loop contains. In some cases, the loop has > extra nodes and the safepoint in the outer loop is the only node that > keeps those extra nodes alive. If the safepoint was to be removed, > then the counted loop would have the minimum number of nodes and be > considered empty. But the safepoint can't be removed until the loop is > considered empty which only happens if it has the minimum of nodes. As > a result, these loops are not removed. Note that now that the loop > strip mining loop nest is constructed even if UseCountedLoopSafepoints > is false, there's a regression where some loops used to be removed as > empty before but not anymore. > > The fix I propose is to extend IdealLoopTree::do_remove_empty_loop() > so it handles those cases. If it encounters a loop with no flow > control in the loop body but a number of nodes greater than the > minimum number of nodes, it starts from the extra nodes in the loop > body and follows uses until it finds a side effect, ignoring the > safepoint of the outer loop. If it finds none, then the extra nodes > can be removed and the loop is empty. 
This also works if the extra > nodes are kept alive by the safepoints of 2 different counted loops > and one can only be proven empty if the other one is as well (and the > other one proven empty if the first one is) and should work even if > there are more than 2 nodes involved.. Looks good to me. Thank you for fixing it. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11699 From dchuyko at openjdk.org Thu Dec 15 21:07:28 2022 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Thu, 15 Dec 2022 21:07:28 GMT Subject: RFR: JDK-8153837: AArch64: Handle special cases for MaxINode & MinINode [v2] In-Reply-To: References: Message-ID: > This is a return to an older optimization suggestion for integer Math.min() and Math.max(). See original thread at https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2016-April/022367.html > > In a case when one of the arguments is a constant 0, 1 or -1 this constant is currently materialized (like zero is moved to a dedicated register). Instead of that we can produce denoted values from ZR zero register using CSEL, CSINC or CSINV. Thus the register usage and the load instruction are removed. > > The implementation adds 3 additional matching rules for min and 3 for max in aarch64.ad file. As constants currently can be in any MinI/MaxI node input, ideal transformation for that nodes is changed to put a constant into the first input. It allows to have a single rule for each value instead of two. First input is not very natural, it is selected because of https://github.com/openjdk/jdk/pull/3513 optimization that added right-spline transformation for MinI/MaxI. I think it can be symmetrically changed to left-spline but it was not a subject of this PR. Each match rule generates 2 'instruct' intructions following the pattern introduced in https://github.com/openjdk/jdk/commit/fde854e03779e6809279dbf85b0645eb49d8736a. They are newly added ones with one peculiarity. 
They try to follow the regular naming scheme but lack the operand that corresponds to a constant that is not actually needed, e.g. 'instruct cmovI_reg_immM1_ge(iRegINoSp dst, iRegI src1, rFlagsReg cr)', which doesn't have an 'immI_M1' input. > > New TestMinMaxIntrinsics jtreg test compares results produced by intrinsics for generic and specialized versions with a Java implementation of min and max. Intrinsics are used in lambdas that are compiled with the Whitebox API. The cases include -1, 0, 1 and a couple of regular values. This test can be used to check the generated assembly by adding `-XX:+PrintCompilation -XX:+PrintOptoAssembly`. Nano-benchmarks where a specialized version is called from a non-inlined method also show the changed code with `-prof perfasm`. > > A typical nano-benchmark with a loop and a Blackhole over an array shows no difference in performance as the constant is anyway moved out of the loop and usually there are enough registers. However special nano-benchmarks can be considered, e.g. > > > @Benchmark > @OperationsPerInvocation(TESTSIZE) > public int max0_use8_i() { > int sum = 0; > for(int i = 0; i < TESTSIZE; i++) { > use8(0, 1, 2, 3, 4, 5, 6, 7); > sum += Math.max(i, 0); > } > return sum; > } > > @CompilerControl(CompilerControl.Mode.DONT_INLINE) > public void use8(int p0, int p1, int p2, int p3, int p4, int p5, int p6, int p7) { > } > > > Saving ~1 L1 icache load makes it ~9% faster, and more than half of the cost is the use8() helper. > > The new version passes the new TestMinMaxIntrinsics test on x86 and aarch64 and tier1,2 tests on those platforms (release build). > > The zero case is especially interesting. In general, I wonder how it could be possible to allocate ZR as a register for zero constants so the load there could be a no-op and we might get rid of special rules with immI0.
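
The kind of comparison that the TestMinMaxIntrinsics description outlines can be sketched in plain Java. This is only an assumed illustration, not the actual jtreg test: the class `MinMaxConstSketch` and its helpers `refMax`, `refMin`, `agrees` and `checkAll` are hypothetical, and the sketch does not use the Whitebox API or force compilation. It simply checks that `Math.max`/`Math.min` with the constants -1, 0 and 1 agree with a reference written without those methods.

```java
import java.util.function.IntUnaryOperator;

public class MinMaxConstSketch {
    // Reference implementations that do not call Math.min/Math.max.
    static int refMax(int a, int b) { return a >= b ? a : b; }
    static int refMin(int a, int b) { return a <= b ? a : b; }

    // Inputs include the special constants and a few regular and boundary values.
    static final int[] INPUTS = {Integer.MIN_VALUE, -2, -1, 0, 1, 2, 42, Integer.MAX_VALUE};

    // True if the specialized lambda matches the reference on every input.
    static boolean agrees(IntUnaryOperator specialized, IntUnaryOperator reference) {
        for (int x : INPUTS) {
            if (specialized.applyAsInt(x) != reference.applyAsInt(x)) return false;
        }
        return true;
    }

    // Covers min and max against each of the constants -1, 0 and 1.
    static boolean checkAll() {
        return agrees(x -> Math.max(x, -1), x -> refMax(x, -1))
            && agrees(x -> Math.max(x, 0),  x -> refMax(x, 0))
            && agrees(x -> Math.max(x, 1),  x -> refMax(x, 1))
            && agrees(x -> Math.min(x, -1), x -> refMin(x, -1))
            && agrees(x -> Math.min(x, 0),  x -> refMin(x, 0))
            && agrees(x -> Math.min(x, 1),  x -> refMin(x, 1));
    }

    public static void main(String[] args) {
        System.out.println("all constant cases agree: " + checkAll());
    }
}
```

Whether the JIT actually emits CSEL, CSINC or CSINV for these lambdas depends on the platform and on the code being compiled by C2, so a check like this only guards correctness and says nothing about the generated instructions.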
Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: Reverted Ideal change, moved definitions to m4 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11570/files - new: https://git.openjdk.org/jdk/pull/11570/files/9bf234d5..0b9ed33f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11570&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11570&range=00-01 Stats: 617 lines in 3 files changed: 389 ins; 220 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/11570.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11570/head:pull/11570 PR: https://git.openjdk.org/jdk/pull/11570 From vlivanov at openjdk.org Thu Dec 15 21:08:07 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 15 Dec 2022 21:08:07 GMT Subject: RFR: 8297582: C2: very slow compilation due to type system verification code [v3] In-Reply-To: References: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> Message-ID: <_TcK5wY8noJtS8u-XsuM1L3hq9hnJzbscMqRUsYUf98=.fed6e48e-45a4-4bc9-af63-2ccc8abc5aa1@github.com> On Thu, 15 Dec 2022 14:02:14 GMT, Roland Westrelin wrote: >> The problem here is that for arrays, verification code computes a >> number of meet operations that grows exponentially with the number of >> dimensions while the number of unique meet operations that need to be >> computed is a linear function of the number of dimensions: >> >> >> // With verification code, the meet of A and B causes the computation of: >> // 1- meet(A, B) >> // 2- meet(B, A) >> // 3- meet(dual(meet(A, B)), dual(A)) >> // 4- meet(dual(meet(A, B)), dual(B)) >> // 5- meet(dual(A), dual(B)) >> // 6- meet(dual(B), dual(A)) >> // 7- meet(dual(meet(dual(A), dual(B))), A) >> // 8- meet(dual(meet(dual(A), dual(B))), B) >> // >> // In addition the meet of A[] and B[] requires the computation of the meet of A and B. 
>> // >> // The meet of A[] and B[] triggers the computation of: >> // 1- meet(A[], B[][) >> // 1.1- meet(A, B) >> // 1.2- meet(B, A) >> // 1.3- meet(dual(meet(A, B)), dual(A)) >> // 1.4- meet(dual(meet(A, B)), dual(B)) >> // 1.5- meet(dual(A), dual(B)) >> // 1.6- meet(dual(B), dual(A)) >> // 1.7- meet(dual(meet(dual(A), dual(B))), A) >> // 1.8- meet(dual(meet(dual(A), dual(B))), B) >> // 2- meet(B[], A[]) >> // 2.1- meet(B, A) = 1.2 >> // 2.2- meet(A, B) = 1.1 >> // 2.3- meet(dual(meet(B, A)), dual(B)) = 1.4 >> // 2.4- meet(dual(meet(B, A)), dual(A)) = 1.3 >> // 2.5- meet(dual(B), dual(A)) = 1.6 >> // 2.6- meet(dual(A), dual(B)) = 1.5 >> // 2.7- meet(dual(meet(dual(B), dual(A))), B) = 1.8 >> // 2.8- meet(dual(meet(dual(B), dual(A))), B) = 1.7 >> // etc. >> >> >> >> There are a lot of redundant computations being performed. The fix I >> propose is simply to cache the result of meet computations. So whene >> the type system code is called to compute, for instance, the meet of >> A[][] and B[][], the cache starts empty. Then as the meet computations >> proceed, the cache is filled with meet result for meet of A[] and B[], >> meet of A and B etc. Once the type system code returns with the result >> for A[][] and B[][], the cache is cleared. >> >> With this, the test case I added goes from "never seem to ever finish" >> to "complete in no time". > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > Vladimir's review Changes requested by vlivanov (Reviewer). src/hotspot/share/opto/type.hpp line 188: > 186: } > 187: > 188: void assert_type_verify_empty() const PRODUCT_RETURN; Just noticed that `PRODUCT_RETURN` breaks optimized build (`Type::assert_type_verify_empty()` is under `ASSERT`). 
------------- PR: https://git.openjdk.org/jdk/pull/11673 From vlivanov at openjdk.org Thu Dec 15 21:25:08 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 15 Dec 2022 21:25:08 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v4] In-Reply-To: References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: On Thu, 15 Dec 2022 11:15:49 GMT, Roland Westrelin wrote: >> This PR re-does 6312651 (Compiler should only use verified interface >> types for optimization) with a couple fixes I had pushed afterward >> (8297556 and 8297343) and fixes for some other issues. >> >> The trickiest one is a fix for 8297345 (C2: SIGSEGV in >> PhaseIdealLoop::push_pinned_nodes_thru_region) for which I added a >> test case. The crash occurs because a (If (Bool (CmpP (LoadKlass ..)))) >> only has a single projection. It lost the other projection because of >> a CheckCastPP that becomes top. Initially the pattern is, in pseudo >> code,: >> >> if (obj.klass == some_class) { >> obj = CheckCastPP#1(obj); >> } >> >> obj itself is a CheckCastPP that's pinned at a dominating if. That >> dominating if goes through split through phi. The LoadKlass for the >> pseudo code above also has control set to the dominating if being >> transformed. This result in: >> >> if (phi1 == some_class) { >> obj = CheckCastPP#1(phi2); >> } >> >> with phi1 = (Phi (LoadKlass obj) (LoadKlass obj)) and phi2 = (Phi obj obj) >> with obj = (CheckCastPP#2 obj') >> >> PhiNode::Ideal() transforms phi2 into a new CheckCastPP: >> (CheckCastPP#3 obj' obj') with control set to the region right above >> the if in the pseudo code above. There happens to be another >> CheckCastPP at the same control which casts obj' to a narrower >> type. 
So the new CheckCastPP#3 is replaced by that one (because of >> ConstraintCastNode::dominating_cast())and pseudo code becomes: >> >> if (phi1 == some_class) { >> obj = CheckCastPP#1(CheckCastPP#4(obj')); >> } >> >> and then: >> >> if (phi1 == some_class) { >> obj = top; >> } >> >> because the types of the 2 CheckCastPPs conflict. That would be ok if: >> >> phi1 == some_class >> >> would constant fold. It would if the test was: >> >> if (CheckCastPP#4(obj').klass == some_klass) { >> >> but because of split if, the (CmpP (LoadKlass ..)) and the >> CheckCastPP#1 ended up with 2 different object inputs that then were >> transformed differently. The fix I propose is to have split if clone the entire: >> >> (Bool (CmpP (LoadKlass (AddP ..)))) >> >> down the same way (Bool (CmpP ..)) is cloned down. After split if, the >> pseudo code becomes: >> >> if (phi.klass == some_class) { >> obj = CheckCastPP#1(phi); >> } >> >> The bug can't occur because the CheckCastPP and (CmpP (LoadKlass ..)) >> operate on the same phi input. The change in split_if.cpp implements >> that. >> >> The other fixes are: >> >> - arraycopynode.cpp: a crash happens because dest_offset and >> src_offset are the same. The call to transform that results in >> src_scale, causes src_offset (and thus dest_offset) to become >> dead. The fix is to add a hook node to preserve dest_offset. This is >> unrelated to 6312651 but it triggers with that change for some >> reason. >> >> - castnode.cpp: I removed CheckCastPPNode::Identity(), a piece of code >> that the change in the handling of interfaces make obsolete and that >> I missed in the PR for 6312651. >> >> - castnode.cpp: the change in CheckCastPPNode::Value() fixes a rare >> assert when during CCP, Value() is called with an input raw constant >> ptr. >> >> - type.cpp: a _klass = NULL field in arrays used to indicate only top >> or bottom but I changed that so _klass is only guaranteed non null >> for basic type arrays. 
The fix in type.cpp updates a piece of code >> that I didn't adapt to the new meaning of _klass = NULL. >> >> - the other changes are due to StressReflectiveCode. With 6312651, a >> CheckCastPP can fold to top if it sees a type for its input that >> conflicts with its own type. That wasn't the case before. So if a >> type check fails, a CheckCastPP will fold to top and the control >> flow branch it's in must die. That doesn't always happen with >> StressReflectiveCode: the CheckCastPP folds but not the control flow >> path. With ExpandSubTypeCheckAtParseTime on, that's because of a >> code path in LoadNode::Value() that's disabled with >> StressReflectiveCode. With ExpandSubTypeCheckAtParseTime off, it's >> because Compile::static_subtype_check() is always pessimistic with >> StressReflectiveCode but it's used by SubTypeCheckNode::sub() to >> find when a node can constant fold. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > extract interfaces->length() in ciInstanceKlass.cpp Looks good. src/hotspot/share/ci/ciArrayKlass.hpp line 59: > 57: > 58: static ciArrayKlass* make(ciType* element_type); > 59: Redundant. A leftover from reverted changes? ------------- Marked as reviewed by vlivanov (Reviewer). PR: https://git.openjdk.org/jdk/pull/11666 From dchuyko at openjdk.org Thu Dec 15 21:29:08 2022 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Thu, 15 Dec 2022 21:29:08 GMT Subject: RFR: JDK-8153837: AArch64: Handle special cases for MaxINode & MinINode [v2] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 18:52:29 GMT, Andrew Haley wrote: >> Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: >> >> Reverted Ideal change, moved definitions to m4 > > src/hotspot/cpu/aarch64/aarch64.ad line 15816: > >> 15814: %} >> 15815: %} >> 15816: > > Please put all this repetitive stuff into aarch64_ad.m4 and we'll review that. 
As min/max ideal transformation https://git.openjdk.org/jdk/pull/9703 looks into child nodes with add and nested min/max, it is tricky to add guarantee of constant being in a fixed input (initial PR variant broke the test ). I rolled the Ideal change back but it leads to doubling of matching rules. So thanks for the advice, that makes even more sense to move things to .m4. I've done that. CMOV_INSN macro declares instruct-s for generic min/max case with csel. CMOV_DRAW_INSN macro declares instruct-s that "draw" one of (-1, 0, 1) constants using zr. MINMAX_DRAW_INSN macro declares instruct-s with matching rules for (-1, 0, 1) immediate operand and a generic register in different order. There are also helper string macros for upper and lower case to reduce the number of word repetitions. Existing generic min and max rules were renamed for consistency. ------------- PR: https://git.openjdk.org/jdk/pull/11570 From jrose at openjdk.org Thu Dec 15 21:37:06 2022 From: jrose at openjdk.org (John R Rose) Date: Thu, 15 Dec 2022 21:37:06 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v2] In-Reply-To: <7udKKpQ1XY438VTwGKXFuUoiRC4-PRiEQOxLbpEdZV8=.a4b60b9f-90f7-4dde-aaa9-90d3658f1ded@github.com> References: <7udKKpQ1XY438VTwGKXFuUoiRC4-PRiEQOxLbpEdZV8=.a4b60b9f-90f7-4dde-aaa9-90d3658f1ded@github.com> Message-ID: <6HhzRpDqYnqA0tFlewoHqoGGrhE1YacqAIXoxss4ibg=.bc80bfc5-9992-4be0-bc19-eebd3a106e62@github.com> On Mon, 12 Dec 2022 17:56:11 GMT, Kim Barrett wrote: >> Please review this change to construction and copying of the Relocation and >> RelocationHolder classes, to eliminate some questionable C++ usage. >> >> The array type for RelocationHandle::_relocbuf is changed from void* to char, >> because using a char array for raw memory is countenanced by the standard, >> while not so much for an array of void*. The desired alignment is maintained >> via a union, since using alignas is not (yet) permitted in HotSpot code. 
>> >> There is also now a comment discussing the use of _relocbuf in more detail, >> including some areas of continued sketchiness with respect to standard conformance and >> reliance on implementation dependent behavior. >> >> No longer use trivial copy and assignment for RelocationHolder, since that >> isn't technically correct. The Relocation in the holder is not trivially >> copyable, since it is polymorphic. It seemed to work in practice with the >> supported compilers, but we shouldn't (and don't need to) rely on it. Instead >> we have a new virtual function Relocation::copy_into that copies the most >> derived object into the holder's _relocbuf using placement new. >> >> Eliminated the implicit conversion constructor from Relocation to holder that >> wordwise copied (to possibly beyond the end of) the Relocation into the >> holder's _relocbuf. We could have implemented this more carefully with the >> new approach (using copy_into), but we don't actually need this conversion. >> The only use of it was actually a wasted copy (in assembler_x86.cpp). >> >> Eliminated the use of placement new syntax via operator new with a holder >> argument to copy a Resource object into a holder. This included runtime >> verification that the size of the object is not greater than the size of >> _relocbuf; we now do corresponding verification at compile-time. This also >> included an incorrect attempt at a runtime check that the Relocation base >> class would be at the same address as the derived class being constructed; we >> now perform that check correctly. We also discuss in a comment the layout >> assumption being made (that isn't required by the standard but is provided by >> all supported compilers), and what to do if we encounter a compiler that >> behaves differently. >> >> Eliminated the idiom of making a default-constructed holder and then >> overwriting its held relocation with a newly constructed one, using the >> aforementioned (and eliminated) operator new.
Instead, RelocationHolder now has a >> factory function template (construct) for creating holders with a given >> relocation type, constructed using provided arguments. (The arguments are taken >> as const-ref rather than using perfect forwarding, as the tools for the latter >> are not (yet) approved for use in HotSpot. Using const-ref is good enough in >> this case.) >> >> Describe and verify other assumptions being made, such as all Relocation >> classes being trivially destructible. >> >> Testing: >> mach5 tier1-5 >> >> Future work: >> >> * RelocationHolder::reloc isn't const-correct. Making it so will require >> adjustment of some callers. I'll follow up with an RFE to address this. >> >> * Relocation classes have many virtual function overrides that are unmarked. >> I'll follow up with an RFE to add "override" specifiers. >> >> Potential issue: The removal of RelocationHolder(Relocation*) might not work >> for some platforms. I've tested on platforms supported by Oracle (where there >> was only one (mistaken) use). There might be uses by other platforms. > > Kim Barrett has updated the pull request incrementally with one additional commit since the last revision: > > blank lines in include blocks Looks good. Thanks for bringing this code into the future. I suggest putting all the static asserts in one place, instead of two of them in place A and the third (trivial destructor) in place B. Task for another day: Figure out best practices for doing flyweight objects in C++, and package them up for application elsewhere. I think Valhalla could use virtual flyweight field descriptors. Alternative task for another day: Figure out best practices for *avoiding* flyweight objects, and use those instead. You can mark me as a reviewer. 
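As a rough illustration of the pattern under review (a polymorphic object stored in a fixed-size buffer, copied via a virtual `copy_into`, with the layout assumptions checked by static asserts kept in one place), here is a minimal standalone sketch. It is not the HotSpot implementation: `ToyReloc`, `ToyOopReloc`, `ToyHolder`, `emplace`, and the 32-byte buffer are hypothetical names and sizes chosen for the example, and it relies on the same implementation-dependent layout behavior that the quoted description mentions.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <type_traits>

class ToyHolder;

// Polymorphic base, loosely analogous to Relocation (hypothetical).
class ToyReloc {
 public:
  virtual int kind() const { return 0; }
  // Copies the most derived object into the holder's buffer.
  virtual void copy_into(ToyHolder& holder) const;
};

class ToyHolder {
 public:
  template <typename T> struct Construct {};  // tag selecting the emplacing constructor

  template <typename T, typename... Args>
  ToyHolder(Construct<T>, const Args&... args) { emplace<T>(args...); }

  ToyHolder() : ToyHolder(Construct<ToyReloc>()) {}

  // Not trivially copyable: delegate to the held object's virtual copy.
  ToyHolder(const ToyHolder& other) { other.reloc()->copy_into(*this); }
  ToyHolder& operator=(const ToyHolder& other) {
    if (this != &other) other.reloc()->copy_into(*this);
    return *this;
  }

  // Factory: arguments are taken by const-ref, as in the quoted description.
  template <typename T, typename... Args>
  static ToyHolder construct(const Args&... args) {
    return ToyHolder(Construct<T>(), args...);
  }

  // Formally questionable without std::launder; works with common compilers.
  ToyReloc* reloc() { return reinterpret_cast<ToyReloc*>(_buf); }
  const ToyReloc* reloc() const { return reinterpret_cast<const ToyReloc*>(_buf); }

  template <typename T, typename... Args>
  void emplace(const Args&... args) {
    // All compile-time assumptions are collected here.
    static_assert(std::is_base_of<ToyReloc, T>::value, "must derive from ToyReloc");
    static_assert(sizeof(T) <= sizeof(_buf), "buffer too small for this type");
    static_assert(alignof(T) <= alignof(std::max_align_t), "buffer under-aligned");
    static_assert(std::is_trivially_destructible<T>::value,
                  "held objects are overwritten without being destroyed");
    ToyReloc* r = ::new (static_cast<void*>(_buf)) T(args...);
    // Layout assumption: the base subobject starts at the buffer address.
    assert(static_cast<void*>(r) == static_cast<void*>(_buf));
    (void)r;
  }

 private:
  alignas(std::max_align_t) char _buf[32];
};

inline void ToyReloc::copy_into(ToyHolder& holder) const {
  holder.emplace<ToyReloc>(*this);
}

// A derived relocation with some payload (hypothetical).
class ToyOopReloc : public ToyReloc {
 public:
  ToyOopReloc(int index, int offset) : _index(index), _offset(offset) {}
  int kind() const override { return 1; }
  void copy_into(ToyHolder& holder) const override {
    holder.emplace<ToyOopReloc>(*this);
  }
  int index() const { return _index; }
  int offset() const { return _offset; }

 private:
  int _index;
  int _offset;
};
```

A caller would write, for example, `ToyHolder h = ToyHolder::construct<ToyOopReloc>(3, 8);` and could then copy `h` freely; each copy goes through `copy_into`, so the derived payload is preserved rather than sliced.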
------------- PR: https://git.openjdk.org/jdk/pull/11618 From ecaspole at openjdk.org Thu Dec 15 21:57:36 2022 From: ecaspole at openjdk.org (Eric Caspole) Date: Thu, 15 Dec 2022 21:57:36 GMT Subject: RFR: 8298809: Clean up vm/compiler/InterfaceCalls JMH [v2] In-Reply-To: References: Message-ID: > I removed some confusing less effective cases and modified and renamed some to cover what seem like the most useful cases with 1+ types and 1+ interfaces implemented in those types. Here is an example run:
>
> Benchmark                        Mode  Cnt  Score   Error  Units
> InterfaceCalls.test1stInt2Types  avgt   12  2.196 ± 0.022  ns/op
> InterfaceCalls.test1stInt3Types  avgt   12  8.259 ± 0.045  ns/op
> InterfaceCalls.test1stInt5Types  avgt   12  8.279 ± 0.024  ns/op
> InterfaceCalls.test2ndInt2Types  avgt   12  2.467 ± 0.023  ns/op
> InterfaceCalls.test2ndInt3Types  avgt   12  9.287 ± 0.032  ns/op
> InterfaceCalls.test2ndInt5Types  avgt   12  9.343 ± 0.027  ns/op
> InterfaceCalls.testMonomorphic   avgt   12  1.440 ± 0.031  ns/op

Eric Caspole has updated the pull request incrementally with one additional commit since the last revision: Renaming as Vladimir suggested.
------------- Changes: - all: https://git.openjdk.org/jdk/pull/11696/files - new: https://git.openjdk.org/jdk/pull/11696/files/6d1a5c26..fee9d896 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11696&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11696&range=00-01 Stats: 29 lines in 1 file changed: 0 ins; 0 del; 29 mod Patch: https://git.openjdk.org/jdk/pull/11696.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11696/head:pull/11696 PR: https://git.openjdk.org/jdk/pull/11696 From ecaspole at openjdk.org Thu Dec 15 21:57:36 2022 From: ecaspole at openjdk.org (Eric Caspole) Date: Thu, 15 Dec 2022 21:57:36 GMT Subject: RFR: 8298809: Clean up vm/compiler/InterfaceCalls JMH In-Reply-To: References: Message-ID: <00dFyQOyIyA9FlZXCCbfq1kiPw25YEJUb88Jg8UPgXA=.ae604d5f-b622-452a-aed8-47c157f5c417@github.com> On Thu, 15 Dec 2022 16:05:49 GMT, Eric Caspole wrote: > I removed some confusing less effective cases and modified and renamed some to cover what seem like the most useful cases with 1+ types and 1+ interfaces implemented in those types. Here is an example run:
>
> Benchmark                        Mode  Cnt  Score   Error  Units
> InterfaceCalls.test1stInt2Types  avgt   12  2.196 ± 0.022  ns/op
> InterfaceCalls.test1stInt3Types  avgt   12  8.259 ± 0.045  ns/op
> InterfaceCalls.test1stInt5Types  avgt   12  8.279 ± 0.024  ns/op
> InterfaceCalls.test2ndInt2Types  avgt   12  2.467 ± 0.023  ns/op
> InterfaceCalls.test2ndInt3Types  avgt   12  9.287 ± 0.032  ns/op
> InterfaceCalls.test2ndInt5Types  avgt   12  9.343 ± 0.027  ns/op
> InterfaceCalls.testMonomorphic   avgt   12  1.440 ± 0.031  ns/op

Thanks Vladimir, good suggestion!
------------- PR: https://git.openjdk.org/jdk/pull/11696 From kvn at openjdk.org Thu Dec 15 22:32:05 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Dec 2022 22:32:05 GMT Subject: RFR: 8298809: Clean up vm/compiler/InterfaceCalls JMH [v2] In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 21:57:36 GMT, Eric Caspole wrote: >> I removed some confusing less effective cases and modified and renamed some to cover what seem like the most useful cases with 1+ types and 1+ interfaces implemented in those types. Here is an example run:
>>
>> Benchmark                        Mode  Cnt  Score   Error  Units
>> InterfaceCalls.test1stInt2Types  avgt   12  2.196 ± 0.022  ns/op
>> InterfaceCalls.test1stInt3Types  avgt   12  8.259 ± 0.045  ns/op
>> InterfaceCalls.test1stInt5Types  avgt   12  8.279 ± 0.024  ns/op
>> InterfaceCalls.test2ndInt2Types  avgt   12  2.467 ± 0.023  ns/op
>> InterfaceCalls.test2ndInt3Types  avgt   12  9.287 ± 0.032  ns/op
>> InterfaceCalls.test2ndInt5Types  avgt   12  9.343 ± 0.027  ns/op
>> InterfaceCalls.testMonomorphic   avgt   12  1.440 ± 0.031  ns/op
>
> Eric Caspole has updated the pull request incrementally with one additional commit since the last revision:
>
> Renaming as Vladimir suggested.

Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11696 From fyang at openjdk.org Fri Dec 16 01:02:09 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 16 Dec 2022 01:02:09 GMT Subject: RFR: 8298088: RISC-V: Make Address a discriminated union internally [v2] In-Reply-To: References: Message-ID: On Mon, 5 Dec 2022 13:57:12 GMT, Fei Yang wrote: >> The RISC-V port has the same issue as: https://bugs.openjdk.org/browse/JDK-8297830 >> It looks to me that the fix for the AArch64 port is a nice refactoring work. >> This fixes this issue for the RISC-V port with a similar approach. >> >> Testing: >> Tier1 tested with release build on linux-riscv64 unmatched board. >> Run non-trivial benchmark workloads with fastdebug builds.
> > Fei Yang has updated the pull request incrementally with one additional commit since the last revision: > > Review > AArch64 patch ([5a5ced3](https://github.com/openjdk/jdk/commit/5a5ced3a900a81fd0b0757017f4138ce97e2521e)) also has modifications that rewrite `_offset` -> `offset()`, `_base` -> `base()`, `_index` -> `index()`, etc. Does RISC-V code have the relevant uses too? Hi, I double-checked the files assembler_riscv.hpp/cpp and macroAssembler_riscv.hpp/cpp, and I don't see any direct uses of those private data members. ------------- PR: https://git.openjdk.org/jdk/pull/11505 From kbarrett at openjdk.org Fri Dec 16 02:03:27 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 16 Dec 2022 02:03:27 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v3] In-Reply-To: References: Message-ID: > Please review this change to construction and copying of the Relocation and > RelocationHolder classes, to eliminate some questionable C++ usage. > > The array type for RelocationHandle::_relocbuf is changed from void* to char, > because using a char array for raw memory is countenanced by the standard, > while not so much for an array of void*. The desired alignment is maintained > via a union, since using alignas is not (yet) permitted in HotSpot code. > > There is also now a comment discussing the use of _relocbuf in more detail, > including some areas of continued sketchiness with respect to standard conformance and > reliance on implementation dependent behavior. > > No longer use trivial copy and assignment for RelocationHolder, since that > isn't technically correct. The Relocation in the holder is not trivially > copyable, since it is polymorphic. It seemed to work in practice with the > supported compilers, but we shouldn't (and don't need to) rely on it. Instead > we have a new virtual function Relocation::copy_into that copies the most > derived object into the holder's _relocbuf using placement new.
> > Eliminated the implicit conversion constructor from Relocation to holder that > wordwise copied (to possibly beyond the end of) the Relocation into the > holder's _relocbuf. We could have implemented this more carefully with the > new approach (using copy_into), but we don't actually need this conversion. > The only use of it was actually a wasted copy (in assembler_x86.cpp). > > Eliminated the use of placement new syntax via operator new with a holder > argument to copy a Resource object into a holder. This included runtime > verification that the size of the object is not greater than the size of > _relocbuf; we now do corresponding verification at compile-time. This also > included an incorrect attempt at a runtime check that the Relocation base > class would be at the same address as the derived class being constructed; we > now perform that check correctly. We also discuss in a comment the layout > assumption being made (that isn't required by the standard but is provided by > all supported compilers), and what to do if we encounter a compiler that > behaves differently. > > Eliminated the idiom of making a default-constructed holder and then > overwriting its held relocation with a newly constructed one, using the > aforementioned (and eliminated) operator new. Instead, RelocationHolder now has a > factory function template (construct) for creating holders with a given > relocation type, constructed using provided arguments. (The arguments are taken > as const-ref rather than using perfect forwarding, as the tools for the latter > are not (yet) approved for use in HotSpot. Using const-ref is good enough in > this case.) > > Describe and verify other assumptions being made, such as all Relocation > classes being trivially destructible. > > Testing: > mach5 tier1-5 > > Future work: > > * RelocationHolder::reloc isn't const-correct. Making it so will require > adjustment of some callers. I'll follow up with an RFE to address this.
> > * Relocation classes have many virtual function overrides that are unmarked. > I'll follow up with an RFE to add "override" specifiers. > > Potential issue: The removal of RelocationHolder(Relocation*) might not work > for some platforms. I've tested on platforms supported by Oracle (where there > was only one (mistaken) use). There might be uses by other platforms. Kim Barrett has updated the pull request incrementally with two additional commits since the last revision: - use alignas - simplify per jvernee ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11618/files - new: https://git.openjdk.org/jdk/pull/11618/files/2b764714..ebe6e01d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11618&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11618&range=01-02 Stats: 33 lines in 1 file changed: 0 ins; 13 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/11618.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11618/head:pull/11618 PR: https://git.openjdk.org/jdk/pull/11618 From kbarrett at openjdk.org Fri Dec 16 02:05:06 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 16 Dec 2022 02:05:06 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v2] In-Reply-To: References: <7udKKpQ1XY438VTwGKXFuUoiRC4-PRiEQOxLbpEdZV8=.a4b60b9f-90f7-4dde-aaa9-90d3658f1ded@github.com> Message-ID: On Thu, 15 Dec 2022 12:28:29 GMT, Jorn Vernee wrote: >> Kim Barrett has updated the pull request incrementally with one additional commit since the last revision: >> >> blank lines in include blocks > > src/hotspot/share/code/relocInfo.hpp line 528: > >> 526: Relocation* reloc = ctor(_relocbuf); >> 527: check_reloc_placement(reloc); >> 528: } > > It seems that the code is like this, rather than using `typename T, typename...Args` (where `T` is the relocation type), because it is not possible to specify explicit type arguments when calling a constructor template? (So you can't specify `T`). 
> > But, you could truck in the type argument on `Construct`, and have it be inferred. So, this could become:
>
> Suggestion:
>
> template <typename T>
> struct Construct {}; // Tag for selecting this constructor.
> template <typename T, typename... Args> RelocationHolder(Construct<T>, const Args&... args) {
>   check_reloc_type<T>();
>   Relocation* reloc = ::new (_relocbuf) T(args...);
>   check_reloc_placement(reloc);
> }
>
> Callers provide e.g. `Construct<T>()` in `construct` or `Construct<Relocation>()` in the default constructor. `construct` can just forward the rest of the args.
>
> This seems a little simpler (by avoiding the use of lambdas, and making it more locally obvious that the constructor is doing a placement new into `_relocbuf`), so maybe this is preferable?

Yes, that works, and it's an improvement. Thanks. It also let me merge most of the implementation of that constructor with the implementation of `copy_into_impl`; both now use the new `emplace_resource`. I also took a moment to use `alignas` instead of a union, since JDK-8297912 was integrated earlier today. ------------- PR: https://git.openjdk.org/jdk/pull/11618 From jrose at openjdk.org Fri Dec 16 02:22:09 2022 From: jrose at openjdk.org (John R Rose) Date: Fri, 16 Dec 2022 02:22:09 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v3] In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 02:03:27 GMT, Kim Barrett wrote: >> Please review this change to construction and copying of the Relocation and >> RelocationHolder classes, to eliminate some questionable C++ usage. >> >> The array type for RelocationHandle::_relocbuf is changed from void* to char, >> because using a char array for raw memory is countenanced by the standard, >> while not so much for an array of void*. The desired alignment is maintained >> via a union, since using alignas is not (yet) permitted in HotSpot code.
>> >> There is also now a comment discussing the use of _relocbuf in more detail, >> including some areas of continued sketchiness with respect to standard conformance and >> reliance on implementation dependent behavior. >> >> No longer use trivial copy and assignment for RelocationHolder, since that >> isn't technically correct. The Relocation in the holder is not trivially >> copyable, since it is polymorphic. It seemed to work in practice with the >> supported compilers, but we shouldn't (and don't need to) rely on it. Instead >> we have a new virtual function Relocation::copy_into that copies the most >> derived object into the holder's _relocbuf using placement new. >> >> Eliminated the implicit conversion constructor from Relocation to holder that >> wordwise copied (to possibly beyond the end of) the Relocation into the >> holder's _relocbuf. We could have implemented this more carefully with the >> new approach (using copy_into), but we don't actually need this conversion. >> The only use of it was actually a wasted copy (in assembler_x86.cpp). >> >> Eliminated the use of placement new syntax via operator new with a holder >> argument to copy a Resource object into a holder. This included runtime >> verification that the size of the object is not greater than the size of >> _relocbuf; we now do corresponding verification at compile-time. This also >> included an incorrect attempt at a runtime check that the Relocation base >> class would be at the same address as the derived class being constructed; we >> now perform that check correctly. We also discuss in a comment the layout >> assumption being made (that isn't required by the standard but is provided by >> all supported compilers), and what to do if we encounter a compiler that >> behaves differently. >> >> Eliminated the idiom of making a default-constructed holder and then >> overwriting its held relocation with a newly constructed one, using the >> aforementioned (and eliminated) operator new.
Instead, RelocationHolder now has a >> factory function template (construct) for creating holders with a given >> relocation type, constructed using provided arguments. (The arguments are taken >> as const-ref rather than using perfect forwarding, as the tools for the latter >> are not (yet) approved for use in HotSpot. Using const-ref is good enough in >> this case.) >> >> Describe and verify other assumptions being made, such as all Relocation >> classes being trivially destructible. >> >> Testing: >> mach5 tier1-5 >> >> Future work: >> >> * RelocationHolder::reloc isn't const-correct. Making it so will require >> adjustment of some callers. I'll follow up with an RFE to address this. >> >> * Relocation classes have many virtual function overrides that are unmarked. >> I'll follow up with an RFE to add "override" specifiers. >> >> Potential issue: The removal of RelocationHolder(Relocation*) might not work >> for some platforms. I've tested on platforms supported by Oracle (where there >> was only one (mistaken) use). There might be uses by other platforms. > > Kim Barrett has updated the pull request incrementally with two additional commits since the last revision: > > - use alignas > - simplify per jvernee Marked as reviewed by jrose (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11618 From dlong at openjdk.org Fri Dec 16 02:22:11 2022 From: dlong at openjdk.org (Dean Long) Date: Fri, 16 Dec 2022 02:22:11 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v3] In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 02:03:27 GMT, Kim Barrett wrote: >> Please review this change to construction and copying of the Relocation and >> RelocationHolder classes, to eliminate some questionable C++ usage. >> >> The array type for RelocationHandle::_relocbuf is changed from void* to char, >> because using a char array for raw memory is countenanced by the standard, >> while not so much for an array of void*. 
The desired alignment is maintained >> via a union, since using alignas is not (yet) permitted in HotSpot code. >> >> There is also now a comment discussing the use of _relocbuf in more detail, >> including some areas of continued sketchiness with respect to standard conformance and >> reliance on implementation dependent behavior. >> >> No longer use trivial copy and assignment for RelocationHolder, since that >> isn't technically correct. The Relocation in the holder is not trivially >> copyable, since it is polymorphic. It seemed to work in practice with the >> supported compilers, but we shouldn't (and don't need to) rely on it. Instead >> we have a new virtual function Relocation::copy_into that copies the most >> derived object into the holder's _relocbuf using placement new. >> >> Eliminated the implicit conversion constructor from Relocation to holder that >> wordwise copied (to possibly beyond the end of) the Relocation into the >> holder's _relocbuf. We could have implemented this more carefully with the >> new approach (using copy_into), but we don't actually need this conversion. >> The only use of it was actually a wasted copy (in assembler_x86.cpp). >> >> Eliminated the use of placement new syntax via operator new with a holder >> argument to copy a Resource object into a holder. This included runtime >> verification that the size of the object is not greater than the size of >> _relocbuf; we now do corresponding verification at compile-time. This also >> included an incorrect attempt at a runtime check that the Relocation base >> class would be at the same address as the derived class being constructed; we >> now perform that check correctly. We also discuss in a comment the layout >> assumption being made (that isn't required by the standard but is provided by >> all supported compilers), and what to do if we encounter a compiler that >> behaves differently.
>> >> Eliminated the idiom of making a default-constructed holder and then >> overwriting its held relocation with a newly constructed one, using the >> aforementioned (and eliminated) operator new. Instead, RelocationHolder now has a >> factory function template (construct) for creating holders with a given >> relocation type, constructed using provided arguments. (The arguments are taken >> as const-ref rather than using perfect forwarding, as the tools for the latter >> are not (yet) approved for use in HotSpot. Using const-ref is good enough in >> this case.) >> >> Describe and verify other assumptions being made, such as all Relocation >> classes being trivially destructible. >> >> Testing: >> mach5 tier1-5 >> >> Future work: >> >> * RelocationHolder::reloc isn't const-correct. Making it so will require >> adjustment of some callers. I'll follow up with an RFE to address this. >> >> * Relocation classes have many virtual function overrides that are unmarked. >> I'll follow up with an RFE to add "override" specifiers. >> >> Potential issue: The removal of RelocationHolder(Relocation*) might not work >> for some platforms. I've tested on platforms supported by Oracle (where there >> was only one (mistaken) use). There might be uses by other platforms. > > Kim Barrett has updated the pull request incrementally with two additional commits since the last revision: > > - use alignas > - simplify per jvernee src/hotspot/share/code/relocInfo.hpp line 992: > 990: assert(relocInfo::mustIterateImmediateOopsInCode(), > 991: "Must return true so we will search for oops as roots etc. in the code."); > 992: return RelocationHolder::construct(0, 0); I prefer how it was before, where the arguments have names and comments.
src/hotspot/share/code/relocInfo.hpp line 1041: > 1039: // an metadata in the instruction stream > 1040: static RelocationHolder spec_for_immediate() { > 1041: return RelocationHolder::construct(0, 0); I'd rather see (metadata_index, offset) than (0, 0), but I guess the meaning of the argument can be found in spec() above. ------------- PR: https://git.openjdk.org/jdk/pull/11618 From xgong at openjdk.org Fri Dec 16 02:27:04 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 16 Dec 2022 02:27:04 GMT Subject: RFR: 8298244: AArch64: Optimize vector implementation of AddReduction for floating point [v2] In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 09:39:24 GMT, Fei Gao wrote: >> The patch optimizes floating-point AddReduction for Vector API on NEON via faddp instructions [1]. >> >> Take AddReductionVF with 128-bit as an example. >> >> Here is the assembly code before the patch: >> >> fadd s18, s17, s16 >> mov v19.s[0], v16.s[1] >> fadd s18, s18, s19 >> mov v19.s[0], v16.s[2] >> fadd s18, s18, s19 >> mov v19.s[0], v16.s[3] >> fadd s18, s18, s19 >> >> >> Here is the assembly code after the patch: >> >> faddp v19.4s, v16.4s, v16.4s >> faddp s18, v19.2s >> fadd s18, s18, s17 >> >> >> As we can see, the patch adds all vector elements via faddp instructions and then adds beginning value, which is different from the old code, i.e., adding vector elements sequentially from beginning to end. It helps reduce four instructions for each AddReductionVF. >> >> But it may concern us that the patch will cause precision loss and generate incorrect results if superword vectorizes these java operations, because Java specifies a clear standard about precision for floating-point add reduction, which requires that we must add vector elements sequentially from beginning to end. Fortunately, we can enjoy the benefit but don't need to pay for the precision loss. Here are the reasons: >> >> 1. 
[JDK-8275275](https://bugs.openjdk.org/browse/JDK-8275275) disabled AddReductionVF/D for superword on NEON since no direct NEON instructions support them and, consequently, it's not profitable to auto-vectorize them. So, the vector implementation of these two vector nodes is only used by Vector API. >> >> 2. Vector API relaxes the requirement for floating-point precision of `ADD` [2]. "The result of such an operation is a function both of the input values (vector and mask) as well as the order of the scalar operations applied to combine lane values. In such cases the order is intentionally not defined." "If the platform supports a vector instruction to add or multiply all values in the vector, or if there is some other efficient machine code sequence, then the JVM has the option of generating this machine code." To sum up, Vector API allows us to add all vector elements in an arbitrary order and then add the beginning value, to generate optimal machine code. >> >> Tier 1~3 passed with no new failures on Linux AArch64 platform. 
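To illustrate why the order of a floating-point add reduction matters at all, here is a minimal C++ sketch. It is not HotSpot code: `reduce_sequential` and `reduce_pairwise` are hypothetical names, and the sketch assumes IEEE 754 binary32 arithmetic without fast-math style reassociation. The first function models the old begin-to-end sequence of scalar `fadd`s, the second models the `faddp`/`faddp`/`fadd` sequence described above.

```cpp
#include <cstddef>

// Sequential order: ((((init + v0) + v1) + v2) + v3), as in the old code.
float reduce_sequential(float init, const float v[4]) {
  float acc = init;
  for (std::size_t i = 0; i < 4; ++i) {
    acc += v[i];
  }
  return acc;
}

// Pairwise order: ((v0 + v1) + (v2 + v3)) + init, modelling
//   faddp v.4s  -> lanes (v0 + v1) and (v2 + v3)
//   faddp s     -> sum of those two lanes
//   fadd        -> add the initial value last
float reduce_pairwise(float init, const float v[4]) {
  float lo = v[0] + v[1];
  float hi = v[2] + v[3];
  return (lo + hi) + init;
}
```

With the input {1e8f, 1.0f, -1e8f, 1.0f} and an initial value of 0, the sequential order yields 1.0f while the pairwise order yields 0.0f, because 1e8f + 1.0f rounds back to 1e8f. That difference is acceptable here only because the Vector API leaves the order undefined.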
>> >> Here is the perf data of jmh benchmark [3] for the patch: >> >> Benchmark size Mode Cnt Before After Units >> Double128Vector.addReduction 1024 thrpt 5 2167.146 2717.873 ops/ms >> Float128Vector.addReduction 1024 thrpt 5 1706.253 4890.909 ops/ms >> Float64Vector.addReduction 1024 thrpt 5 1907.425 2732.577 ops/ms >> >> [1] https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--scalar---Floating-point-Add-Pair-of-elements--scalar-- >> https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--vector---Floating-point-Add-Pairwise--vector-- >> [2] https://docs.oracle.com/en/java/javase/19/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#fp_assoc >> [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float128Vector.java#L316 >> https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float64Vector.java#L316 >> https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Double128Vector.java#L316 > > Fei Gao has updated the pull request incrementally with one additional commit since the last revision: > > Update the comments Looks good to me! ------------- Marked as reviewed by xgong (Committer). 
PR: https://git.openjdk.org/jdk/pull/11663 From haosun at openjdk.org Fri Dec 16 02:52:08 2022 From: haosun at openjdk.org (Hao Sun) Date: Fri, 16 Dec 2022 02:52:08 GMT Subject: RFR: JDK-8153837: AArch64: Handle special cases for MaxINode & MinINode [v2] In-Reply-To: References: Message-ID: <-Yd26-7XVD93kSYineugXL4nBD-mZk907Gj6JyF-yD0=.3746c63c-5377-4d72-86e2-08fff25e27b3@github.com> On Thu, 15 Dec 2022 21:07:28 GMT, Dmitry Chuyko wrote: >> This is a return to an older optimization suggestion for integer Math.min() and Math.max(). See original thread at https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2016-April/022367.html >> >> When one of the arguments is a constant 0, 1 or -1, this constant is currently materialized (for example, zero is moved to a dedicated register). Instead, we can produce these values from the ZR zero register using CSEL, CSINC or CSINV. Thus the register usage and the load instruction are removed. >> >> The implementation adds 3 additional matching rules for min and 3 for max in the aarch64.ad file. As constants currently can be in any MinI/MaxI node input, the ideal transformation for those nodes is changed to put a constant into the first input. This allows a single rule for each value instead of two. The first input is not very natural; it is selected because of the https://github.com/openjdk/jdk/pull/3513 optimization that added a right-spline transformation for MinI/MaxI. I think it can be symmetrically changed to left-spline but it was not a subject of this PR. Each match rule generates 2 'instruct' instructions following the pattern introduced in https://github.com/openjdk/jdk/commit/fde854e03779e6809279dbf85b0645eb49d8736a. They are newly added ones with one peculiarity. They try to follow the regular naming scheme but lack one of the operands, the one that corresponds to a constant that is not actually needed. E.g. 'instruct cmovI_reg_immM1_ge(iRegINoSp dst, iRegI src1, rFlagsReg cr)' that doesn't have an 'immI_M1' input. 
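The CSEL/CSINC/CSINV idea can be modelled in a few lines of C++. This is only a sketch of the arithmetic, not the aarch64.ad rules: all function names are hypothetical, the comparison is assumed to be against zero, and apart from the `ge` condition visible in `cmovI_reg_immM1_ge` the conditions chosen below are assumptions that merely make the identities hold.

```cpp
#include <algorithm>
#include <cstdint>

// Models of the conditional selects with ZR as the second source register.
inline int32_t csel_zr(bool cond, int32_t rn)  { return cond ? rn : 0; }   // CSEL:  else ZR
inline int32_t csinc_zr(bool cond, int32_t rn) { return cond ? rn : 1; }   // CSINC: else ZR + 1
inline int32_t csinv_zr(bool cond, int32_t rn) { return cond ? rn : -1; }  // CSINV: else ~ZR

// Each min/max with a constant 0, 1 or -1 becomes: compare x with zero, then select.
int32_t max_0(int32_t x)  { return csel_zr(x > 0, x); }
int32_t min_0(int32_t x)  { return csel_zr(x < 0, x); }
int32_t max_1(int32_t x)  { return csinc_zr(x > 0, x); }
int32_t min_1(int32_t x)  { return csinc_zr(x <= 0, x); }
int32_t max_m1(int32_t x) { return csinv_zr(x >= 0, x); }
int32_t min_m1(int32_t x) { return csinv_zr(x < 0, x); }

// Cross-check against the generic library implementation.
bool agrees_with_std(int32_t x) {
  return max_0(x)  == std::max<int32_t>(x, 0)  &&
         min_0(x)  == std::min<int32_t>(x, 0)  &&
         max_1(x)  == std::max<int32_t>(x, 1)  &&
         min_1(x)  == std::min<int32_t>(x, 1)  &&
         max_m1(x) == std::max<int32_t>(x, -1) &&
         min_m1(x) == std::min<int32_t>(x, -1);
}
```

The point of the model is that no register has to hold the constant: the "else" value always comes from the zero register, possibly incremented or inverted.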
>> >> New TestMinMaxIntrinsics jtreg test compares results produced by intrinsics for generic and specialized versions with Java implementation of min and max. Intrinsics are used in lambdas that are compiled with the Whitebox API. The cases include -1, 0, 1 and a couple of regular values. This test can be used to check the generated assembly by adding `-XX:+PrintCompilation -XX:+PrintOptoAssembly`. Nano-benchmarks where a specialized version is called from a not inlined method also show the changed code with `-prof perfasm`. >> >> Typical nano-benchmark with a loop and a Blackhole over array shows no difference in performance as the constant is anyway moved out of the loop and usually there are enough registers. However special nano-benchmarks can be considered, e.g. >> >> >> @Benchmark >> @OperationsPerInvocation(TESTSIZE) >> public int max0_use8_i() { >> int sum = 0; >> for(int i = 0; i < TESTSIZE; i++) { >> use8(0, 1, 2, 3, 4, 5, 6, 7); >> sum += Math.max(i, 0); >> } >> return sum; >> } >> >> @CompilerControl(CompilerControl.Mode.DONT_INLINE) >> public void use8(int p0, int p1, int p2, int p3, int p4, int p5, int p6, int p7) { >> } >> >> >> Saving ~1 L1 icache load makes it ~9% faster, and more than a half of the cost is use8() helper. >> >> New version passes new TestMinMaxIntrinsics test on x86 and aarch64 and tier1,2 tests on that platforms (release build). >> >> Zero case is especially interesting. In general, I wonder how it could be possible to allocate ZR as a register for zero constants so the load there could be a no-op and we might get rid of special rules with immI0. > > Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: > > Reverted Ideal change, moved definitions to m4 src/hotspot/cpu/aarch64/aarch64_ad.m4 line 555: > 553: > 554: ins_encode %{ > 555: __ $2(as_Register($dst$$reg), I wonder if it would be better to use `$dst$$Register` here and several other sites in this patch. 
Suggestion: __ $2($dst$$Register, ------------- PR: https://git.openjdk.org/jdk/pull/11570 From kbarrett at openjdk.org Fri Dec 16 04:48:45 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 16 Dec 2022 04:48:45 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v4] In-Reply-To: References: Message-ID: > Please review this change to construction and copying of the Relocation and > RelocationHolder classes, to eliminate some questionable C++ usage. > > The array type for RelocationHandle::_relocbuf is changed from void* to char, > because using a char array for raw memory is countenanced by the standard, > while not so much for an array of void*. The desired alignment is maintained > via a union, since using alignas is not (yet) permitted in HotSpot code. > > There is also now a comment discussing the use of _relocbuf in more detail, > including some areas of continued sketchiness with respect to standard conformance and > reliance on implementation-dependent behavior. > > No longer use trivial copy and assignment for RelocationHolder, since that > isn't technically correct. The Relocation in the holder is not trivially > copyable, since it is polymorphic. It seemed to work in practice with the > supported compilers, but we shouldn't (and don't need to) rely on it. Instead > we have a new virtual function Relocation::copy_into that copies the most > derived object into the holder's _relocbuf using placement new. > > Eliminated the implicit conversion constructor from Relocation to holder that > wordwise copied (to possibly beyond the end of) the Relocation into the > holder's _relocbuf. We could have implemented this more carefully with the > new approach (using copy_into), but we don't actually need this conversion. > The only use of it was actually a wasted copy (in assembler_x86.cpp). > > Eliminated the use of placement new syntax via operator new with a holder > argument to copy a Resource object into a holder. 
This included runtime > verification that the size of the object is not greater than the size of > _relocbuf; we now do corresponding verification at compile-time. This also > included an incorrect attempt at a runtime check that the Relocation base > class would be at the same address as the derived class being constructed; we > now perform that check correctly. We also discuss in a comment the layout > assumption being made (that isn't required by the standard but is provided by > all supported compilers), and what to do if we encounter a compiler that > behaves differently. > > Eliminated the idiom of making a default-constructed holder and then > overwriting its held relocation with a newly constructed one, using the afore > mentioned (and eliminated) operator new. Instead, RelocationHolder now has a > factory function template (construct) for creating holders with a given > relocation type, constructed using provided arguments. (The arguments are taken > as const-ref rather than using perfect forwarding, as the tools for the latter > are not (yet) approved for use in HotSpot. Using const-ref is good enough in > this case.) > > Describe and verify other assumptions being made, such as all Relocation > classes being trivially destructible. > > Testing: > mach5 tier1-5 > > Future work: > > * RelocationHolder::reloc isn't const-correct. Making it so will require > adjustment of some callers. I'll follow up with an RFE to address this. > > * Relocation classes have many virtual function overrides that are unmarked. > I'll follow up with an RFE to add "override" specifiers. > > Potential issue: The removal of RelocationHolder(Relocation*) might not work > for some platforms. I've tested on platforms supported by Oracle (where there > was only one (mistaken) use). There might be uses by other platforms. 
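The shape of the design described above (a char buffer, a virtual copy_into using placement new, and a construct factory with compile-time checks) can be sketched roughly as follows. This is not the actual relocInfo.hpp code: `Reloc`, `Holder`, `OffsetReloc`, `emplace_copy` and `buf_size` are hypothetical stand-ins, and the `reinterpret_cast` relies on the same layout assumption the description mentions (base subobject at the start of the derived object).

```cpp
#include <cstddef>
#include <new>
#include <type_traits>

class Holder;

// Hypothetical stand-in for Relocation: polymorphic, implicit (trivial) destructor.
class Reloc {
 public:
  virtual int kind() const { return 0; }
  // Copies the most derived object into the holder's buffer.
  inline virtual void copy_into(Holder& holder) const;
};

// Hypothetical stand-in for RelocationHolder.
class Holder {
  static constexpr std::size_t buf_size = 4 * sizeof(void*);
  alignas(std::max_align_t) char _buf[buf_size];  // raw storage for one Reloc

 public:
  Holder() { ::new (_buf) Reloc(); }
  // Copying goes through the virtual function, never through a bytewise copy.
  Holder(const Holder& other) { other.reloc()->copy_into(*this); }
  Holder& operator=(const Holder& other) {
    // Trivial destructors make it safe to construct over the old object.
    if (this != &other) other.reloc()->copy_into(*this);
    return *this;
  }

  // Assumes the Reloc base sits at the start of whatever was constructed in _buf.
  const Reloc* reloc() const { return reinterpret_cast<const Reloc*>(_buf); }

  template <typename R>
  void emplace_copy(const R& r) {
    static_assert(std::is_base_of<Reloc, R>::value, "R must derive from Reloc");
    static_assert(sizeof(R) <= buf_size, "R does not fit in the buffer");
    static_assert(std::is_trivially_destructible<R>::value,
                  "R must be trivially destructible");
    ::new (_buf) R(r);
  }

  // Factory: arguments are taken by const-ref rather than perfectly forwarded.
  template <typename R, typename... Args>
  static Holder construct(const Args&... args) {
    Holder holder;
    holder.emplace_copy(R(args...));
    return holder;
  }
};

inline void Reloc::copy_into(Holder& holder) const { holder.emplace_copy(*this); }

// Hypothetical derived relocation carrying two integers.
class OffsetReloc : public Reloc {
  int _index;
  int _offset;
 public:
  OffsetReloc(int index, int offset) : _index(index), _offset(offset) {}
  int kind() const override { return 1; }
  int index() const { return _index; }
  int offset() const { return _offset; }
  void copy_into(Holder& holder) const override { holder.emplace_copy(*this); }
};
```

A caller would write something like `Holder h = Holder::construct<OffsetReloc>(3, 7);` and copy `h` freely; each copy dispatches to the most derived `copy_into`, so nothing is sliced and nothing is copied past the end of the object.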
Kim Barrett has updated the pull request incrementally with two additional commits since the last revision: - more comments about trivial relocation destructors - reinstate named args per dlong review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11618/files - new: https://git.openjdk.org/jdk/pull/11618/files/ebe6e01d..90fd6389 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11618&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11618&range=02-03 Stats: 13 lines in 2 files changed: 9 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/11618.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11618/head:pull/11618 PR: https://git.openjdk.org/jdk/pull/11618 From kbarrett at openjdk.org Fri Dec 16 04:48:45 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 16 Dec 2022 04:48:45 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v2] In-Reply-To: <6HhzRpDqYnqA0tFlewoHqoGGrhE1YacqAIXoxss4ibg=.bc80bfc5-9992-4be0-bc19-eebd3a106e62@github.com> References: <7udKKpQ1XY438VTwGKXFuUoiRC4-PRiEQOxLbpEdZV8=.a4b60b9f-90f7-4dde-aaa9-90d3658f1ded@github.com> <6HhzRpDqYnqA0tFlewoHqoGGrhE1YacqAIXoxss4ibg=.bc80bfc5-9992-4be0-bc19-eebd3a106e62@github.com> Message-ID: On Thu, 15 Dec 2022 21:34:02 GMT, John R Rose wrote: > I suggest putting all the static asserts in one place, instead of two of them in place A and the third (trivial destructor) in place B. The static asserts about class relationships and size are checked where they are relied upon. The trivial destructor assumption is relied upon elsewhere (holder assignment, copy_into implementations, and ~Relocation) and not there. There isn't really a single good place to verify assumption. I've expanded on comments to indicate where that assumption comes into play. 
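The trivial-destructor assumption mentioned above can be checked mechanically with standard type traits. The sketch below uses hypothetical class names (none of them exist in HotSpot) to show that a polymorphic class keeps a trivial destructor as long as the destructor is implicit and non-virtual, and loses it once the base declares a virtual destructor.

```cpp
#include <type_traits>

// Polymorphic, but with implicit non-virtual destructors: still trivially destructible.
struct PlainBase {
  virtual int kind() const { return 0; }
};
struct PlainDerived : PlainBase {
  int offset = 0;
  int kind() const override { return 1; }
};

// A virtual destructor in the base makes it, and every derived class, non-trivial.
struct VirtualDtorBase {
  virtual ~VirtualDtorBase() = default;
};
struct VirtualDtorDerived : VirtualDtorBase {};

// True when a new object may be constructed over an old T without destroying it first.
template <typename T>
constexpr bool can_overwrite_in_place() {
  return std::is_trivially_destructible<T>::value;
}

static_assert(can_overwrite_in_place<PlainBase>(), "implicit destructor is trivial");
static_assert(can_overwrite_in_place<PlainDerived>(), "derived stays trivial");
static_assert(!can_overwrite_in_place<VirtualDtorBase>(), "virtual destructor is not trivial");
static_assert(!can_overwrite_in_place<VirtualDtorDerived>(), "non-triviality is inherited");
```

Because the checks are `static_assert`s, a relocation type that accidentally gains a non-trivial destructor would fail to compile rather than misbehave at run time.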
------------- PR: https://git.openjdk.org/jdk/pull/11618 From kbarrett at openjdk.org Fri Dec 16 04:48:46 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 16 Dec 2022 04:48:46 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v2] In-Reply-To: References: <7udKKpQ1XY438VTwGKXFuUoiRC4-PRiEQOxLbpEdZV8=.a4b60b9f-90f7-4dde-aaa9-90d3658f1ded@github.com> Message-ID: On Thu, 15 Dec 2022 12:00:32 GMT, Jorn Vernee wrote: >> Kim Barrett has updated the pull request incrementally with one additional commit since the last revision: >> >> blank lines in include blocks > > src/hotspot/share/code/relocInfo.hpp line 859: > >> 857: // We never heap allocate a Relocation, so never delete through a base pointer. >> 858: // RelocationHolder depends on (and verifies) the destructor for all relocation >> 859: // types is trivial, so can't be virtual. > > Should this be: > Suggestion: > > // types is trivial, so can be non-virtual. > > ? We have a requirement that the derived classes have trivial destructors (so we know it's safe to just construct over the storage without first destructing the old object). Hence this destructor must *not* be virtual, as that would make it and destructors for derived classes not trivial. I expanded the comment a bit to clarify that. ------------- PR: https://git.openjdk.org/jdk/pull/11618 From kbarrett at openjdk.org Fri Dec 16 04:48:46 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 16 Dec 2022 04:48:46 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v3] In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 02:16:51 GMT, Dean Long wrote: >> Kim Barrett has updated the pull request incrementally with two additional commits since the last revision: >> >> - use alignas >> - simplify per jvernee > > src/hotspot/share/code/relocInfo.hpp line 992: > >> 990: assert(relocInfo::mustIterateImmediateOopsInCode(), >> 991: "Must return true so we will search for oops as roots etc. 
in the code."); >> 992: return RelocationHolder::construct(0, 0); > > I prefer how it was before, where the arguments have names and comments. Done. > src/hotspot/share/code/relocInfo.hpp line 1041: > >> 1039: // an metadata in the instruction stream >> 1040: static RelocationHolder spec_for_immediate() { >> 1041: return RelocationHolder::construct(0, 0); > > I'd rather see (metadata_index, offset) than (0, 0), but I guess the meaning of the argument can be found in spec() above. Done. ------------- PR: https://git.openjdk.org/jdk/pull/11618 From kvn at openjdk.org Fri Dec 16 05:14:04 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 16 Dec 2022 05:14:04 GMT Subject: RFR: 8298848: C2: clone all of (CmpP (LoadKlass (AddP down at split if In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 10:53:37 GMT, Roland Westrelin wrote: > As suggested by Vladimir in: > https://github.com/openjdk/jdk/pull/11666 > > Thus extract one for the fixes as a separate PR. The bug as described > in the above PR is: > > The crash occurs because a` (If (Bool (CmpP (LoadKlass ..))))` > only has a single projection. It lost the other projection because of > a `CheckCastPP` that becomes `top`. Initially the pattern is, in pseudo > code: > > > if (obj.klass == some_class) { > obj = CheckCastPP#1(obj); > } > > > `obj` itself is a `CheckCastPP` that's pinned at a dominating if. That > dominating if goes through split through phi. The `LoadKlass` for the > pseudo code above also has control set to the dominating if being > transformed. This result in: > > > if (phi1 == some_class) { > obj = CheckCastPP#1(phi2); > } > > > with` phi1 = (Phi (LoadKlass obj) (LoadKlass obj))` and phi2 = (Phi obj obj) > with `obj = (CheckCastPP#2 obj')` > > `PhiNode::Ideal()` transforms `phi2` into a new `CheckCastPP`: > `(CheckCastPP#3 obj' obj') `with control set to the region right above > the if in the pseudo code above. 
There happens to be another > `CheckCastPP` at the same control which casts obj' to a narrower > type. So the new `CheckCastPP#3` is replaced by that one (because of > `ConstraintCastNode::dominating_cast()`) and pseudo code becomes: > > > if (phi1 == some_class) { > obj = CheckCastPP#1(CheckCastPP#4(obj')); > } > > > and then: > > > if (phi1 == some_class) { > obj = top; > } > > > because the types of the 2 `CheckCastPP`s conflict. That would be ok if: > > `phi1 == some_class` > > would constant fold. It would if the test was: > > `if (CheckCastPP#4(obj').klass == some_klass) { > ` > but because of split if, the `(CmpP (LoadKlass ..))` and the > `CheckCastPP#1` ended up with 2 different object inputs that then were > transformed differently. The fix I propose is to have split if clone the entire: > > `(Bool (CmpP (LoadKlass (AddP ..))))` > > down the same way `(Bool (CmpP ..))` is cloned down. After split if, the > pseudo code becomes: > > > if (phi.klass == some_class) { > obj = CheckCastPP#1(phi); > } > > > The bug can't occur because the `CheckCastPP` and` (CmpP (LoadKlass ..))` > operate on the same phi input. The change in split_if.cpp implements > that. My tier1-4, xcomp, stress testing passed. You need second review. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11689 From kbarrett at openjdk.org Fri Dec 16 06:00:39 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 16 Dec 2022 06:00:39 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v5] In-Reply-To: References: Message-ID: > Please review this change to construction and copying of the Relocation and > RelocationHolder classes, to eliminate some questionable C++ usage. > > The array type for RelocationHandle::_relocbuf is changed from void* to char, > because using a char array for raw memory is countenanced by the standard, > while not so much for an array of void*. 
The desired alignment is maintained > via a union, since using alignas is not (yet) permitted in HotSpot code. > > There is also now a comment discussing the use of _relocbuf in more detail, > including some areas of continued sketchiness with respect to standard conformance and > reliance on implementation-dependent behavior. > > No longer use trivial copy and assignment for RelocationHolder, since that > isn't technically correct. The Relocation in the holder is not trivially > copyable, since it is polymorphic. It seemed to work in practice with the > supported compilers, but we shouldn't (and don't need to) rely on it. Instead > we have a new virtual function Relocation::copy_into that copies the most > derived object into the holder's _relocbuf using placement new. > > Eliminated the implicit conversion constructor from Relocation to holder that > wordwise copied (to possibly beyond the end of) the Relocation into the > holder's _relocbuf. We could have implemented this more carefully with the > new approach (using copy_into), but we don't actually need this conversion. > The only use of it was actually a wasted copy (in assembler_x86.cpp). > > Eliminated the use of placement new syntax via operator new with a holder > argument to copy a Resource object into a holder. This included runtime > verification that the size of the object is not greater than the size of > _relocbuf; we now do corresponding verification at compile time. This also > included an incorrect attempt at a runtime check that the Relocation base > class would be at the same address as the derived class being constructed; we > now perform that check correctly. We also discuss in a comment the layout > assumption being made (that isn't required by the standard but is provided by > all supported compilers), and what to do if we encounter a compiler that > behaves differently. 
> > Eliminated the idiom of making a default-constructed holder and then > overwriting its held relocation with a newly constructed one, using the afore > mentioned (and eliminated) operator new. Instead, RelocationHolder now has a > factory function template (construct) for creating holders with a given > relocation type, constructed using provided arguments. (The arguments are taken > as const-ref rather than using perfect forwarding, as the tools for the latter > are not (yet) approved for use in HotSpot. Using const-ref is good enough in > this case.) > > Describe and verify other assumptions being made, such as all Relocation > classes being trivially destructible. > > Testing: > mach5 tier1-5 > > Future work: > > * RelocationHolder::reloc isn't const-correct. Making it so will require > adjustment of some callers. I'll follow up with an RFE to address this. > > * Relocation classes have many virtual function overrides that are unmarked. > I'll follow up with an RFE to add "override" specifiers. > > Potential issue: The removal of RelocationHolder(Relocation*) might not work > for some platforms. I've tested on platforms supported by Oracle (where there > was only one (mistaken) use). There might be uses by other platforms. 
Kim Barrett has updated the pull request incrementally with one additional commit since the last revision: forgot to update comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11618/files - new: https://git.openjdk.org/jdk/pull/11618/files/90fd6389..f9c0642e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11618&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11618&range=03-04 Stats: 7 lines in 1 file changed: 0 ins; 1 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/11618.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11618/head:pull/11618 PR: https://git.openjdk.org/jdk/pull/11618 From fgao at openjdk.org Fri Dec 16 08:15:07 2022 From: fgao at openjdk.org (Fei Gao) Date: Fri, 16 Dec 2022 08:15:07 GMT Subject: RFR: 8298244: AArch64: Optimize vector implementation of AddReduction for floating point [v2] In-Reply-To: <2yYNfUMM6MTmTXzhDs3C7TfPQw4HuHCHnQBlv4P-YGY=.63f5aa05-21e7-44ff-a0c6-f6f7663238f9@github.com> References: <2yYNfUMM6MTmTXzhDs3C7TfPQw4HuHCHnQBlv4P-YGY=.63f5aa05-21e7-44ff-a0c6-f6f7663238f9@github.com> Message-ID: On Thu, 15 Dec 2022 12:04:39 GMT, Andrew Haley wrote: >> Fei Gao has updated the pull request incrementally with one additional commit since the last revision: >> >> Update the comments > > Marked as reviewed by aph (Reviewer). Thanks for your review, @theRealAph @XiaohongGong. I'll integrate it. ------------- PR: https://git.openjdk.org/jdk/pull/11663 From shade at openjdk.org Fri Dec 16 08:33:04 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 16 Dec 2022 08:33:04 GMT Subject: RFR: 8298088: RISC-V: Make Address a discriminated union internally [v2] In-Reply-To: References: Message-ID: On Mon, 5 Dec 2022 13:57:12 GMT, Fei Yang wrote: >> The RISC-V port has the same issue as: https://bugs.openjdk.org/browse/JDK-8297830 >> It looks to me that the fix for the AArch64 port is a nice refactoring work. >> This fixes this issue for the RISC-V port with a similar approach. 
>> >> Testing: >> Tier1 tested with release build on linux-riscv64 unmatched board. >> Run non-trivial benchmark workloads with fastdebug builds. > > Fei Yang has updated the pull request incrementally with one additional commit since the last revision: > > Review Okay then! ------------- Marked as reviewed by shade (Reviewer). PR: https://git.openjdk.org/jdk/pull/11505 From fyang at openjdk.org Fri Dec 16 08:50:15 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 16 Dec 2022 08:50:15 GMT Subject: RFR: 8298088: RISC-V: Make Address a discriminated union internally [v2] In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 08:30:00 GMT, Aleksey Shipilev wrote: > Okay then! Thanks for the review. Let's ------------- PR: https://git.openjdk.org/jdk/pull/11505 From pli at openjdk.org Fri Dec 16 09:05:39 2022 From: pli at openjdk.org (Pengfei Li) Date: Fri, 16 Dec 2022 09:05:39 GMT Subject: RFR: 8298632: [TESTBUG] Add IR checks in jtreg vectorization tests [v2] In-Reply-To: References: Message-ID: > In JDK-8183390, we introduced a new auto-vectorization testing framework in hotspot jtreg tests. The new framework provides us a much simpler way to add new test cases of auto-vectorizable loops. But the previous patch just added a check of the correctness of the result after vectorization. There is no check about whether the code is vectorized or not. > > This patch adds IR checks to verify C2's vectorization ability on test cases inside this framework. With this patch, each test method annotated with `@Test` is verified in two ways. First, it's invoked twice and the return results from the interpreter and C2 compiled code are compared. Second, the count of expected vector IR is checked by the IR framework if the test method has IR rule annotation. > > Ideally, we should check IR rules on all platforms. But in practice, the vectorization ability can be quite different on different platforms, or different generations of CPUs of one platform. 
So in this patch, we only add vectorizable checks for AArch64. Checks for other platforms (such as x86) can still be added later with more CPU feature conditions. We also add some negative rules (or in-vectorizable rules) with `@IR(failOn=...` on cases that should not be vectorized on any platform. > > A few more test cases are added within this patch as well. > > We tested the new IR rules on below kinds of CPUs. > - AArch64 w/ 512-bit SVE > - AArch64 w/ 128-bit SVE > - AArch64 w/o SVE (NEON only) > - x86 Pengfei Li has updated the pull request incrementally with one additional commit since the last revision: Address review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11687/files - new: https://git.openjdk.org/jdk/pull/11687/files/34248ab7..3a87694f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11687&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11687&range=00-01 Stats: 452 lines in 22 files changed: 211 ins; 2 del; 239 mod Patch: https://git.openjdk.org/jdk/pull/11687.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11687/head:pull/11687 PR: https://git.openjdk.org/jdk/pull/11687 From pli at openjdk.org Fri Dec 16 09:05:41 2022 From: pli at openjdk.org (Pengfei Li) Date: Fri, 16 Dec 2022 09:05:41 GMT Subject: RFR: 8298632: [TESTBUG] Add IR checks in jtreg vectorization tests In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 20:23:16 GMT, Vladimir Kozlov wrote: >> In JDK-8183390, we introduced a new auto-vectorization testing framework in hotspot jtreg tests. The new framework provides us a much simpler way to add new test cases of auto-vectorizable loops. But the previous patch just added a check of the correctness of the result after vectorization. There is no check about whether the code is vectorized or not. >> >> This patch adds IR checks to verify C2's vectorization ability on test cases inside this framework. With this patch, each test method annotated with `@Test` is verified in two ways. 
First, it's invoked twice and the return results from the interpreter and C2 compiled code are compared. Second, the count of expected vector IR is checked by the IR framework if the test method has IR rule annotation. >> >> Ideally, we should check IR rules on all platforms. But in practice, the vectorization ability can be quite different on different platforms, or different generations of CPUs of one platform. So in this patch, we only add vectorizable checks for AArch64. Checks for other platforms (such as x86) can still be added later with more CPU feature conditions. We also add some negative rules (or in-vectorizable rules) with `@IR(failOn=...` on cases that should not be vectorized on any platform. >> >> A few more test cases are added within this patch as well. >> >> We tested the new IR rules on below kinds of CPUs. >> - AArch64 w/ 512-bit SVE >> - AArch64 w/ 128-bit SVE >> - AArch64 w/o SVE (NEON only) >> - x86 > > Don't forget to change `applyIfCPUFeature` to `applyIfCPUFeatureOr` > > And to run these tests with different AVX configuration we need to remove `vm.flagless` from `@requires`. `-XX:UseAVX=n` flag is supported by IR testing now. Hi @vnkozlov , Thanks for your review and test. I have made below changes based on your comments. - Add `avx2` rules for x86 according to your instructions - Reduce `SIZE` (some test code is updated to align with the reduced `SIZE`) > And to run these tests with different AVX configuration we need to remove vm.flagless from `@requires`. Unfortunately, for now we cannot remove `vm.flagless` from `@requires`. As these test methods are also used for correctness check (each test method is invoked twice and the return results from the interpreter and C2 compiled code are compared) and we are using compiler control via WhiteBox API to force these methods running in interpreter and C2. The compiler control won't work if some extra vm option is specified. 
------------- PR: https://git.openjdk.org/jdk/pull/11687 From pli at openjdk.org Fri Dec 16 09:21:48 2022 From: pli at openjdk.org (Pengfei Li) Date: Fri, 16 Dec 2022 09:21:48 GMT Subject: RFR: 8298632: [TESTBUG] Add IR checks in jtreg vectorization tests In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 20:23:16 GMT, Vladimir Kozlov wrote: >> In JDK-8183390, we introduced a new auto-vectorization testing framework in hotspot jtreg tests. The new framework provides us a much simpler way to add new test cases of auto-vectorizable loops. But the previous patch just added a check of the correctness of the result after vectorization. There is no check about whether the code is vectorized or not. >> >> This patch adds IR checks to verify C2's vectorization ability on test cases inside this framework. With this patch, each test method annotated with `@Test` is verified in two ways. First, it's invoked twice and the return results from the interpreter and C2 compiled code are compared. Second, the count of expected vector IR is checked by the IR framework if the test method has IR rule annotation. >> >> Ideally, we should check IR rules on all platforms. But in practice, the vectorization ability can be quite different on different platforms, or different generations of CPUs of one platform. So in this patch, we only add vectorizable checks for AArch64. Checks for other platforms (such as x86) can still be added later with more CPU feature conditions. We also add some negative rules (or in-vectorizable rules) with `@IR(failOn=...` on cases that should not be vectorized on any platform. >> >> A few more test cases are added within this patch as well. >> >> We tested the new IR rules on below kinds of CPUs. 
>> - AArch64 w/ 512-bit SVE >> - AArch64 w/ 128-bit SVE >> - AArch64 w/o SVE (NEON only) >> - x86 > > Don't forget to change `applyIfCPUFeature` to `applyIfCPUFeatureOr` > > And to run these tests with different AVX configuration we need to remove `vm.flagless` from `@requires`. `-XX:UseAVX=n` flag is supported by IR testing now. BTW, previously I would like to add IR checks for x86 as well. But I don't have various generations of x86 machines to test so I don't have enough confidence to add them. Thanks @vnkozlov for your test effort on avx2. I would appreciate if someone from Intel (maybe @jbhateja) may help verify the x86 rules. ------------- PR: https://git.openjdk.org/jdk/pull/11687 From fyang at openjdk.org Fri Dec 16 09:28:04 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 16 Dec 2022 09:28:04 GMT Subject: Integrated: 8298088: RISC-V: Make Address a discriminated union internally In-Reply-To: References: Message-ID: On Mon, 5 Dec 2022 08:42:22 GMT, Fei Yang wrote: > The RISC-V port has the same issue as: https://bugs.openjdk.org/browse/JDK-8297830 > It looks to me that the fix for the AArch64 port is a nice refactoring work. > This fixes this issue for the RISC-V port with a similar approach. > > Testing: > Tier1 tested with release build on linux-riscv64 unmatched board. > Run non-trivial benchmark workloads with fastdebug builds. This pull request has now been integrated. 
Changeset: 226e579c Author: Fei Yang URL: https://git.openjdk.org/jdk/commit/226e579c3004a37a09f3329a8ef09c0933126bd6 Stats: 141 lines in 2 files changed: 92 ins; 11 del; 38 mod 8298088: RISC-V: Make Address a discriminated union internally Reviewed-by: fjiang, yadongwang, shade ------------- PR: https://git.openjdk.org/jdk/pull/11505 From duke at openjdk.org Fri Dec 16 09:37:57 2022 From: duke at openjdk.org (Damon Fenacci) Date: Fri, 16 Dec 2022 09:37:57 GMT Subject: RFR: 8297801: printnm crashes with invalid address due to null pointer dereference Message-ID: `printnm` crashes if you pass an invalid address. This is caused by a null pointer dereference. Checking for NULL reference before checking if blob is a method. ------------- Commit messages: - JDK-8297801: printnm crashes with invalid address due to null pointer dereference Changes: https://git.openjdk.org/jdk/pull/11697/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11697&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297801 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11697.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11697/head:pull/11697 PR: https://git.openjdk.org/jdk/pull/11697 From thartmann at openjdk.org Fri Dec 16 09:37:57 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 16 Dec 2022 09:37:57 GMT Subject: RFR: 8297801: printnm crashes with invalid address due to null pointer dereference In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 16:10:07 GMT, Damon Fenacci wrote: > `printnm` crashes if you pass an invalid address. This is caused by a null pointer dereference. > Checking for NULL reference before checking if blob is a method. Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11697 From duke at openjdk.org Fri Dec 16 09:38:10 2022 From: duke at openjdk.org (Damon Fenacci) Date: Fri, 16 Dec 2022 09:38:10 GMT Subject: RFR: 8298736: Revisit usages of log10 in compiler code Message-ID: The use of the Math library `log10` function causes an overloading ambiguity error on SPARC when using it with integer typed parameters. * adding a `static_cast` to the parameter * using the lib `log10` function (with `static_cast`s) instead of a custom one in `src/hotspot/share/opto/node.cpp` ------------- Commit messages: - JDK-8298736: Revisit usages of log10 in compiler code Changes: https://git.openjdk.org/jdk/pull/11686/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11686&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298736 Stats: 14 lines in 2 files changed: 0 ins; 11 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/11686.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11686/head:pull/11686 PR: https://git.openjdk.org/jdk/pull/11686 From thartmann at openjdk.org Fri Dec 16 09:38:11 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 16 Dec 2022 09:38:11 GMT Subject: RFR: 8298736: Revisit usages of log10 in compiler code In-Reply-To: References: Message-ID: <6jZjyPnQWx5MezVvn8n4pIVyrR9_9vqV3iOcXxUVvKM=.261f4514-b2a4-41cf-a99c-98009436e560@github.com> On Thu, 15 Dec 2022 07:54:18 GMT, Damon Fenacci wrote: > The use of the Math library `log10` function causes an overloading ambiguity error on SPARC when using it with integer typed parameters. > > * adding a `static_cast` to the parameter > * using the lib `log10` function (with `static_cast`s) instead of a custom one in `src/hotspot/share/opto/node.cpp` Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11686 From epeter at openjdk.org Fri Dec 16 09:38:11 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 16 Dec 2022 09:38:11 GMT Subject: RFR: 8298736: Revisit usages of log10 in compiler code In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 07:54:18 GMT, Damon Fenacci wrote: > The use of the Math library `log10` function causes an overloading ambiguity error on SPARC when using it with integer typed parameters. > > * adding a `static_cast` to the parameter > * using the lib `log10` function (with `static_cast`s) instead of a custom one in `src/hotspot/share/opto/node.cpp` Replacing my custom `log10` with lib version looks good, we verified it manually. ------------- PR: https://git.openjdk.org/jdk/pull/11686 From duke at openjdk.org Fri Dec 16 09:39:17 2022 From: duke at openjdk.org (Damon Fenacci) Date: Fri, 16 Dec 2022 09:39:17 GMT Subject: RFR: 8295661: CompileTask::compile_id() should be passed as int Message-ID: Changed return type of `CompileTask::compile_id()` from `int` to `uint`. Also modified the type everywhere `compile_id` is used (parameters, local variables, string formatting, constant assignments) *as much as possible*. Added *asserts* to check for valid value range where not possible. ------------- Commit messages: - JDK-8295661 correct indentation and update copyright year - JDK-8295661 change type of _compile_id field from uint to int and fix all inconsistencies (make al compile IDs int) - Revert "JDK-8295661: CompileTask::compile_id() should return uint instead of int" - Revert "JDK-8295661 fix assert conditions" - Revert "JDK-8295661 fix indentation/white spaces" - Revert "JDK-8295661 revert wrong parameter name change" - Revert "JDK-8295661 update copyright date" - Revert "Update src/hotspot/share/c1/c1_Compilation.cpp" - Revert "JDK-8295661 review fixes" - JDK-8295661 review fixes - ... 
and 6 more: https://git.openjdk.org/jdk/compare/a37de62d...b9111ef5 Changes: https://git.openjdk.org/jdk/pull/11630/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11630&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295661 Stats: 47 lines in 9 files changed: 0 ins; 2 del; 45 mod Patch: https://git.openjdk.org/jdk/pull/11630.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11630/head:pull/11630 PR: https://git.openjdk.org/jdk/pull/11630 From thartmann at openjdk.org Fri Dec 16 09:39:18 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 16 Dec 2022 09:39:18 GMT Subject: RFR: 8295661: CompileTask::compile_id() should be passed as int In-Reply-To: References: Message-ID: On Mon, 12 Dec 2022 11:29:55 GMT, Damon Fenacci wrote: > Changed return type of `CompileTask::compile_id()` from `int` to `uint`. > Also modified the type everywhere `compile_id` is used (parameters, local variables, string formatting, constant assignments) *as much as possible*. > Added *asserts* to check for valid value range where not possible. Looks good, I added a few comments. Someone more familiar with JVMCI should have a look as well. Marked as reviewed by thartmann (Reviewer). Using unsigned int consistently feels appropriate to represent a compilation id, especially since we are already mixing unsigned and signed integers. But given the impact on JVMCI and the leakage into Java code which does not support an unsigned int, it's reasonable to use int consistently instead. The updated changes look good to me. Thanks for making these changes. Looks good! Thanks Tom. The intention of this change was mainly consistency, not avoiding a potential overflow. 
src/hotspot/share/c1/c1_Compilation.cpp line 616: > 614: Compilation::~Compilation() { > 615: // simulate crash during compilation > 616: assert(CICrashAt >= UINT_MAX || _env->compile_id() != CICrashAt, "just as planned"); I'm wondering if we should remove the `CICrashAt >= UINT_MAX` part to catch an UINT overflow before it can happen? Suggestion: assert(_env->compile_id() != CICrashAt, "just as planned"); src/hotspot/share/jvmci/jvmciCompilerToVM.cpp line 1182: > 1180: C2V_END > 1181: > 1182: C2V_VMENTRY_0(juint, allocateCompileId, (JNIEnv* env, jobject, ARGUMENT_PAIR(method), int entry_bci)) This is called from Java code which does not have an unsigned int. I think we need to leave this as `jint` here. src/hotspot/share/jvmci/jvmciEnv.cpp line 1625: > 1623: if (cb == (CodeBlob*) code) { > 1624: nmethod* nm = cb->as_nmethod_or_null(); > 1625: assert(compile_id_snapshot >= 0, "negative compile id snapshot"); That's something to check with the JVMCI experts but I'm wondering why `compile_id_snapshot` is a jlong whereas the compile id is a jint. src/hotspot/share/jvmci/jvmciRuntime.cpp line 2025: > 2023: assert(compile_state->task()->compile_id() <= INT_MAX, "compile id too big"); > 2024: JVMCIObject result_object = JVMCIENV->call_HotSpotJVMCIRuntime_compileMethod(receiver, jvmci_method, entry_bci, > 2025: (jlong) compile_state, (int) compile_state->task()->compile_id()); Should be a cast to `jint`, right? Suggestion: (jlong) compile_state, (jint) compile_state->task()->compile_id()); src/hotspot/share/opto/idealGraphPrinter.cpp line 231: > 229: } > 230: > 231: void IdealGraphPrinter::print_prop(const char *name, unsigned int val) { Did you test these changes with IGV? src/hotspot/share/runtime/sharedRuntime.cpp line 3106: > 3104: > 3105: const uint compile_id = CompileBroker::assign_compile_id(method, CompileBroker::standard_entry_bci); > 3106: But `compile_id` could still be 0, right? 
------------- PR: https://git.openjdk.org/jdk/pull/11630 Marked as reviewed by thartmann (Reviewer). From duke at openjdk.org Fri Dec 16 09:39:18 2022 From: duke at openjdk.org (Damon Fenacci) Date: Fri, 16 Dec 2022 09:39:18 GMT Subject: RFR: 8295661: CompileTask::compile_id() should be passed as int In-Reply-To: References: Message-ID: <0JZT6zek7pAk2X2st0HlV57CZoJlpbYoYa5DelgzJ6w=.e920cc74-eeaf-4d22-94bb-a55e7070b6db@github.com> On Mon, 12 Dec 2022 11:29:55 GMT, Damon Fenacci wrote: > Changed return type of `CompileTask::compile_id()` from `int` to `uint`. > Also modified the type everywhere `compile_id` is used (parameters, local variables, string formatting, constant assignments) *as much as possible*. > Added *asserts* to check for valid value range where not possible. Thanks a lot for the review. I should have addressed all your comments. ------------- PR: https://git.openjdk.org/jdk/pull/11630 From dnsimon at openjdk.org Fri Dec 16 09:39:18 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 16 Dec 2022 09:39:18 GMT Subject: RFR: 8295661: CompileTask::compile_id() should be passed as int In-Reply-To: References: Message-ID: On Mon, 12 Dec 2022 11:29:55 GMT, Damon Fenacci wrote: > Changed return type of `CompileTask::compile_id()` from `int` to `uint`. > Also modified the type everywhere `compile_id` is used (parameters, local variables, string formatting, constant assignments) *as much as possible*. > Added *asserts* to check for valid value range where not possible. The JVMCI changes in this PR look correct. @tkrodriguez could you also please look them over. ------------- Marked as reviewed by dnsimon (Committer).
PR: https://git.openjdk.org/jdk/pull/11630 From never at openjdk.org Fri Dec 16 09:39:18 2022 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 16 Dec 2022 09:39:18 GMT Subject: RFR: 8295661: CompileTask::compile_id() should be passed as int In-Reply-To: References: Message-ID: On Mon, 12 Dec 2022 11:29:55 GMT, Damon Fenacci wrote: > Changed return type of `CompileTask::compile_id()` from `int` to `uint`. > Also modified the type everywhere `compile_id` is used (parameters, local variables, string formatting, constant assignments) *as much as possible*. > Added *asserts* to check for valid value range where not possible. Marked as reviewed by never (Reviewer). Why isn't the proper fix to simply change the `CompileTask::_compile_id` to an int? Every other piece of code thinks it's an int so why should that declaration be controlling? `CompileBroker::assign_compile_id` returns int which seems much more definitive to me. Is there some known problem with running out of compile ids in the signed range? If this range is really being promoted to unsigned int then I think this needs to be properly exposed all the way through the JVMCI API with some new API to expose it as a long. Simply adding asserts for something we believe can legally occur really isn't sufficient. If we were really worried about overflow then I think it should be promoted to a jlong since the difference between the signed and unsigned range is really relatively small. I think the required changes to JVMCI could be made in a backward compatible way where a new API treats it as long. The primary place where this is exposed is in `HotSpotCompilationRequest.getId`. The old API could be made to throw an exception if it encounters a too large id. Anyway if we want to address that issue I can probably provide the required JVMCI changes.
------------- PR: https://git.openjdk.org/jdk/pull/11630 From duke at openjdk.org Fri Dec 16 09:39:19 2022 From: duke at openjdk.org (Damon Fenacci) Date: Fri, 16 Dec 2022 09:39:19 GMT Subject: RFR: 8295661: CompileTask::compile_id() should be passed as int In-Reply-To: References: Message-ID: On Mon, 12 Dec 2022 18:50:40 GMT, Tom Rodriguez wrote: >> Changed return type of `CompileTask::compile_id()` from `int` to `uint`. >> Also modified the type everywhere `compile_id` is used (parameters, local variables, string formatting, constant assignments) *as much as possible*. >> Added *asserts* to check for valid value range where not possible. > > Why isn't the proper fix to simply change the `CompileTask::_compile_id` to an int? Every other piece of code thinks it's an int so why should that declaration be controlling? `CompileBroker::assign_compile_id` returns int which seems much more definitive to me. Is there some known problem with running out of compile ids in the signed range? > > If this range is really being promoted to unsigned int then I think this needs to be properly exposed all the way through the JVMCI API with some new API to expose it as a long. Simply adding asserts for something we believe can legally occur really isn't sufficient. @tkrodriguez thanks a lot for your insight. I see your point: the pragmatic approach you propose is surely more appropriate. Reverting the change. @tkrodriguez @TobiHartmann @dougxc I've reverted the previous changes and changed `CompileTask::_compile_id` to an `int` instead. I've also changed the type of the compile id from `uint` to `int` in a few other places to make things consistent. > If we were really worried about overflow then I think it should be promoted to a jlong since the difference between the signed and unsigned range is really relatively small. I think the required changes to JVMCI could be made in a backward compatible way where a new API treats it as long.
The primary place where this is exposed is in `HotSpotCompilationRequest.getId`. The old API could be made to throw an exception if it encounters a too large id. Anyway if we want to address that issue I can probably provide the required JVMCI changes. Thanks @tkrodriguez @TobiHartmann @dougxc for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/11630 From duke at openjdk.org Fri Dec 16 09:39:19 2022 From: duke at openjdk.org (Damon Fenacci) Date: Fri, 16 Dec 2022 09:39:19 GMT Subject: RFR: 8295661: CompileTask::compile_id() should be passed as int In-Reply-To: References: Message-ID: On Mon, 12 Dec 2022 12:12:09 GMT, Tobias Hartmann wrote: >> Changed return type of `CompileTask::compile_id()` from `int` to `uint`. >> Also modified the type everywhere `compile_id` is used (parameters, local variables, string formatting, constant assignments) *as much as possible*. >> Added *asserts* to check for valid value range where not possible. > > src/hotspot/share/c1/c1_Compilation.cpp line 616: > >> 614: Compilation::~Compilation() { >> 615: // simulate crash during compilation >> 616: assert(CICrashAt >= UINT_MAX || _env->compile_id() != CICrashAt, "just as planned"); > > I'm wondering if we should remove the `CICrashAt >= UINT_MAX` part to catch an UINT overflow before it can happen? > > Suggestion: > > assert(_env->compile_id() != CICrashAt, "just as planned"); OK > src/hotspot/share/jvmci/jvmciCompilerToVM.cpp line 1182: > >> 1180: C2V_END >> 1181: >> 1182: C2V_VMENTRY_0(juint, allocateCompileId, (JNIEnv* env, jobject, ARGUMENT_PAIR(method), int entry_bci)) > > This is called from Java code which does not have an unsigned int. I think we need to leave this as `jint` here. Reverted to `jint`. I also added an assert to check for big `uint` values. 
> src/hotspot/share/jvmci/jvmciEnv.cpp line 1625: > >> 1623: if (cb == (CodeBlob*) code) { >> 1624: nmethod* nm = cb->as_nmethod_or_null(); >> 1625: assert(compile_id_snapshot >= 0, "negative compile id snapshot"); > > That's something to check with the JVMCI experts but I'm wondering why `compile_id_snapshot` is a jlong whereas the compile id is a jint. OK. Strange indeed! > src/hotspot/share/jvmci/jvmciRuntime.cpp line 2025: > >> 2023: assert(compile_state->task()->compile_id() <= INT_MAX, "compile id too big"); >> 2024: JVMCIObject result_object = JVMCIENV->call_HotSpotJVMCIRuntime_compileMethod(receiver, jvmci_method, entry_bci, >> 2025: (jlong) compile_state, (int) compile_state->task()->compile_id()); > > Should be a cast to `jint`, right? > > Suggestion: > > (jlong) compile_state, (jint) compile_state->task()->compile_id()); I'm not sure. I just noticed that the signature of the called method has an `int` argument: `JVMCIObject JVMCIEnv::call_HotSpotJVMCIRuntime_compileMethod (JVMCIObject runtime, JVMCIObject method, int entry_bci, jlong compile_state, int id)` > src/hotspot/share/opto/idealGraphPrinter.cpp line 231: > >> 229: } >> 230: >> 231: void IdealGraphPrinter::print_prop(const char *name, unsigned int val) { > > Did you test these changes with IGV? I've run `java -XX:PrintIdealGraphLevel=4 -Xcomp -XX:PrintIdealGraphFile=graph.xml` with some extra prints in the new and old `print_prop` methods (to check that they were actually used). > src/hotspot/share/runtime/sharedRuntime.cpp line 3106: > >> 3104: >> 3105: const uint compile_id = CompileBroker::assign_compile_id(method, CompileBroker::standard_entry_bci); >> 3106: > > But `compile_id` could still be 0, right? Right! I've put the assert statement back.
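For reference, the range check discussed in these comments can be sketched in isolation. This is a minimal sketch, not HotSpot code: `fits_in_jint`, `checked_compile_id_to_jint` and the `jint_t`/`juint_t` aliases are hypothetical names introduced here, and it assumes the compile id is held as an unsigned 32-bit value that has to be handed to Java as a signed 32-bit one.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-ins for HotSpot's jint/juint (assumption: both are 32 bits wide).
using jint_t  = std::int32_t;
using juint_t = std::uint32_t;

// Returns true if an unsigned compile id can be represented as a Java int.
inline bool fits_in_jint(juint_t compile_id) {
  return compile_id <= static_cast<juint_t>(std::numeric_limits<jint_t>::max());
}

// Hypothetical helper: narrows an unsigned compile id to a signed one,
// asserting (in debug builds) that no value is lost in the conversion.
inline jint_t checked_compile_id_to_jint(juint_t compile_id) {
  assert(fits_in_jint(compile_id) && "compile id too big");
  return static_cast<jint_t>(compile_id);
}
```

Keeping the compile id as a plain `int` from end to end, as the updated patch does, makes a narrowing helper of this kind unnecessary.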
------------- PR: https://git.openjdk.org/jdk/pull/11630 From dnsimon at openjdk.org Fri Dec 16 09:39:19 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 16 Dec 2022 09:39:19 GMT Subject: RFR: 8295661: CompileTask::compile_id() should be passed as int In-Reply-To: References: Message-ID: On Mon, 12 Dec 2022 13:35:26 GMT, Damon Fenacci wrote: >> src/hotspot/share/jvmci/jvmciEnv.cpp line 1625: >> >>> 1623: if (cb == (CodeBlob*) code) { >>> 1624: nmethod* nm = cb->as_nmethod_or_null(); >>> 1625: assert(compile_id_snapshot >= 0, "negative compile id snapshot"); >> >> That's something to check with the JVMCI experts but I'm wondering why `compile_id_snapshot` is a jlong whereas the compile id is a jint. > > OK. Strange indeed! It's a good question and unfortunately I cannot recall the reason. Maybe @tkrodriguez can. In any case, I don't think it impacts this change. ------------- PR: https://git.openjdk.org/jdk/pull/11630 From never at openjdk.org Fri Dec 16 09:39:19 2022 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 16 Dec 2022 09:39:19 GMT Subject: RFR: 8295661: CompileTask::compile_id() should be passed as int In-Reply-To: References: Message-ID: On Mon, 12 Dec 2022 14:44:28 GMT, Doug Simon wrote: >> OK. Strange indeed! > > It's a good question and unfortunately I cannot recall the reason. Maybe @tkrodriguez can. In any case, I don't think it impacts this change. That's far enough back that I don't really know why I used jlong. Its type should really match the real compile id type whatever that is. It's also fully internal to the JVMCI implementation so you can change it without affecting Graal. 
------------- PR: https://git.openjdk.org/jdk/pull/11630 From chagedorn at openjdk.org Fri Dec 16 09:46:12 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 16 Dec 2022 09:46:12 GMT Subject: RFR: 8298824: C2 crash: assert(is_Bool()) failed: invalid node class: ConI Message-ID: [JDK-8292289](https://bugs.openjdk.org/browse/JDK-8292289) added the following optimization to `BoolNode::Ideal()` for patterns that include `CMoveI` nodes: https://github.com/openjdk/jdk/blob/fa322e40b68abf0a253040d14414d41f4e01e028/src/hotspot/share/opto/subnode.cpp#L1465-L1472 However, we could have a `CMoveI` during IGVN that will later be folded because the `Bool` condition node was replaced by a constant but IGVN has not processed this node yet: ![Screenshot from 2022-12-16 09-40-28](https://user-images.githubusercontent.com/17833009/208068197-4819b322-604c-412e-8898-3d3546a8a663.png) We fail when trying to call `as_Bool()` on `28 ConI`. The fix is straightforward: additionally check whether we actually have a `BoolNode`. Thanks, Christian ------------- Commit messages: - 8298824: C2 crash: assert(is_Bool()) failed: invalid node class: ConI Changes: https://git.openjdk.org/jdk/pull/11705/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11705&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298824 Stats: 53 lines in 2 files changed: 52 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11705.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11705/head:pull/11705 PR: https://git.openjdk.org/jdk/pull/11705 From thartmann at openjdk.org Fri Dec 16 09:45:54 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 16 Dec 2022 09:45:54 GMT Subject: [jdk20] RFR: 8298919: Add a regression test for JDK-8298520 Message-ID: The fix for [JDK-8298520](https://bugs.openjdk.org/browse/JDK-8298520) does not include a regression test. In the meantime, the JavaFuzzer found one. Let's add a simplified version of it.
Thanks, Tobias ------------- Commit messages: - 8298919: Add a regression test for JDK-8298520 Changes: https://git.openjdk.org/jdk20/pull/44/files Webrev: https://webrevs.openjdk.org/?repo=jdk20&pr=44&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298919 Stats: 57 lines in 1 file changed: 57 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk20/pull/44.diff Fetch: git fetch https://git.openjdk.org/jdk20 pull/44/head:pull/44 PR: https://git.openjdk.org/jdk20/pull/44 From chagedorn at openjdk.org Fri Dec 16 09:56:58 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 16 Dec 2022 09:56:58 GMT Subject: RFR: 8297801: printnm crashes with invalid address due to null pointer dereference In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 16:10:07 GMT, Damon Fenacci wrote: > `printnm` crashes if you pass an invalid address. This is caused by a null pointer dereference. > Checking for NULL reference before checking if blob is a method. Looks good! Maybe we could also think about printing an error message in case we passed an invalid address to that method. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11697 From chagedorn at openjdk.org Fri Dec 16 09:56:59 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 16 Dec 2022 09:56:59 GMT Subject: RFR: 8298736: Revisit usages of log10 in compiler code In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 07:54:18 GMT, Damon Fenacci wrote: > The use of the Math library `log10` function causes an overloading ambiguity error on SPARC when using it with integer typed parameters. > > * adding a `static_cast` to the parameter > * using the lib `log10` function (with `static_cast`s) instead of a custom one in `src/hotspot/share/opto/node.cpp` Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11686 From epeter at openjdk.org Fri Dec 16 09:59:50 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 16 Dec 2022 09:59:50 GMT Subject: RFR: 8298736: Revisit usages of log10 in compiler code In-Reply-To: References: Message-ID: <9VIZuwS1aoCL72SPSTVQI6CVtmt2Xen8R6EiDmZdiD0=.c7eb4bc0-9035-4ae7-b6b6-d23bd58e4d68@github.com> On Thu, 15 Dec 2022 07:54:18 GMT, Damon Fenacci wrote: > The use of the Math library `log10` function causes an overloading ambiguity error on SPARC when using it with integer typed parameters. > > * adding a `static_cast` to the parameter > * using the lib `log10` function (with `static_cast`s) instead of a custom one in `src/hotspot/share/opto/node.cpp` Looks good! ------------- Marked as reviewed by epeter (Committer). PR: https://git.openjdk.org/jdk/pull/11686 From chagedorn at openjdk.org Fri Dec 16 10:00:00 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 16 Dec 2022 10:00:00 GMT Subject: [jdk20] RFR: 8298919: Add a regression test for JDK-8298520 In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 09:36:44 GMT, Tobias Hartmann wrote: > The fix for [JDK-8298520](https://bugs.openjdk.org/browse/JDK-8298520) does not include a regression test. In the meantime, the JavaFuzzer found one. Let's add a simplified version of it. > > Thanks, > Tobias Thanks for adding that test! Looks good and trivial. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk20/pull/44 From thartmann at openjdk.org Fri Dec 16 10:07:58 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 16 Dec 2022 10:07:58 GMT Subject: [jdk20] RFR: 8298919: Add a regression test for JDK-8298520 In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 09:36:44 GMT, Tobias Hartmann wrote: > The fix for [JDK-8298520](https://bugs.openjdk.org/browse/JDK-8298520) does not include a regression test. In the meantime, the JavaFuzzer found one. 
Let's add a simplified version of it. > > Thanks, > Tobias Thanks for the review, Christian! ------------- PR: https://git.openjdk.org/jdk20/pull/44 From duke at openjdk.org Fri Dec 16 10:50:47 2022 From: duke at openjdk.org (Damon Fenacci) Date: Fri, 16 Dec 2022 10:50:47 GMT Subject: RFR: 8265688: Unused ciMethodType::ptype_at should be removed Message-ID: <94gNzMRxD3krKE-v_RN8PyaZoyXd0bc9213kjxVsxKY=.ff6c73b1-12f8-4a4a-bf34-550f48c9354a@github.com> `ciMethodType::ptype_at` method is not used. Removing it. ------------- Commit messages: - JDK-8265688 Unused ciMethodType::ptype_at should be removed Changes: https://git.openjdk.org/jdk/pull/11708/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11708&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8265688 Stats: 12 lines in 2 files changed: 0 ins; 9 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/11708.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11708/head:pull/11708 PR: https://git.openjdk.org/jdk/pull/11708 From thartmann at openjdk.org Fri Dec 16 10:50:48 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 16 Dec 2022 10:50:48 GMT Subject: RFR: 8265688: Unused ciMethodType::ptype_at should be removed In-Reply-To: <94gNzMRxD3krKE-v_RN8PyaZoyXd0bc9213kjxVsxKY=.ff6c73b1-12f8-4a4a-bf34-550f48c9354a@github.com> References: <94gNzMRxD3krKE-v_RN8PyaZoyXd0bc9213kjxVsxKY=.ff6c73b1-12f8-4a4a-bf34-550f48c9354a@github.com> Message-ID: On Fri, 16 Dec 2022 10:28:09 GMT, Damon Fenacci wrote: > `ciMethodType::ptype_at` method is not used. > > Removing it. Looks good and trivial. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11708 From dchuyko at openjdk.org Fri Dec 16 11:08:22 2022 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Fri, 16 Dec 2022 11:08:22 GMT Subject: RFR: JDK-8153837: AArch64: Handle special cases for MaxINode & MinINode [v3] In-Reply-To: References: Message-ID: > This is a return to an older optimization suggestion for integer Math.min() and Math.max(). See original thread at https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2016-April/022367.html > > When one of the arguments is a constant 0, 1 or -1, this constant is currently materialized (for example, zero is moved to a dedicated register). Instead, we can produce these values from the zero register (ZR) using CSEL, CSINC or CSINV. Thus the register usage and the load instruction are removed. > > The implementation adds 3 additional matching rules for min and 3 for max in the aarch64.ad file. As constants currently can be in any MinI/MaxI node input, the ideal transformation for those nodes is changed to put a constant into the first input. This allows a single rule for each value instead of two. The first input is not very natural; it is selected because of the https://github.com/openjdk/jdk/pull/3513 optimization that added the right-spline transformation for MinI/MaxI. I think it can be symmetrically changed to left-spline but that was not a subject of this PR. Each match rule generates 2 'instruct' instructions following the pattern introduced in https://github.com/openjdk/jdk/commit/fde854e03779e6809279dbf85b0645eb49d8736a. They are newly added ones with one peculiarity. They try to follow the regular naming scheme but lack the operand that corresponds to a constant that is not actually needed. E.g. 'instruct cmovI_reg_immM1_ge(iRegINoSp dst, iRegI src1, rFlagsReg cr)', which does not have an 'immI_M1' input. > > New TestMinMaxIntrinsics jtreg test compares results produced by intrinsics for generic and specialized versions with Java implementation of min and max.
Intrinsics are used in lambdas that are compiled with the Whitebox API. The cases include -1, 0, 1 and a couple of regular values. This test can be used to check the generated assembly by adding `-XX:+PrintCompilation -XX:+PrintOptoAssembly`. Nano-benchmarks where a specialized version is called from a not inlined method also show the changed code with `-prof perfasm`. > > Typical nano-benchmark with a loop and a Blackhole over array shows no difference in performance as the constant is anyway moved out of the loop and usually there are enough registers. However special nano-benchmarks can be considered, e.g. > > > @Benchmark > @OperationsPerInvocation(TESTSIZE) > public int max0_use8_i() { > int sum = 0; > for(int i = 0; i < TESTSIZE; i++) { > use8(0, 1, 2, 3, 4, 5, 6, 7); > sum += Math.max(i, 0); > } > return sum; > } > > @CompilerControl(CompilerControl.Mode.DONT_INLINE) > public void use8(int p0, int p1, int p2, int p3, int p4, int p5, int p6, int p7) { > } > > > Saving ~1 L1 icache load makes it ~9% faster, and more than a half of the cost is use8() helper. > > New version passes new TestMinMaxIntrinsics test on x86 and aarch64 and tier1,2 tests on that platforms (release build). > > Zero case is especially interesting. In general, I wonder how it could be possible to allocate ZR as a register for zero constants so the load there could be a no-op and we might get rid of special rules with immI0. Dmitry Chuyko has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: - Merge branch 'openjdk:master' into JDK-8153837 - Reverted Ideal change, moved definitions to m4 - JDK-8153837: AArch64: Handle special cases for MaxINode & MinINode ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11570/files - new: https://git.openjdk.org/jdk/pull/11570/files/0b9ed33f..489a2118 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11570&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11570&range=01-02 Stats: 23445 lines in 628 files changed: 16122 ins; 4901 del; 2422 mod Patch: https://git.openjdk.org/jdk/pull/11570.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11570/head:pull/11570 PR: https://git.openjdk.org/jdk/pull/11570 From epeter at openjdk.org Fri Dec 16 11:17:15 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 16 Dec 2022 11:17:15 GMT Subject: RFR: 8296412: Special case infinite loops with unmerged backedges in IdealLoopTree::check_safepts Message-ID: <5-8UtfPNSyYrXf274GWDUs34oVWLaH6JEG_38Dr4eUQ=.501fd079-2e90-4ba7-899c-99cf1808201a@github.com> **Context** During parsing, we insert SafePoints if we jump from higher to lower bci (`maybe_add_safepoint` is called for every if, goto etc). https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/parse.hpp#L490-L494 Generally, this aligns with backedges: the assumption is that the loop-head sits at the smallest bci of all blocks in the loop. So every jump back to the loop-head goes from higher to lower bci, hence we place a SafePoint just before the jump. Also: the first `build_loop_tree` may not attach an infinite loop to the loop-tree.
If during the same loop-opts-phase we go to `beautify_loops` and it requires us rebuilding the loop-tree (eg because some other loop did `merge_many_backedges`), we call `build_loop_tree` again, and this time around we do detect the infinite loop (it now has a NeverBranch exit, so it is attached because of that). Afterwards, we call `IdealLoopTree::check_safepts`, which tries to find SafePoints on all backedges. Normally, we have SafePoints on all backedges, just before we go back to the head. **Problem case** My jasm fuzzer produced some infinite loops that have the following form: The loop head is not at the smallest bci (bytecode index) of all blocks in the loop. So the SafePoints are placed somewhere in the body of the loop, just before an if branches into the two backedges. Because this is an infinite loop, it is only attached to the loop-tree in `build_loop_tree` after `beautify_loops`, so the two backedges were not merged. When we call `IdealLoopTree::check_safepts`, we start with the inner loop, where we find the SafePoint above the if. Then we go to the outer loop. We don't find a SafePoint before we find the inner body. Now we decide to skip the inner body (which implies skipping the SafePoint in the body). The code assumes after skipping the inner loop, we are still in the outer loop. This is not true, because inner and outer loop have the same loop head (the backedges were not merged). We trigger an assert that checks that we are still in the outer loop (`nested loop`). Why did we not find this earlier? We have not extensively tested infinite loops before. Also, we have not tested loops with loop-heads that are not at the smallest bci of the loop. However, with my bytecode fuzzer I can find these issues. It is also more likely with irreducible loops: there at least one loop-entry cannot be at the smallest bci. 
Irreducible loops are not processed by `maybe_add_safepoint`, but once such a loop has only a single entry, it is no longer irreducible, and so a loop-entry that does not have the smallest bci can become the loop head. **Solution** We could fix `maybe_add_safepoint` to not depend on bci, but rather on the loop-tree from `ciTypeFlow`. That would be complex and risky, which is not justified just for infinite loops, and then only for infinite loops whose loop head is not at the lowest bci. I decided to simply special case infinite loops. I detect if we have an outer loop with the same head as an inner loop. Normally this cannot happen, because those backedges would have been merged; the exception is an infinite loop. In that case we can break the scan, as we have already reached the loop's head. ------------- Commit messages: - 8296412: Special case infinite loops with unmerged backedges in IdealLoopTree::check_safepts Changes: https://git.openjdk.org/jdk/pull/11706/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11706&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296412 Stats: 254 lines in 3 files changed: 254 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11706.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11706/head:pull/11706 PR: https://git.openjdk.org/jdk/pull/11706 From chagedorn at openjdk.org Fri Dec 16 11:31:49 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 16 Dec 2022 11:31:49 GMT Subject: RFR: 8296412: Special case infinite loops with unmerged backedges in IdealLoopTree::check_safepts In-Reply-To: <5-8UtfPNSyYrXf274GWDUs34oVWLaH6JEG_38Dr4eUQ=.501fd079-2e90-4ba7-899c-99cf1808201a@github.com> References: <5-8UtfPNSyYrXf274GWDUs34oVWLaH6JEG_38Dr4eUQ=.501fd079-2e90-4ba7-899c-99cf1808201a@github.com> Message-ID: <_nUosJHV6UacOU7_YzbZm6EGv66cQsDtYz-iWw8pauY=.b9888867-afc6-4a3a-9ddd-7b762a394ece@github.com> On Fri, 16 Dec 2022 09:57:35 GMT, Emanuel Peter wrote: > **Context** > During parsing, we insert SafePoints if we jump from higher to
lower bci (`maybe_add_safepoint` is called for every if, goto etc). > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/parse.hpp#L490-L494 > Generally, this alligns with backedges: the assumption is that the loop-head sits at the smallest bci of all blocks in the loop. So every jump back to the loop-head goes from higher to lower bci, hence we place a SafePoint just before the jump. > > Also: the first `build_loop_tree` may not attach an infinite loop to the loop-tree. If during the same loop-opts-phase we go to `beautify_loops` and it requires us rebuilding the loop-tree (eg because some other loop did `merge_many_backedges`), we call `build_loop_tree` again, and this time around we do detect the infinite loop (it now has a NeverBranch exit, so it is attached because of that). > Afterwards, we call `IdealLoopTree::check_safepts`, which tries to find SafePoints on all backedges. Normally, we have SafePoints on all backedges, just before we go back to the head. > > **Problem case** > My jasm fuzzer produced some infinite loops that have the following form: > The loop head is not at the smallest bci (bytecode index) of all blocks in the loop. So the SafePoints are placed somewhere in the body of the loop, just before an if branches into the two backedges. Because this is an infinite loop, it is only attached to the loop-tree in `build_loop_tree` after `beautify_loops`, so the two backedges were not merged. > When we call `IdealLoopTree::check_safepts`, we start with the inner loop, where we find the SafePoint above the if. Then we go to the outer loop. We don't find a SafePoint before we find the inner body. Now we decide to skip the inner body (which implies skipping the SafePoint in the body). The code assumes after skipping the inner loop, we are still in the outer loop. This is not true, because inner and outer loop have the same loop head (the backedges were not merged). 
We trigger an assert that checks that we are still in the outer loop (`nested loop`). > > Why did we not find this earlier? > We have not extensively tested infinite loops before. Also, we have not tested loops with loop-heads that are not at the smallest bci of the loop. However, with my bytecode fuzzer I can find these issues. It is also more likely with irreducible loops: there at least one loop-entry cannot be at the smallest bci. Irreducible loops are not processed by `maybe_add_safepoint`, but once it only has a singe entry, it is not irreducible any more, and so it can happen that a loop-entry becomes loop head that does not have the smallest bci. > > **Solution** > We could fix `maybe_add_safepoint` to not depend on bci, but rather the loop-tree from `ciTypeFlow`. That would be complex, and risky. That is not justified just for infinite loops, and even infinite loops where the loop head is not at the lowest bci. > > I decided to simply special case infinite loops. I detect if we have an outer loop with the same head as an inner loop. This should not happen, as we must have merged those backedges. Except if it is an infinite loop: We can break the scan, as we have already reached the loop's head. Nice analysis! That looks reasonable to go with this simpler fix instead of fixing `maybe_add_safepoint()` given that this case is rare. src/hotspot/share/opto/loopnode.cpp line 3645: > 3643: break; > 3644: } > 3645: n = nlpt->_head; Just a small detail, I would move this above the new `if` such that you can use `n` in the check. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11706 From smonteith at openjdk.org Fri Dec 16 11:54:50 2022 From: smonteith at openjdk.org (Stuart Monteith) Date: Fri, 16 Dec 2022 11:54:50 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: <7LyfJnbqznuJcRX0y0-BxH_QaUh7IMLXPwsVSRWuScg=.6e52f634-6e4a-45f0-9a50-8cd1d044ca1f@github.com> References: <2iWMACcRXODgZh4RMCaJgucokFeMmUaeYEzfWLDTUc4=.a7aa150d-67ed-4997-98ac-dab07f220591@github.com> <7LyfJnbqznuJcRX0y0-BxH_QaUh7IMLXPwsVSRWuScg=.6e52f634-6e4a-45f0-9a50-8cd1d044ca1f@github.com> Message-ID: On Tue, 1 Nov 2022 14:31:38 GMT, Andrew Haley wrote: >> @theRealAph just means the cases for op_ExpandBits/op_CompressBits. >> aarch64_sve.ad was merged with aarch64_neon.ad into aarch64_vector.ad, but while I'm using SVE instructions, I'm not using SVE types. match_rule_supported_vector isn't called by the scalar compiler code for intrinsics, and so we'd be deviating further from the common code. I'm reluctant to move the rules into aarch64_vector.ad file, as that would separate it from the match_rule_supported code that enables it - placing it among vector code doesn't really match the intent behind the code using vector instructions to perform scalar operations. > > Hmm, interesting. Seems a bit odd, but OK. I guess I have to admit that a bunch of code that's not vectors uses the vector uint. The code is similar to what was done with popcountI/L in aarch64.ad . My next patch will improve things somewhat by loading constants directly into the vector/floating point registers, rather than going through the rigmarole of going via a GPR. 
------------- PR: https://git.openjdk.org/jdk/pull/10537 From thartmann at openjdk.org Fri Dec 16 12:17:53 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 16 Dec 2022 12:17:53 GMT Subject: RFR: JDK-8298813: [C2] Converting double to float cause a loss of precision and resulting crypto.aes scores fluctuate In-Reply-To: References: Message-ID: <-ngkEj9A4FB4nrV_c9UTCMbcakgfkHUpKBn26dED7kU=.2f0f9008-d61f-4c5d-8d0d-30c2a332fe0e@github.com> On Thu, 15 Dec 2022 02:52:51 GMT, SUN Guoyun wrote: > Hi all, > For C2, convert double to float cause a loss of precision, > >

> ./chaitin.cpp:221
> _high_frequency_lrg = MIN2(double(OPTO_LRG_HIGH_FREQ), _cfg.get_outer_loop_frequency());
> 
> > Here, _high_frequency_lrg is of type float, so the assignment may lose precision. When it is used: > >

> ./coalesce.cpp:379
> if( lrg._maxfreq >= _phc.high_frequency_lrg() ) {
>    ...
> }
> 
> > Here, lrg._maxfreq is of type double, so _high_frequency_lrg is converted back to double for the comparison. Because _high_frequency_lrg has already lost precision, the condition may evaluate to either true or false. > > I tested two cases with SPECjvm2008 crypto.aes. > case 1: >

> //chaitin.cpp:221
> // fcvt.s.d $f0,$f0 #double->float
> d = 16.994714324523816
> f = 16.9947147
> 
> //coalesce.cpp:379
> // fcvt.d.s $f0,$f0 #float->double
> // fcmp.sle.d $fcc2,$f0,$f1
> (gdb) i r fa0
> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.994714736938477}
> (gdb) i r fa1
> fa1 {f = 0x0, d = 0x10} {f = -7.68722312e-24, d = 16.994714324523816}
> 
> > case 2: >

> //chaitin.cpp:221
> // fcvt.s.d $f0,$f0
> d = 16.996332681816536
> f = 16.9963322
> 
> //coalesce.cpp
> // fcvt.d.s $f0,$f0
> // fcmp.sle.d $fcc2,$f0,$f1
> (gdb) i r fa0
> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.996332168579102}
> (gdb) i r fa1
> fa1 {f = 0x0, d = 0x10} {f = -1.73570044e-14, d = 16.996332681816536}
> 
> > The two cases above result in different block generation (case 2 can insert new SpillCopyNodes), so the resulting crypto.aes score fluctuates. > > This is a patch to fix this problem. Please help review it. > > Thanks, > Sun Guoyun This looks reasonable to me but I'm not an expert in that area. I'll run some performance testing for sanity and report back once it has passed. src/hotspot/share/opto/chaitin.hpp line 481: > 479: Block **_blks; // Array of blocks sorted by frequency for coalescing > 480: > 481: double _high_frequency_lrg; // Frequency at which LRG will be spilled for debug info Suggestion: double _high_frequency_lrg; // Frequency at which LRG will be spilled for debug info ------------- PR: https://git.openjdk.org/jdk/pull/11685 From epeter at openjdk.org Fri Dec 16 12:21:43 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 16 Dec 2022 12:21:43 GMT Subject: RFR: 8296412: Special case infinite loops with unmerged backedges in IdealLoopTree::check_safepts [v2] In-Reply-To: <5-8UtfPNSyYrXf274GWDUs34oVWLaH6JEG_38Dr4eUQ=.501fd079-2e90-4ba7-899c-99cf1808201a@github.com> References: <5-8UtfPNSyYrXf274GWDUs34oVWLaH6JEG_38Dr4eUQ=.501fd079-2e90-4ba7-899c-99cf1808201a@github.com> Message-ID: > **Context** > During parsing, we insert SafePoints if we jump from higher to lower bci (`maybe_add_safepoint` is called for every if, goto etc.). > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/parse.hpp#L490-L494 > Generally, this aligns with backedges: the assumption is that the loop-head sits at the smallest bci of all blocks in the loop. So every jump back to the loop-head goes from higher to lower bci, hence we place a SafePoint just before the jump. > > Also: the first `build_loop_tree` may not attach an infinite loop to the loop-tree.
If during the same loop-opts-phase we go to `beautify_loops` and it requires us rebuilding the loop-tree (eg because some other loop did `merge_many_backedges`), we call `build_loop_tree` again, and this time around we do detect the infinite loop (it now has a NeverBranch exit, so it is attached because of that). > Afterwards, we call `IdealLoopTree::check_safepts`, which tries to find SafePoints on all backedges. Normally, we have SafePoints on all backedges, just before we go back to the head. > > **Problem case** > My jasm fuzzer produced some infinite loops that have the following form: > The loop head is not at the smallest bci (bytecode index) of all blocks in the loop. So the SafePoints are placed somewhere in the body of the loop, just before an if branches into the two backedges. Because this is an infinite loop, it is only attached to the loop-tree in `build_loop_tree` after `beautify_loops`, so the two backedges were not merged. > When we call `IdealLoopTree::check_safepts`, we start with the inner loop, where we find the SafePoint above the if. Then we go to the outer loop. We don't find a SafePoint before we find the inner body. Now we decide to skip the inner body (which implies skipping the SafePoint in the body). The code assumes after skipping the inner loop, we are still in the outer loop. This is not true, because inner and outer loop have the same loop head (the backedges were not merged). We trigger an assert that checks that we are still in the outer loop (`nested loop`). > > Why did we not find this earlier? > We have not extensively tested infinite loops before. Also, we have not tested loops with loop-heads that are not at the smallest bci of the loop. However, with my bytecode fuzzer I can find these issues. It is also more likely with irreducible loops: there at least one loop-entry cannot be at the smallest bci. 
Irreducible loops are not processed by `maybe_add_safepoint`, but once it only has a singe entry, it is not irreducible any more, and so it can happen that a loop-entry becomes loop head that does not have the smallest bci. > > **Solution** > We could fix `maybe_add_safepoint` to not depend on bci, but rather the loop-tree from `ciTypeFlow`. That would be complex, and risky. That is not justified just for infinite loops, and even infinite loops where the loop head is not at the lowest bci. > > I decided to simply special case infinite loops. I detect if we have an outer loop with the same head as an inner loop. This should not happen, as we must have merged those backedges. Except if it is an infinite loop: We can break the scan, as we have already reached the loop's head. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Christian's review suggestion ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11706/files - new: https://git.openjdk.org/jdk/pull/11706/files/782c2cfa..2a02b338 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11706&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11706&range=00-01 Stats: 3 lines in 1 file changed: 1 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11706.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11706/head:pull/11706 PR: https://git.openjdk.org/jdk/pull/11706 From epeter at openjdk.org Fri Dec 16 12:21:45 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 16 Dec 2022 12:21:45 GMT Subject: RFR: 8296412: Special case infinite loops with unmerged backedges in IdealLoopTree::check_safepts [v2] In-Reply-To: <_nUosJHV6UacOU7_YzbZm6EGv66cQsDtYz-iWw8pauY=.b9888867-afc6-4a3a-9ddd-7b762a394ece@github.com> References: <5-8UtfPNSyYrXf274GWDUs34oVWLaH6JEG_38Dr4eUQ=.501fd079-2e90-4ba7-899c-99cf1808201a@github.com> <_nUosJHV6UacOU7_YzbZm6EGv66cQsDtYz-iWw8pauY=.b9888867-afc6-4a3a-9ddd-7b762a394ece@github.com> Message-ID: On 
Fri, 16 Dec 2022 11:28:26 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Christian's review suggestion > > src/hotspot/share/opto/loopnode.cpp line 3645: > >> 3643: break; >> 3644: } >> 3645: n = nlpt->_head; > > Just a small detail, I would move this above the new `if` such that you can use `n` in the check. ? ------------- PR: https://git.openjdk.org/jdk/pull/11706 From jvernee at openjdk.org Fri Dec 16 12:24:49 2022 From: jvernee at openjdk.org (Jorn Vernee) Date: Fri, 16 Dec 2022 12:24:49 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v2] In-Reply-To: References: <7udKKpQ1XY438VTwGKXFuUoiRC4-PRiEQOxLbpEdZV8=.a4b60b9f-90f7-4dde-aaa9-90d3658f1ded@github.com> Message-ID: On Fri, 16 Dec 2022 02:54:14 GMT, Kim Barrett wrote: >> src/hotspot/share/code/relocInfo.hpp line 859: >> >>> 857: // We never heap allocate a Relocation, so never delete through a base pointer. >>> 858: // RelocationHolder depends on (and verifies) the destructor for all relocation >>> 859: // types is trivial, so can't be virtual. >> >> Should this be: >> Suggestion: >> >> // types is trivial, so can be non-virtual. >> >> ? > > We have a requirement that the derived classes have trivial destructors (so we know it's safe to just construct over the storage without first destructing the old object). Hence this destructor must *not* be virtual, as that would make it and destructors for derived classes not trivial. I expanded the comment a bit to clarify that. Ah, ok. Thanks for explaining. 
------------- PR: https://git.openjdk.org/jdk/pull/11618 From chagedorn at openjdk.org Fri Dec 16 12:28:49 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 16 Dec 2022 12:28:49 GMT Subject: RFR: 8296412: Special case infinite loops with unmerged backedges in IdealLoopTree::check_safepts [v2] In-Reply-To: References: <5-8UtfPNSyYrXf274GWDUs34oVWLaH6JEG_38Dr4eUQ=.501fd079-2e90-4ba7-899c-99cf1808201a@github.com> Message-ID: On Fri, 16 Dec 2022 12:21:43 GMT, Emanuel Peter wrote: >> **Context** >> During parsing, we insert SafePoints if we jump from higher to lower bci (`maybe_add_safepoint` is called for every if, goto etc). >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/parse.hpp#L490-L494 >> Generally, this alligns with backedges: the assumption is that the loop-head sits at the smallest bci of all blocks in the loop. So every jump back to the loop-head goes from higher to lower bci, hence we place a SafePoint just before the jump. >> >> Also: the first `build_loop_tree` may not attach an infinite loop to the loop-tree. If during the same loop-opts-phase we go to `beautify_loops` and it requires us rebuilding the loop-tree (eg because some other loop did `merge_many_backedges`), we call `build_loop_tree` again, and this time around we do detect the infinite loop (it now has a NeverBranch exit, so it is attached because of that). >> Afterwards, we call `IdealLoopTree::check_safepts`, which tries to find SafePoints on all backedges. Normally, we have SafePoints on all backedges, just before we go back to the head. >> >> **Problem case** >> My jasm fuzzer produced some infinite loops that have the following form: >> The loop head is not at the smallest bci (bytecode index) of all blocks in the loop. So the SafePoints are placed somewhere in the body of the loop, just before an if branches into the two backedges. 
Because this is an infinite loop, it is only attached to the loop-tree in `build_loop_tree` after `beautify_loops`, so the two backedges were not merged. >> When we call `IdealLoopTree::check_safepts`, we start with the inner loop, where we find the SafePoint above the if. Then we go to the outer loop. We don't find a SafePoint before we find the inner body. Now we decide to skip the inner body (which implies skipping the SafePoint in the body). The code assumes after skipping the inner loop, we are still in the outer loop. This is not true, because inner and outer loop have the same loop head (the backedges were not merged). We trigger an assert that checks that we are still in the outer loop (`nested loop`). >> >> Why did we not find this earlier? >> We have not extensively tested infinite loops before. Also, we have not tested loops with loop-heads that are not at the smallest bci of the loop. However, with my bytecode fuzzer I can find these issues. It is also more likely with irreducible loops: there at least one loop-entry cannot be at the smallest bci. Irreducible loops are not processed by `maybe_add_safepoint`, but once it only has a singe entry, it is not irreducible any more, and so it can happen that a loop-entry becomes loop head that does not have the smallest bci. >> >> **Solution** >> We could fix `maybe_add_safepoint` to not depend on bci, but rather the loop-tree from `ciTypeFlow`. That would be complex, and risky. That is not justified just for infinite loops, and even infinite loops where the loop head is not at the lowest bci. >> >> I decided to simply special case infinite loops. I detect if we have an outer loop with the same head as an inner loop. This should not happen, as we must have merged those backedges. Except if it is an infinite loop: We can break the scan, as we have already reached the loop's head. 
> > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Christian's review suggestion Marked as reviewed by chagedorn (Reviewer). src/hotspot/share/opto/loopnode.cpp line 3636: > 3634: n = nlpt->_head; > 3635: if (_head == n) { > 3636: // this and nlpt (inner loop) have the same loop head. This should not happen because Suggestion: // this and n (inner loop) have the same loop head. This should not happen because ------------- PR: https://git.openjdk.org/jdk/pull/11706 From jvernee at openjdk.org Fri Dec 16 12:37:05 2022 From: jvernee at openjdk.org (Jorn Vernee) Date: Fri, 16 Dec 2022 12:37:05 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v5] In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 06:00:39 GMT, Kim Barrett wrote: >> Please review this change to construction and copying of the Relocation and >> RelocationHolder classes, to eliminate some questionable C++ usage. >> >> The array type for RelocationHandle::_relocbuf is changed from void* to char, >> because using a char array for raw memory is countenanced by the standard, >> while not so much for an array of void*. The desired alignment is maintained >> via a union, since using alignas is not (yet) permitted in HotSpot code. >> >> There is also now a comment discussing the use of _relocbuf in more detail, >> including some areas of continued sketchiness wrto standard conformance and >> reliance on implementation dependent behavior. >> >> No longer use trivial copy and assignment for RelocationHolder, since that >> isn't technically correct. The Relocation in the holder is not trivially >> copyable, since it is polymorphic. It seemed to work in practice with the >> supported compilers, but we shouldn't (and don't need to) rely on it. Instead >> we have a new virtual function Relocation::copy_into that copies the most >> derived object into the holder's _relocbuf using placement new. 
>> >> Eliminated the implict conversion constructor from Relocation to holder that >> wordwise copied (to possibly beyond the end of) the Relocation into the >> holder's _relocbuf. We could have implemented this more carefully with the >> new approach (using copy_into), but we don't actually need this conversion. >> The only use of it was actually a wasted copy (in assembler_x86.cpp). >> >> Eliminated the use of placement new syntax via operator new with a holder >> argument to copy a Resource object into a holder. This included runtime >> verification that the size of the object is not greater than the size of >> _relocbuf; we now do corresponding verification at compile-time. This also >> included an incorrect attempt at a runtime check that the Relocation base >> class would be at the same address as the derived class being constructed; we >> now perform that check correctly. We also discuss in a comment the layout >> assumption being made (that isn't required by the standard but is provided by >> all supported compilers), and what to do if we encounter a compiler that >> behaves differently. >> >> Eliminated the idiom of making a default-constructed holder and then >> overwriting its held relocation with a newly constructed one, using the afore >> mentioned (and eliminated) operator new. Instead, RelocationHolder now has a >> factory function template (construct) for creating holders with a given >> relocation type, constructed using provided arguments. (The arguments are taken >> as const-ref rather than using perfect forwarding, as the tools for the latter >> are not (yet) approved for use in HotSpot. Using const-ref is good enough in >> this case.) >> >> Describe and verify other assumptions being made, such as all Relocation >> classes being trivially destructible. >> >> Testing: >> mach5 tier1-5 >> >> Future work: >> >> * RelocationHolder::reloc isn't const-correct. Making it so will require >> adjustment of some callers. 
I'll follow up with an RFE to address this. >> >> * Relocation classes have many virtual function overrides that are unmarked. >> I'll follow up with an RFE to add "override" specifiers. >> >> Potential issue: The removal of RelocationHolder(Relocation*) might not work >> for some platforms. I've tested on platforms supported by Oracle (where there >> was only one (mistaken) use). There might be uses by other platforms. > > Kim Barrett has updated the pull request incrementally with one additional commit since the last revision: > > forgot to update comment Marked as reviewed by jvernee (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11618 From thartmann at openjdk.org Fri Dec 16 12:39:54 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 16 Dec 2022 12:39:54 GMT Subject: RFR: 8297724: Loop strip mining prevents some empty loops from being eliminated In-Reply-To: <0h64UphIm15r5MArLZeVhafwWGb6iXMsNreW9lqai3A=.8d4955e7-cfc9-4e6b-a4b4-72bd1f051c9f@github.com> References: <0h64UphIm15r5MArLZeVhafwWGb6iXMsNreW9lqai3A=.8d4955e7-cfc9-4e6b-a4b4-72bd1f051c9f@github.com> Message-ID: On Thu, 15 Dec 2022 16:43:07 GMT, Roland Westrelin wrote: > When an empty loop is found, it's removed and as a consequence the > outer strip mine loop and the safepoint that it contains are also > removed. A counted loop is empty if it has the minimum number of nodes > that a well formed counted loop contains. In some cases, the loop has > extra nodes and the safepoint in the outer loop is the only node that > keeps those extra nodes alive. If the safepoint was to be removed, > then the counted loop would have the minimum number of nodes and be > considered empty. But the safepoint can't be removed until the loop is > considered empty which only happens if it has the minimum of nodes. As > a result, these loops are not removed. 
Note that now that the loop > strip mining loop nest is constructed even if UseCountedLoopSafepoints > is false, there's a regression where some loops used to be removed as > empty before but not anymore. > > The fix I propose is to extend IdealLoopTree::do_remove_empty_loop() > so it handles those cases. If it encounters a loop with no flow > control in the loop body but a number of nodes greater than the > minimum number of nodes, it starts from the extra nodes in the loop > body and follows uses until it finds a side effect, ignoring the > safepoint of the outer loop. If it finds none, then the extra nodes > can be removed and the loop is empty. This also works if the extra > nodes are kept alive by the safepoints of 2 different counted loops > and one can only be proven empty if the other one is as well (and the > other one proven empty if the first one is) and should work even if > there are more than 2 nodes involved.. Looks good to me. I'm running some testing and will report back once it passed. src/hotspot/share/opto/loopTransform.cpp line 3598: > 3596: CountedLoopNode *cl = _head->as_CountedLoop(); > 3597: #ifdef ASSERT > 3598: // Call collect_loop_core_nodes to exercise the assert that check that it finds the right number of nodes Suggestion: // Call collect_loop_core_nodes to exercise the assert that checks that it finds the right number of nodes test/hotspot/jtreg/compiler/c2/irTests/TestLSMMissedEmptyLoop.java line 30: > 28: /* > 29: * @test > 30: * @library /test/lib / Suggestion: * @test * @bug 8297724 * @library /test/lib / ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11699 From duke at openjdk.org Fri Dec 16 12:51:48 2022 From: duke at openjdk.org (Daniel Skantz) Date: Fri, 16 Dec 2022 12:51:48 GMT Subject: RFR: 8298632: [TESTBUG] Add IR checks in jtreg vectorization tests [v2] In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 09:05:39 GMT, Pengfei Li wrote: >> In JDK-8183390, we introduced a new auto-vectorization testing framework in hotspot jtreg tests. The new framework provides us a much simpler way to add new test cases of auto-vectorizable loops. But the previous patch just added a check of the correctness of the result after vectorization. There is no check about whether the code is vectorized or not. >> >> This patch adds IR checks to verify C2's vectorization ability on test cases inside this framework. With this patch, each test method annotated with `@Test` is verified in two ways. First, it's invoked twice and the return results from the interpreter and C2 compiled code are compared. Second, the count of expected vector IR is checked by the IR framework if the test method has IR rule annotation. >> >> Ideally, we should check IR rules on all platforms. But in practice, the vectorization ability can be quite different on different platforms, or different generations of CPUs of one platform. So in this patch, we only add vectorizable checks for AArch64. Checks for other platforms (such as x86) can still be added later with more CPU feature conditions. We also add some negative rules (or in-vectorizable rules) with `@IR(failOn=...` on cases that should not be vectorized on any platform. >> >> A few more test cases are added within this patch as well. >> >> We tested the new IR rules on below kinds of CPUs. 
>> - AArch64 w/ 512-bit SVE >> - AArch64 w/ 128-bit SVE >> - AArch64 w/o SVE (NEON only) >> - x86 > > Pengfei Li has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments If we want`UseAVX` or other platform-specific flags in annotations, maybe also need [JDK-8297490](https://bugs.openjdk.org/browse/JDK-8297490) ------------- PR: https://git.openjdk.org/jdk/pull/11687 From roland at openjdk.org Fri Dec 16 14:07:14 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 16 Dec 2022 14:07:14 GMT Subject: RFR: 8297582: C2: very slow compilation due to type system verification code [v3] In-Reply-To: References: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> Message-ID: On Thu, 15 Dec 2022 19:38:47 GMT, Vladimir Kozlov wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> Vladimir's review > > Nice. I will submit testing. > > You need second review. Thanks for the suggestions @vnkozlov @iwanowww. The new commit should address them. 
------------- PR: https://git.openjdk.org/jdk/pull/11673 From roland at openjdk.org Fri Dec 16 14:07:14 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 16 Dec 2022 14:07:14 GMT Subject: RFR: 8297582: C2: very slow compilation due to type system verification code [v4] In-Reply-To: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> References: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> Message-ID: > The problem here is that for arrays, verification code computes a > number of meet operations that grows exponentially with the number of > dimensions while the number of unique meet operations that need to be > computed is a linear function of the number of dimensions: > > > // With verification code, the meet of A and B causes the computation of: > // 1- meet(A, B) > // 2- meet(B, A) > // 3- meet(dual(meet(A, B)), dual(A)) > // 4- meet(dual(meet(A, B)), dual(B)) > // 5- meet(dual(A), dual(B)) > // 6- meet(dual(B), dual(A)) > // 7- meet(dual(meet(dual(A), dual(B))), A) > // 8- meet(dual(meet(dual(A), dual(B))), B) > // > // In addition the meet of A[] and B[] requires the computation of the meet of A and B. > // > // The meet of A[] and B[] triggers the computation of: > // 1- meet(A[], B[]) > // 1.1- meet(A, B) > // 1.2- meet(B, A) > // 1.3- meet(dual(meet(A, B)), dual(A)) > // 1.4- meet(dual(meet(A, B)), dual(B)) > // 1.5- meet(dual(A), dual(B)) > // 1.6- meet(dual(B), dual(A)) > // 1.7- meet(dual(meet(dual(A), dual(B))), A) > // 1.8- meet(dual(meet(dual(A), dual(B))), B) > // 2- meet(B[], A[]) > // 2.1- meet(B, A) = 1.2 > // 2.2- meet(A, B) = 1.1 > // 2.3- meet(dual(meet(B, A)), dual(B)) = 1.4 > // 2.4- meet(dual(meet(B, A)), dual(A)) = 1.3 > // 2.5- meet(dual(B), dual(A)) = 1.6 > // 2.6- meet(dual(A), dual(B)) = 1.5 > // 2.7- meet(dual(meet(dual(B), dual(A))), B) = 1.8 > // 2.8- meet(dual(meet(dual(B), dual(A))), A) = 1.7 > // etc.
> > > > There are a lot of redundant computations being performed. The fix I > propose is simply to cache the result of meet computations. So when > the type system code is called to compute, for instance, the meet of > A[][] and B[][], the cache starts empty. Then as the meet computations > proceed, the cache is filled with the meet results for A[] and B[], > A and B, etc. Once the type system code returns with the result > for A[][] and B[][], the cache is cleared. > > With this, the test case I added goes from "never seems to finish" > to "completes in no time". Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11673/files - new: https://git.openjdk.org/jdk/pull/11673/files/0794dacb..b28fde54 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11673&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11673&range=02-03 Stats: 40 lines in 2 files changed: 15 ins; 9 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/11673.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11673/head:pull/11673 PR: https://git.openjdk.org/jdk/pull/11673 From roland at openjdk.org Fri Dec 16 14:07:17 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 16 Dec 2022 14:07:17 GMT Subject: RFR: 8297724: Loop strip mining prevents some empty loops from being eliminated [v2] In-Reply-To: <0h64UphIm15r5MArLZeVhafwWGb6iXMsNreW9lqai3A=.8d4955e7-cfc9-4e6b-a4b4-72bd1f051c9f@github.com> References: <0h64UphIm15r5MArLZeVhafwWGb6iXMsNreW9lqai3A=.8d4955e7-cfc9-4e6b-a4b4-72bd1f051c9f@github.com> Message-ID: > When an empty loop is found, it's removed and as a consequence the > outer strip mine loop and the safepoint that it contains are also > removed. A counted loop is empty if it has the minimum number of nodes > that a well-formed counted loop contains.
In some cases, the loop has > extra nodes and the safepoint in the outer loop is the only node that > keeps those extra nodes alive. If the safepoint were to be removed, > then the counted loop would have the minimum number of nodes and be > considered empty. But the safepoint can't be removed until the loop is > considered empty, which only happens if it has the minimum number of nodes. As > a result, these loops are not removed. Note that now that the loop > strip mining loop nest is constructed even if UseCountedLoopSafepoints > is false, there's a regression where some loops used to be removed as > empty before but not anymore. > > The fix I propose is to extend IdealLoopTree::do_remove_empty_loop() > so it handles those cases. If it encounters a loop with no flow > control in the loop body but a number of nodes greater than the > minimum number of nodes, it starts from the extra nodes in the loop > body and follows uses until it finds a side effect, ignoring the > safepoint of the outer loop. If it finds none, then the extra nodes > can be removed and the loop is empty. This also works if the extra > nodes are kept alive by the safepoints of 2 different counted loops > and one can only be proven empty if the other one is as well (and the > other one proven empty if the first one is) and should work even if > there are more than 2 nodes involved.
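The "follow uses until a side effect is found" step described above can be sketched as a small worklist traversal. This is only an illustration under simplifying assumptions: `Node`, `has_side_effect` and `extra_nodes_are_dead` are hypothetical names, not C2's classes or the code in `loopTransform.cpp`, and the sketch handles a single ignored safepoint rather than the mutually dependent loops mentioned in the description.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical, heavily simplified stand-in for a C2 IR node.
struct Node {
  bool has_side_effect;      // e.g. a store, a call, or a safepoint
  std::vector<Node*> uses;   // nodes that consume this node's value
};

// Returns true if no node in `extra` reaches a side effect through its
// uses, other than through `ignored_safepoint` (the outer loop's safepoint).
bool extra_nodes_are_dead(const std::vector<Node*>& extra,
                          const Node* ignored_safepoint) {
  std::vector<Node*> worklist(extra.begin(), extra.end());
  std::set<const Node*> visited;
  while (!worklist.empty()) {
    Node* n = worklist.back();
    worklist.pop_back();
    if (n == ignored_safepoint) continue;      // does not count as a use
    if (!visited.insert(n).second) continue;   // already processed
    if (n->has_side_effect) return false;      // value is observable
    for (Node* use : n->uses) worklist.push_back(use);
  }
  return true;
}
```

In this model, a chain of pure nodes whose only consumer is the ignored safepoint is reported as dead, while adding any other side-effecting use makes it live again.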
Roland Westrelin has updated the pull request incrementally with two additional commits since the last revision: - Update test/hotspot/jtreg/compiler/c2/irTests/TestLSMMissedEmptyLoop.java Co-authored-by: Tobias Hartmann - Update src/hotspot/share/opto/loopTransform.cpp Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11699/files - new: https://git.openjdk.org/jdk/pull/11699/files/76fd3598..abcf9883 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11699&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11699&range=00-01 Stats: 2 lines in 2 files changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11699.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11699/head:pull/11699 PR: https://git.openjdk.org/jdk/pull/11699 From roland at openjdk.org Fri Dec 16 14:15:53 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 16 Dec 2022 14:15:53 GMT Subject: [jdk20] RFR: 8298919: Add a regression test for JDK-8298520 In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 09:36:44 GMT, Tobias Hartmann wrote: > The fix for [JDK-8298520](https://bugs.openjdk.org/browse/JDK-8298520) does not include a regression test. In the meantime, the JavaFuzzer found one. Let's add a simplified version of it. > > Thanks, > Tobias Looks good to me. ------------- Marked as reviewed by roland (Reviewer). 
PR: https://git.openjdk.org/jdk20/pull/44 From roland at openjdk.org Fri Dec 16 14:12:23 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 16 Dec 2022 14:12:23 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v5] In-Reply-To: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: <8NaxcHEYqpf0nKIFLEGq8Kx7A7rfbky66puZXQnt9Pk=.6d81b248-2eec-458f-9080-96a87efad420@github.com> > This PR re-does 6312651 (Compiler should only use verified interface > types for optimization) with a couple fixes I had pushed afterward > (8297556 and 8297343) and fixes for some other issues. > > The trickiest one is a fix for 8297345 (C2: SIGSEGV in > PhaseIdealLoop::push_pinned_nodes_thru_region) for which I added a > test case. The crash occurs because a (If (Bool (CmpP (LoadKlass ..)))) > only has a single projection. It lost the other projection because of > a CheckCastPP that becomes top. Initially the pattern is, in pseudo > code,: > > if (obj.klass == some_class) { > obj = CheckCastPP#1(obj); > } > > obj itself is a CheckCastPP that's pinned at a dominating if. That > dominating if goes through split through phi. The LoadKlass for the > pseudo code above also has control set to the dominating if being > transformed. This result in: > > if (phi1 == some_class) { > obj = CheckCastPP#1(phi2); > } > > with phi1 = (Phi (LoadKlass obj) (LoadKlass obj)) and phi2 = (Phi obj obj) > with obj = (CheckCastPP#2 obj') > > PhiNode::Ideal() transforms phi2 into a new CheckCastPP: > (CheckCastPP#3 obj' obj') with control set to the region right above > the if in the pseudo code above. There happens to be another > CheckCastPP at the same control which casts obj' to a narrower > type. 
So the new CheckCastPP#3 is replaced by that one (because of > ConstraintCastNode::dominating_cast())and pseudo code becomes: > > if (phi1 == some_class) { > obj = CheckCastPP#1(CheckCastPP#4(obj')); > } > > and then: > > if (phi1 == some_class) { > obj = top; > } > > because the types of the 2 CheckCastPPs conflict. That would be ok if: > > phi1 == some_class > > would constant fold. It would if the test was: > > if (CheckCastPP#4(obj').klass == some_klass) { > > but because of split if, the (CmpP (LoadKlass ..)) and the > CheckCastPP#1 ended up with 2 different object inputs that then were > transformed differently. The fix I propose is to have split if clone the entire: > > (Bool (CmpP (LoadKlass (AddP ..)))) > > down the same way (Bool (CmpP ..)) is cloned down. After split if, the > pseudo code becomes: > > if (phi.klass == some_class) { > obj = CheckCastPP#1(phi); > } > > The bug can't occur because the CheckCastPP and (CmpP (LoadKlass ..)) > operate on the same phi input. The change in split_if.cpp implements > that. > > The other fixes are: > > - arraycopynode.cpp: a crash happens because dest_offset and > src_offset are the same. The call to transform that results in > src_scale, causes src_offset (and thus dest_offset) to become > dead. The fix is to add a hook node to preserve dest_offset. This is > unrelated to 6312651 but it triggers with that change for some > reason. > > - castnode.cpp: I removed CheckCastPPNode::Identity(), a piece of code > that the change in the handling of interfaces make obsolete and that > I missed in the PR for 6312651. > > - castnode.cpp: the change in CheckCastPPNode::Value() fixes a rare > assert when during CCP, Value() is called with an input raw constant > ptr. > > - type.cpp: a _klass = NULL field in arrays used to indicate only top > or bottom but I changed that so _klass is only guaranteed non null > for basic type arrays. 
The fix in type.cpp updates a piece of code > that I didn't adapt to the new meaning of _klass = NULL. > > - the other changes are due to StressReflectiveCode. With 6312651, a > CheckCastPP can fold to top if it sees a type for its input that > conflicts with its own type. That wasn't the case before. So if a > type check fails, a CheckCastPP will fold to top and the control > flow branch it's in must die. That doesn't always happen with > StressReflectiveCode: the CheckCastPP folds but not the control flow > path. With ExpandSubTypeCheckAtParseTime on, that's because of a > code path in LoadNode::Value() that's disabled with > StressReflectiveCode. With ExpandSubTypeCheckAtParseTime off, it's > because Compile::static_subtype_check() is always pessimistic with > StressReflectiveCode but it's used by SubTypeCheckNode::sub() to > find when a node can constant fold. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: removed new line ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11666/files - new: https://git.openjdk.org/jdk/pull/11666/files/848fc8df..92b5c3ae Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11666&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11666&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11666.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11666/head:pull/11666 PR: https://git.openjdk.org/jdk/pull/11666 From thartmann at openjdk.org Fri Dec 16 14:42:56 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 16 Dec 2022 14:42:56 GMT Subject: [jdk20] Integrated: 8298919: Add a regression test for JDK-8298520 In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 09:36:44 GMT, Tobias Hartmann wrote: > The fix for [JDK-8298520](https://bugs.openjdk.org/browse/JDK-8298520) does not include a regression test. In the meantime, the JavaFuzzer found one. 
Let's add a simplified version of it. > > Thanks, > Tobias This pull request has now been integrated. Changeset: 9e10f00e Author: Tobias Hartmann URL: https://git.openjdk.org/jdk20/commit/9e10f00edbf37e5e5db8efc4f1e0c2a76541aab2 Stats: 57 lines in 1 file changed: 57 ins; 0 del; 0 mod 8298919: Add a regression test for JDK-8298520 Reviewed-by: chagedorn, roland ------------- PR: https://git.openjdk.org/jdk20/pull/44 From thartmann at openjdk.org Fri Dec 16 14:42:55 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 16 Dec 2022 14:42:55 GMT Subject: [jdk20] RFR: 8298919: Add a regression test for JDK-8298520 In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 09:36:44 GMT, Tobias Hartmann wrote: > The fix for [JDK-8298520](https://bugs.openjdk.org/browse/JDK-8298520) does not include a regression test. In the meantime, the JavaFuzzer found one. Let's add a simplified version of it. > > Thanks, > Tobias Thanks, Roland! ------------- PR: https://git.openjdk.org/jdk20/pull/44 From roland at openjdk.org Fri Dec 16 14:12:23 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 16 Dec 2022 14:12:23 GMT Subject: RFR: 8297933: [REDO] Compiler should only use verified interface types for optimization [v4] In-Reply-To: References: <0Rw6Tt5Qcs1_2X6vDWxAzC73TtRPAChXfmgpG3PDRcU=.f7679aa6-6ff9-4727-886a-84a9da22ccd5@github.com> Message-ID: On Thu, 15 Dec 2022 21:21:47 GMT, Vladimir Ivanov wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> extract interfaces->length() in ciInstanceKlass.cpp > > src/hotspot/share/ci/ciArrayKlass.hpp line 59: > >> 57: >> 58: static ciArrayKlass* make(ciType* element_type); >> 59: > > Redundant. A leftover from reverted changes? Removed. 
------------- PR: https://git.openjdk.org/jdk/pull/11666 From roland at openjdk.org Fri Dec 16 14:18:51 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 16 Dec 2022 14:18:51 GMT Subject: RFR: 8298824: C2 crash: assert(is_Bool()) failed: invalid node class: ConI In-Reply-To: References: Message-ID: <-ngCoyxo6CAvXwmrHOwYLqqacjwRWt2pF4dh9431YrQ=.a052dfc5-4883-4afe-ba25-21bd062bba93@github.com> On Fri, 16 Dec 2022 09:37:26 GMT, Christian Hagedorn wrote: > [JDK-8292289](https://bugs.openjdk.org/browse/JDK-8292289) added the following optimization to `BoolNode::Ideal()` for patterns that include `CMoveI` nodes: > > https://github.com/openjdk/jdk/blob/fa322e40b68abf0a253040d14414d41f4e01e028/src/hotspot/share/opto/subnode.cpp#L1465-L1472 > > However, we could have a `CMoveI` during IGVN that will later be folded because the `Bool` condition node was replaced by a constant but IGVN has not processed this node, yet: > > > ![Screenshot from 2022-12-16 09-40-28](https://user-images.githubusercontent.com/17833009/208068197-4819b322-604c-412e-8898-3d3546a8a663.png) > > We fail when trying to call `as_Bool()` on `28 ConI`. The fix is straightforward: additionally check whether we actually have a `BoolNode`. > > Thanks, > Christian Looks good to me. ------------- Marked as reviewed by roland (Reviewer). PR: https://git.openjdk.org/jdk/pull/11705 From ecaspole at openjdk.org Fri Dec 16 16:33:54 2022 From: ecaspole at openjdk.org (Eric Caspole) Date: Fri, 16 Dec 2022 16:33:54 GMT Subject: Integrated: 8298809: Clean up vm/compiler/InterfaceCalls JMH In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 16:05:49 GMT, Eric Caspole wrote: > I removed some confusing, less effective cases and modified and renamed some to cover what seem like the most useful cases with 1+ types and 1+ interfaces implemented in those types. Here is an example run:
>
> Benchmark                        Mode  Cnt  Score   Error  Units
> InterfaceCalls.test1stInt2Types  avgt   12  2.196 ± 0.022  ns/op
> InterfaceCalls.test1stInt3Types  avgt   12  8.259 ± 0.045  ns/op
> InterfaceCalls.test1stInt5Types  avgt   12  8.279 ± 0.024  ns/op
> InterfaceCalls.test2ndInt2Types  avgt   12  2.467 ± 0.023  ns/op
> InterfaceCalls.test2ndInt3Types  avgt   12  9.287 ± 0.032  ns/op
> InterfaceCalls.test2ndInt5Types  avgt   12  9.343 ± 0.027  ns/op
> InterfaceCalls.testMonomorphic   avgt   12  1.440 ± 0.031  ns/op
This pull request has now been integrated. Changeset: 81e23ab3 Author: Eric Caspole URL: https://git.openjdk.org/jdk/commit/81e23ab3403a983ccddf27b1169a49e2ca061296 Stats: 178 lines in 1 file changed: 2 ins; 120 del; 56 mod 8298809: Clean up vm/compiler/InterfaceCalls JMH Reviewed-by: kvn ------------- PR: https://git.openjdk.org/jdk/pull/11696 From vlivanov at openjdk.org Fri Dec 16 18:26:51 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 16 Dec 2022 18:26:51 GMT Subject: RFR: 8297582: C2: very slow compilation due to type system verification code [v4] In-Reply-To: References: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> Message-ID: On Fri, 16 Dec 2022 14:07:14 GMT, Roland Westrelin wrote: >> The problem here is that for arrays, verification code computes a >> number of meet operations that grows exponentially with the number of >> dimensions while the number of unique meet operations that need to be >> computed is a linear function of the number of dimensions: >> >> >> // With verification code, the meet of A and B causes the computation of: >> // 1- meet(A, B) >> // 2- meet(B, A) >> // 3- meet(dual(meet(A, B)), dual(A)) >> // 4- meet(dual(meet(A, B)), dual(B)) >> // 5- meet(dual(A), dual(B)) >> // 6- meet(dual(B), dual(A)) >> // 7- meet(dual(meet(dual(A), dual(B))), A) >> // 8- meet(dual(meet(dual(A), dual(B))), B) >> // >> // In addition the meet of A[] and B[] requires the computation of the meet of A and B.
>> // >> // The meet of A[] and B[] triggers the computation of: >> // 1- meet(A[], B[]) >> // 1.1- meet(A, B) >> // 1.2- meet(B, A) >> // 1.3- meet(dual(meet(A, B)), dual(A)) >> // 1.4- meet(dual(meet(A, B)), dual(B)) >> // 1.5- meet(dual(A), dual(B)) >> // 1.6- meet(dual(B), dual(A)) >> // 1.7- meet(dual(meet(dual(A), dual(B))), A) >> // 1.8- meet(dual(meet(dual(A), dual(B))), B) >> // 2- meet(B[], A[]) >> // 2.1- meet(B, A) = 1.2 >> // 2.2- meet(A, B) = 1.1 >> // 2.3- meet(dual(meet(B, A)), dual(B)) = 1.4 >> // 2.4- meet(dual(meet(B, A)), dual(A)) = 1.3 >> // 2.5- meet(dual(B), dual(A)) = 1.6 >> // 2.6- meet(dual(A), dual(B)) = 1.5 >> // 2.7- meet(dual(meet(dual(B), dual(A))), B) = 1.8 >> // 2.8- meet(dual(meet(dual(B), dual(A))), A) = 1.7 >> // etc. >> >> >> >> There are a lot of redundant computations being performed. The fix I >> propose is simply to cache the result of meet computations. So when >> the type system code is called to compute, for instance, the meet of >> A[][] and B[][], the cache starts empty. Then as the meet computations >> proceed, the cache is filled with the meet results for A[] and B[], >> A and B, etc. Once the type system code returns with the result >> for A[][] and B[][], the cache is cleared. >> >> With this, the test case I added goes from "never seems to finish" >> to "completes in no time". > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Looks good. ------------- Marked as reviewed by vlivanov (Reviewer).
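To illustrate why caching turns the exponential number of meets quoted above into a linear one, here is a toy cost model. It is a sketch under stated assumptions, not the HotSpot implementation: `MeetModel`, `raw_meet`, `verified_meet` and `top_level_meet` are hypothetical names, a "type" is reduced to its number of array dimensions, and the eight verification variants from the quoted comment are represented only by an index from 0 to 7.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Toy cost model: counts how many meets are actually computed.
struct MeetModel {
  bool use_cache;
  long raw_meets;                             // meets computed, not cache hits
  std::map<std::pair<int, int>, int> cache;   // (variant, dims) -> result

  explicit MeetModel(bool cached) : use_cache(cached), raw_meets(0) {}

  // One meet at `dims` dimensions; array meets recurse on the element type.
  int raw_meet(int variant, int dims) {
    const std::pair<int, int> key(variant, dims);
    if (use_cache) {
      auto it = cache.find(key);
      if (it != cache.end()) return it->second;
    }
    ++raw_meets;
    int result = (dims == 0) ? 0 : verified_meet(dims - 1) + 1;
    if (use_cache) cache[key] = result;
    return result;
  }

  // Verification: the real meet plus 7 symmetric/dual cross-checks.
  int verified_meet(int dims) {
    int result = raw_meet(0, dims);
    for (int v = 1; v < 8; ++v) raw_meet(v, dims);
    return result;
  }

  // The cache only lives for the duration of one top-level request.
  int top_level_meet(int dims) {
    cache.clear();
    int result = verified_meet(dims);
    cache.clear();
    return result;
  }
};
```

In this model, a two-dimensional meet costs 584 computed meets without the cache and 24 with it, that is 8 per dimension level, which matches the exponential versus linear growth the description refers to.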
PR: https://git.openjdk.org/jdk/pull/11673 From kvn at openjdk.org Fri Dec 16 19:02:50 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 16 Dec 2022 19:02:50 GMT Subject: RFR: 8297582: C2: very slow compilation due to type system verification code [v4] In-Reply-To: References: <-qBf10EHEGrATQ6Dir1rgY_RAyR1bpsBT6-L4Klp4Yo=.04550e66-b93a-4a9a-b9ec-9d1408107760@github.com> Message-ID: <9ZuyqWz9tMDDDfU_fLrclBLR3Sbz202mYMsBY4qsN78=.a449f486-674d-4774-8f34-3967ba589739@github.com> On Fri, 16 Dec 2022 14:07:14 GMT, Roland Westrelin wrote: >> The problem here is that for arrays, verification code computes a >> number of meet operations that grows exponentially with the number of >> dimensions while the number of unique meet operations that need to be >> computed is a linear function of the number of dimensions: >> >> >> // With verification code, the meet of A and B causes the computation of: >> // 1- meet(A, B) >> // 2- meet(B, A) >> // 3- meet(dual(meet(A, B)), dual(A)) >> // 4- meet(dual(meet(A, B)), dual(B)) >> // 5- meet(dual(A), dual(B)) >> // 6- meet(dual(B), dual(A)) >> // 7- meet(dual(meet(dual(A), dual(B))), A) >> // 8- meet(dual(meet(dual(A), dual(B))), B) >> // >> // In addition the meet of A[] and B[] requires the computation of the meet of A and B. 
>> // >> // The meet of A[] and B[] triggers the computation of: >> // 1- meet(A[], B[]) >> // 1.1- meet(A, B) >> // 1.2- meet(B, A) >> // 1.3- meet(dual(meet(A, B)), dual(A)) >> // 1.4- meet(dual(meet(A, B)), dual(B)) >> // 1.5- meet(dual(A), dual(B)) >> // 1.6- meet(dual(B), dual(A)) >> // 1.7- meet(dual(meet(dual(A), dual(B))), A) >> // 1.8- meet(dual(meet(dual(A), dual(B))), B) >> // 2- meet(B[], A[]) >> // 2.1- meet(B, A) = 1.2 >> // 2.2- meet(A, B) = 1.1 >> // 2.3- meet(dual(meet(B, A)), dual(B)) = 1.4 >> // 2.4- meet(dual(meet(B, A)), dual(A)) = 1.3 >> // 2.5- meet(dual(B), dual(A)) = 1.6 >> // 2.6- meet(dual(A), dual(B)) = 1.5 >> // 2.7- meet(dual(meet(dual(B), dual(A))), B) = 1.8 >> // 2.8- meet(dual(meet(dual(B), dual(A))), A) = 1.7 >> // etc. >> >> >> >> There are a lot of redundant computations being performed. The fix I >> propose is simply to cache the result of meet computations. So when >> the type system code is called to compute, for instance, the meet of >> A[][] and B[][], the cache starts empty. Then as the meet computations >> proceed, the cache is filled with the meet results for A[] and B[], >> A and B, etc. Once the type system code returns with the result >> for A[][] and B[][], the cache is cleared. >> >> With this, the test case I added goes from "never seems to finish" >> to "completes in no time". > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11673 From kvn at openjdk.org Fri Dec 16 19:55:51 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 16 Dec 2022 19:55:51 GMT Subject: RFR: 8298632: [TESTBUG] Add IR checks in jtreg vectorization tests [v2] In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 09:05:39 GMT, Pengfei Li wrote: >> In JDK-8183390, we introduced a new auto-vectorization testing framework in hotspot jtreg tests.
The new framework provides us a much simpler way to add new test cases of auto-vectorizable loops. But the previous patch just added a check of the correctness of the result after vectorization. There is no check about whether the code is vectorized or not. >> >> This patch adds IR checks to verify C2's vectorization ability on test cases inside this framework. With this patch, each test method annotated with `@Test` is verified in two ways. First, it's invoked twice and the return results from the interpreter and C2 compiled code are compared. Second, the count of expected vector IR is checked by the IR framework if the test method has IR rule annotation. >> >> Ideally, we should check IR rules on all platforms. But in practice, the vectorization ability can be quite different on different platforms, or different generations of CPUs of one platform. So in this patch, we only add vectorizable checks for AArch64. Checks for other platforms (such as x86) can still be added later with more CPU feature conditions. We also add some negative rules (or in-vectorizable rules) with `@IR(failOn=...` on cases that should not be vectorized on any platform. >> >> A few more test cases are added within this patch as well. >> >> We tested the new IR rules on below kinds of CPUs. >> - AArch64 w/ 512-bit SVE >> - AArch64 w/ 128-bit SVE >> - AArch64 w/o SVE (NEON only) >> - x86 > > Pengfei Li has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Looks good. A second review is needed, preferably from someone familiar with x86 IR rules. ------------- Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.org/jdk/pull/11687 From kvn at openjdk.org Fri Dec 16 19:55:52 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 16 Dec 2022 19:55:52 GMT Subject: RFR: 8298632: [TESTBUG] Add IR checks in jtreg vectorization tests In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 20:23:16 GMT, Vladimir Kozlov wrote: >> In JDK-8183390, we introduced a new auto-vectorization testing framework in hotspot jtreg tests. The new framework provides us a much simpler way to add new test cases of auto-vectorizable loops. But the previous patch just added a check of the correctness of the result after vectorization. There is no check about whether the code is vectorized or not. >> >> This patch adds IR checks to verify C2's vectorization ability on test cases inside this framework. With this patch, each test method annotated with `@Test` is verified in two ways. First, it's invoked twice and the return results from the interpreter and C2 compiled code are compared. Second, the count of expected vector IR is checked by the IR framework if the test method has IR rule annotation. >> >> Ideally, we should check IR rules on all platforms. But in practice, the vectorization ability can be quite different on different platforms, or different generations of CPUs of one platform. So in this patch, we only add vectorizable checks for AArch64. Checks for other platforms (such as x86) can still be added later with more CPU feature conditions. We also add some negative rules (or in-vectorizable rules) with `@IR(failOn=...` on cases that should not be vectorized on any platform. >> >> A few more test cases are added within this patch as well. >> >> We tested the new IR rules on below kinds of CPUs. 
>> - AArch64 w/ 512-bit SVE >> - AArch64 w/ 128-bit SVE >> - AArch64 w/o SVE (NEON only) >> - x86 > > Don't forget to change `applyIfCPUFeature` to `applyIfCPUFeatureOr` > > And to run these tests with different AVX configuration we need to remove `vm.flagless` from `@requires`. `-XX:UseAVX=n` flag is supported by IR testing now. > BTW, previously I would like to add IR checks for x86 as well. But I don't have various generations of x86 machines to test so I don't have enough confidence to add them. Thanks @vnkozlov for your test effort on avx2. I would appreciate if someone from Intel (maybe @jbhateja) may help verify the x86 rules. You can do testing by using `-XX:UseSSE=n -XX:UseAVX=n` flags on modern (avx512) machine. To bypass `vm.flagless` you can `export TEST_VM_FLAGLESS=true`. That is what I did for testing. Thank you for explaining `vm.flagless` issue. In our testing environment we do run these and other Vector API and vectorization tests with different `-XX:UseSSE=n -XX:UseAVX=n` flags settings. That is why I asked about `vm.flagless`. Anyway, with added `avx` to IR testing filter we will run them on x86 and it is enough for now. ------------- PR: https://git.openjdk.org/jdk/pull/11687 From kvn at openjdk.org Fri Dec 16 20:00:51 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 16 Dec 2022 20:00:51 GMT Subject: RFR: 8297801: printnm crashes with invalid address due to null pointer dereference In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 09:53:42 GMT, Christian Hagedorn wrote: > Looks good! Maybe we could also think about printing an error message in case we passed an invalid address to that method. Yes, at least "Invalid address" line should be printed. The output does have "Executing printnm: 0x0000000000000000" already. 
------------- PR: https://git.openjdk.org/jdk/pull/11697 From kvn at openjdk.org Fri Dec 16 20:32:48 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 16 Dec 2022 20:32:48 GMT Subject: RFR: 8298824: C2 crash: assert(is_Bool()) failed: invalid node class: ConI In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 09:37:26 GMT, Christian Hagedorn wrote: > [JDK-8292289](https://bugs.openjdk.org/browse/JDK-8292289) added the following optimization to `BoolNode::Ideal()` for patterns that include `CMoveI` nodes: > > https://github.com/openjdk/jdk/blob/fa322e40b68abf0a253040d14414d41f4e01e028/src/hotspot/share/opto/subnode.cpp#L1465-L1472 > > However, we could have a `CMoveI` during IGVN that will later be folded because the `Bool` condition node was replaced by a constant but IGVN has not processed this node, yet: > > > ![Screenshot from 2022-12-16 09-40-28](https://user-images.githubusercontent.com/17833009/208068197-4819b322-604c-412e-8898-3d3546a8a663.png) > > We fail when trying to call `as_Bool()` on `28 ConI`. The fix is straightforward: additionally check whether we actually have a `BoolNode`. > > Thanks, > Christian Looks good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11705 From kvn at openjdk.org Fri Dec 16 20:43:50 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 16 Dec 2022 20:43:50 GMT Subject: RFR: 8296412: Special case infinite loops with unmerged backedges in IdealLoopTree::check_safepts [v2] In-Reply-To: References: <5-8UtfPNSyYrXf274GWDUs34oVWLaH6JEG_38Dr4eUQ=.501fd079-2e90-4ba7-899c-99cf1808201a@github.com> Message-ID: On Fri, 16 Dec 2022 12:21:43 GMT, Emanuel Peter wrote: >> **Context** >> During parsing, we insert SafePoints if we jump from higher to lower bci (`maybe_add_safepoint` is called for every if, goto etc).
>> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/parse.hpp#L490-L494 >> Generally, this aligns with backedges: the assumption is that the loop-head sits at the smallest bci of all blocks in the loop. So every jump back to the loop-head goes from higher to lower bci, hence we place a SafePoint just before the jump. >> >> Also: the first `build_loop_tree` may not attach an infinite loop to the loop-tree. If during the same loop-opts-phase we go to `beautify_loops` and it requires us to rebuild the loop-tree (e.g. because some other loop did `merge_many_backedges`), we call `build_loop_tree` again, and this time around we do detect the infinite loop (it now has a NeverBranch exit, so it is attached because of that). >> Afterwards, we call `IdealLoopTree::check_safepts`, which tries to find SafePoints on all backedges. Normally, we have SafePoints on all backedges, just before we go back to the head. >> >> **Problem case** >> My jasm fuzzer produced some infinite loops that have the following form: >> The loop head is not at the smallest bci (bytecode index) of all blocks in the loop. So the SafePoints are placed somewhere in the body of the loop, just before an if branches into the two backedges. Because this is an infinite loop, it is only attached to the loop-tree in `build_loop_tree` after `beautify_loops`, so the two backedges were not merged. >> When we call `IdealLoopTree::check_safepts`, we start with the inner loop, where we find the SafePoint above the if. Then we go to the outer loop. We don't find a SafePoint before we find the inner body. Now we decide to skip the inner body (which implies skipping the SafePoint in the body). The code assumes that, after skipping the inner loop, we are still in the outer loop. This is not true, because inner and outer loop have the same loop head (the backedges were not merged). We trigger an assert that checks that we are still in the outer loop (`nested loop`).
>> >> Why did we not find this earlier? >> We have not extensively tested infinite loops before. Also, we have not tested loops with loop-heads that are not at the smallest bci of the loop. However, with my bytecode fuzzer I can find these issues. It is also more likely with irreducible loops: there, at least one loop-entry cannot be at the smallest bci. Irreducible loops are not processed by `maybe_add_safepoint`, but once a loop only has a single entry, it is not irreducible any more, and so it can happen that a loop-entry becomes a loop head that does not have the smallest bci. >> >> **Solution** >> We could fix `maybe_add_safepoint` to not depend on bci, but rather the loop-tree from `ciTypeFlow`. That would be complex, and risky. That is not justified just for infinite loops, and even infinite loops where the loop head is not at the lowest bci. >> >> I decided to simply special case infinite loops. I detect if we have an outer loop with the same head as an inner loop. This should not happen, as we must have merged those backedges. Except if it is an infinite loop: We can break the scan, as we have already reached the loop's head. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Christian's review suggestion Looks good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11706 From kvn at openjdk.org Fri Dec 16 20:45:48 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 16 Dec 2022 20:45:48 GMT Subject: RFR: 8265688: Unused ciMethodType::ptype_at should be removed In-Reply-To: <94gNzMRxD3krKE-v_RN8PyaZoyXd0bc9213kjxVsxKY=.ff6c73b1-12f8-4a4a-bf34-550f48c9354a@github.com> References: <94gNzMRxD3krKE-v_RN8PyaZoyXd0bc9213kjxVsxKY=.ff6c73b1-12f8-4a4a-bf34-550f48c9354a@github.com> Message-ID: On Fri, 16 Dec 2022 10:28:09 GMT, Damon Fenacci wrote: > `ciMethodType::ptype_at` method is not used. > > Removing it. Good.
-------------

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/11708

From kbarrett at openjdk.org Fri Dec 16 20:53:17 2022
From: kbarrett at openjdk.org (Kim Barrett)
Date: Fri, 16 Dec 2022 20:53:17 GMT
Subject: RFR: 8160404: RelocationHolder constructors have bugs [v6]
In-Reply-To:
References:
Message-ID: <3i5hznpGX4HdZr91LhE2QWEYvtq2AEmgzevEM52yWss=.6304f74c-d707-4176-a9ed-eded00cabfe5@github.com>

> Please review this change to construction and copying of the Relocation and
> RelocationHolder classes, to eliminate some questionable C++ usage.
>
> The array type for RelocationHolder::_relocbuf is changed from void* to char,
> because using a char array for raw memory is countenanced by the standard,
> while not so much for an array of void*. The desired alignment is maintained
> via a union, since using alignas is not (yet) permitted in HotSpot code.
>
> There is also now a comment discussing the use of _relocbuf in more detail,
> including some areas of continued sketchiness with respect to standard
> conformance and reliance on implementation-dependent behavior.
>
> No longer use trivial copy and assignment for RelocationHolder, since that
> isn't technically correct. The Relocation in the holder is not trivially
> copyable, since it is polymorphic. It seemed to work in practice with the
> supported compilers, but we shouldn't (and don't need to) rely on it. Instead
> we have a new virtual function Relocation::copy_into that copies the most
> derived object into the holder's _relocbuf using placement new.
>
> Eliminated the implicit conversion constructor from Relocation to holder that
> wordwise copied (to possibly beyond the end of) the Relocation into the
> holder's _relocbuf. We could have implemented this more carefully with the
> new approach (using copy_into), but we don't actually need this conversion.
> The only use of it was actually a wasted copy (in assembler_x86.cpp).
>
> Eliminated the use of placement new syntax via operator new with a holder
> argument to copy a Resource object into a holder. This included runtime
> verification that the size of the object is not greater than the size of
> _relocbuf; we now do corresponding verification at compile-time. This also
> included an incorrect attempt at a runtime check that the Relocation base
> class would be at the same address as the derived class being constructed; we
> now perform that check correctly. We also discuss in a comment the layout
> assumption being made (that isn't required by the standard but is provided by
> all supported compilers), and what to do if we encounter a compiler that
> behaves differently.
>
> Eliminated the idiom of making a default-constructed holder and then
> overwriting its held relocation with a newly constructed one, using the
> aforementioned (and eliminated) operator new. Instead, RelocationHolder now has a
> factory function template (construct) for creating holders with a given
> relocation type, constructed using provided arguments. (The arguments are taken
> as const-ref rather than using perfect forwarding, as the tools for the latter
> are not (yet) approved for use in HotSpot. Using const-ref is good enough in
> this case.)
>
> Describe and verify other assumptions being made, such as all Relocation
> classes being trivially destructible.
>
> Testing:
> mach5 tier1-5
>
> Future work:
>
> * RelocationHolder::reloc isn't const-correct. Making it so will require
> adjustment of some callers. I'll follow up with an RFE to address this.
>
> * Relocation classes have many virtual function overrides that are unmarked.
> I'll follow up with an RFE to add "override" specifiers.
>
> Potential issue: The removal of RelocationHolder(Relocation*) might not work
> for some platforms. I've tested on platforms supported by Oracle (where there
> was only one (mistaken) use). There might be uses by other platforms.
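
For readers following the description above, here is a minimal, self-contained sketch of the general pattern it describes: a raw char buffer, a virtual `copy_into` that placement-news the most derived type, compile-time size and destructor checks, and a `construct` factory taking const-ref arguments. This is not the HotSpot code. `Holder`, `Reloc`, `OffsetReloc`, `emplace`, and the 32-byte buffer size are hypothetical names and values chosen for illustration. The sketch also uses `alignas` and keeps the pointer returned by placement new, which sidesteps the layout assumption the description discusses.

```cpp
#include <cstddef>
#include <new>
#include <type_traits>

class Holder;

// Hypothetical stand-in for a polymorphic relocation base class.
// No user-declared destructor, so the type stays trivially destructible.
class Reloc {
 public:
  virtual int kind() const { return 0; }
  // Copies the most derived object into the holder's buffer.
  virtual void copy_into(Holder& holder) const;
};

class Holder {
  static constexpr std::size_t kBufSize = 32;  // illustrative size only
  struct Uninit {};                            // tag for the factory
  explicit Holder(Uninit) : _reloc(nullptr) {}

  alignas(std::max_align_t) char _buf[kBufSize];  // raw storage for one Reloc
  Reloc* _reloc;                                  // points into _buf

 public:
  Holder() : _reloc(nullptr) { emplace<Reloc>(); }
  // Copying goes through the virtual copy_into, never through memcpy,
  // because polymorphic types are not trivially copyable.
  Holder(const Holder& from) : _reloc(nullptr) { from._reloc->copy_into(*this); }
  Holder& operator=(const Holder& from) {
    if (this != &from) from._reloc->copy_into(*this);
    return *this;
  }

  // Factory: builds a holder directly around a T, with const-ref arguments.
  template <typename T, typename... Args>
  static Holder construct(const Args&... args) {
    Holder h{Uninit()};
    h.emplace<T>(args...);
    return h;
  }

  // Constructs a T in the buffer; the checks run at compile time.
  template <typename T, typename... Args>
  void emplace(const Args&... args) {
    static_assert(std::is_base_of<Reloc, T>::value, "T must derive from Reloc");
    static_assert(sizeof(T) <= kBufSize, "T does not fit in the buffer");
    static_assert(alignof(T) <= alignof(std::max_align_t), "T is over-aligned");
    static_assert(std::is_trivially_destructible<T>::value,
                  "storage is reused without running destructors");
    _reloc = ::new (static_cast<void*>(_buf)) T(args...);
  }

  const Reloc* reloc() const { return _reloc; }
};

inline void Reloc::copy_into(Holder& holder) const {
  holder.emplace<Reloc>(*this);
}

// Hypothetical derived relocation carrying one field.
class OffsetReloc : public Reloc {
  int _offset;

 public:
  explicit OffsetReloc(int offset) : _offset(offset) {}
  int kind() const override { return 1; }
  int offset() const { return _offset; }
  void copy_into(Holder& holder) const override {
    holder.emplace<OffsetReloc>(*this);
  }
};
```

With these definitions, `Holder::construct<OffsetReloc>(40)` yields a holder whose copies each own a separate `OffsetReloc` in their own buffer, so the dynamic type and the field value survive copying and assignment.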
Kim Barrett has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 10 additional commits since the last revision:

 - another comment tweak
 - Merge branch 'master' into relocation-holder
 - more comments about trivial destructor dependency
 - forgot to update comment
 - more comments about trivial relocation destructors
 - reinstate named args per dlong review
 - use alignas
 - simplify per jvernee
 - blank lines in include blocks
 - fix constructors and assigns

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/11618/files
  - new: https://git.openjdk.org/jdk/pull/11618/files/f9c0642e..9c3ea010

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=11618&range=05
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11618&range=04-05

  Stats: 9222 lines in 391 files changed: 4307 ins; 3211 del; 1704 mod
  Patch: https://git.openjdk.org/jdk/pull/11618.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/11618/head:pull/11618

PR: https://git.openjdk.org/jdk/pull/11618

From kbarrett at openjdk.org Fri Dec 16 20:53:17 2022
From: kbarrett at openjdk.org (Kim Barrett)
Date: Fri, 16 Dec 2022 20:53:17 GMT
Subject: RFR: 8160404: RelocationHolder constructors have bugs [v5]
In-Reply-To:
References:
Message-ID:

On Fri, 16 Dec 2022 06:00:39 GMT, Kim Barrett wrote:

>> Please review this change to construction and copying of the Relocation and
>> RelocationHolder classes, to eliminate some questionable C++ usage.
>>
>> The array type for RelocationHolder::_relocbuf is changed from void* to char,
>> because using a char array for raw memory is countenanced by the standard,
>> while not so much for an array of void*. The desired alignment is maintained
>> via a union, since using alignas is not (yet) permitted in HotSpot code.
>>
>> There is also now a comment discussing the use of _relocbuf in more detail,
>> including some areas of continued sketchiness with respect to standard
>> conformance and reliance on implementation-dependent behavior.
>>
>> No longer use trivial copy and assignment for RelocationHolder, since that
>> isn't technically correct. The Relocation in the holder is not trivially
>> copyable, since it is polymorphic. It seemed to work in practice with the
>> supported compilers, but we shouldn't (and don't need to) rely on it. Instead
>> we have a new virtual function Relocation::copy_into that copies the most
>> derived object into the holder's _relocbuf using placement new.
>>
>> Eliminated the implicit conversion constructor from Relocation to holder that
>> wordwise copied (to possibly beyond the end of) the Relocation into the
>> holder's _relocbuf. We could have implemented this more carefully with the
>> new approach (using copy_into), but we don't actually need this conversion.
>> The only use of it was actually a wasted copy (in assembler_x86.cpp).
>>
>> Eliminated the use of placement new syntax via operator new with a holder
>> argument to copy a Resource object into a holder. This included runtime
>> verification that the size of the object is not greater than the size of
>> _relocbuf; we now do corresponding verification at compile-time. This also
>> included an incorrect attempt at a runtime check that the Relocation base
>> class would be at the same address as the derived class being constructed; we
>> now perform that check correctly. We also discuss in a comment the layout
>> assumption being made (that isn't required by the standard but is provided by
>> all supported compilers), and what to do if we encounter a compiler that
>> behaves differently.
>>
>> Eliminated the idiom of making a default-constructed holder and then
>> overwriting its held relocation with a newly constructed one, using the
>> aforementioned (and eliminated) operator new.
>> Instead, RelocationHolder now has a
>> factory function template (construct) for creating holders with a given
>> relocation type, constructed using provided arguments. (The arguments are taken
>> as const-ref rather than using perfect forwarding, as the tools for the latter
>> are not (yet) approved for use in HotSpot. Using const-ref is good enough in
>> this case.)
>>
>> Describe and verify other assumptions being made, such as all Relocation
>> classes being trivially destructible.
>>
>> Testing:
>> mach5 tier1-5
>>
>> Future work:
>>
>> * RelocationHolder::reloc isn't const-correct. Making it so will require
>> adjustment of some callers. I'll follow up with an RFE to address this.
>>
>> * Relocation classes have many virtual function overrides that are unmarked.
>> I'll follow up with an RFE to add "override" specifiers.
>>
>> Potential issue: The removal of RelocationHolder(Relocation*) might not work
>> for some platforms. I've tested on platforms supported by Oracle (where there
>> was only one (mistaken) use). There might be uses by other platforms.
>
> Kim Barrett has updated the pull request incrementally with one additional commit since the last revision:
>
>   forgot to update comment

Thanks all for reviews and comments.

-------------

PR: https://git.openjdk.org/jdk/pull/11618

From kbarrett at openjdk.org Fri Dec 16 20:53:18 2022
From: kbarrett at openjdk.org (Kim Barrett)
Date: Fri, 16 Dec 2022 20:53:18 GMT
Subject: Integrated: 8160404: RelocationHolder constructors have bugs
In-Reply-To:
References:
Message-ID:

On Fri, 9 Dec 2022 22:00:59 GMT, Kim Barrett wrote:

> Please review this change to construction and copying of the Relocation and
> RelocationHolder classes, to eliminate some questionable C++ usage.
>
> The array type for RelocationHolder::_relocbuf is changed from void* to char,
> because using a char array for raw memory is countenanced by the standard,
> while not so much for an array of void*.
> The desired alignment is maintained
> via a union, since using alignas is not (yet) permitted in HotSpot code.
>
> There is also now a comment discussing the use of _relocbuf in more detail,
> including some areas of continued sketchiness with respect to standard
> conformance and reliance on implementation-dependent behavior.
>
> No longer use trivial copy and assignment for RelocationHolder, since that
> isn't technically correct. The Relocation in the holder is not trivially
> copyable, since it is polymorphic. It seemed to work in practice with the
> supported compilers, but we shouldn't (and don't need to) rely on it. Instead
> we have a new virtual function Relocation::copy_into that copies the most
> derived object into the holder's _relocbuf using placement new.
>
> Eliminated the implicit conversion constructor from Relocation to holder that
> wordwise copied (to possibly beyond the end of) the Relocation into the
> holder's _relocbuf. We could have implemented this more carefully with the
> new approach (using copy_into), but we don't actually need this conversion.
> The only use of it was actually a wasted copy (in assembler_x86.cpp).
>
> Eliminated the use of placement new syntax via operator new with a holder
> argument to copy a Resource object into a holder. This included runtime
> verification that the size of the object is not greater than the size of
> _relocbuf; we now do corresponding verification at compile-time. This also
> included an incorrect attempt at a runtime check that the Relocation base
> class would be at the same address as the derived class being constructed; we
> now perform that check correctly. We also discuss in a comment the layout
> assumption being made (that isn't required by the standard but is provided by
> all supported compilers), and what to do if we encounter a compiler that
> behaves differently.
>
> Eliminated the idiom of making a default-constructed holder and then
> overwriting its held relocation with a newly constructed one, using the
> aforementioned (and eliminated) operator new. Instead, RelocationHolder now has a
> factory function template (construct) for creating holders with a given
> relocation type, constructed using provided arguments. (The arguments are taken
> as const-ref rather than using perfect forwarding, as the tools for the latter
> are not (yet) approved for use in HotSpot. Using const-ref is good enough in
> this case.)
>
> Describe and verify other assumptions being made, such as all Relocation
> classes being trivially destructible.
>
> Testing:
> mach5 tier1-5
>
> Future work:
>
> * RelocationHolder::reloc isn't const-correct. Making it so will require
> adjustment of some callers. I'll follow up with an RFE to address this.
>
> * Relocation classes have many virtual function overrides that are unmarked.
> I'll follow up with an RFE to add "override" specifiers.
>
> Potential issue: The removal of RelocationHolder(Relocation*) might not work
> for some platforms. I've tested on platforms supported by Oracle (where there
> was only one (mistaken) use). There might be uses by other platforms.

This pull request has now been integrated.

Changeset: bfa921ae
Author:    Kim Barrett
URL:       https://git.openjdk.org/jdk/commit/bfa921ae6ce068c53dfa708d6d3d2cddbad5fc33
Stats:     252 lines in 3 files changed: 143 ins; 47 del; 62 mod

8160404: RelocationHolder constructors have bugs

Reviewed-by: kvn, jrose, jvernee

-------------

PR: https://git.openjdk.org/jdk/pull/11618

From duke at openjdk.org Sat Dec 17 01:25:52 2022
From: duke at openjdk.org (SUN Guoyun)
Date: Sat, 17 Dec 2022 01:25:52 GMT
Subject: RFR: JDK-8298813: [C2] Converting double to float cause a loss of precision and resulting crypto.aes scores fluctuate
In-Reply-To:
References:
Message-ID: <9NvCsH7zbF4PZcw3p07BdYuUwQf0GLOVNBJ_JsckErk=.375a13a1-a6f4-4f71-82f5-824906e7c6a6@github.com>

On Thu, 15 Dec 2022 02:52:51 GMT, SUN Guoyun wrote:

> Hi all,
> For C2, converting double to float causes a loss of precision:
>

> ./chaitin.cpp:221
> _high_frequency_lrg = MIN2(double(OPTO_LRG_HIGH_FREQ), _cfg.get_outer_loop_frequency());
> 
>
> Here, the type of _high_frequency_lrg is float, so precision may be lost. It is then used here:
>

> ./coalesce.cpp:379
> if( lrg._maxfreq >= _phc.high_frequency_lrg() ) {
>    ...
> }
> 
> Here, the type of lrg._maxfreq is double, so _high_frequency_lrg is converted back to double. Because of the precision lost in _high_frequency_lrg, the condition may evaluate to either true or false.
>
> There are two cases that I tested with SPECjvm2008 crypto.aes.
> case 1:
>

> //chaitin.cpp:221
> // fcvt.s.d $f0,$f0 #double->float
> d = 16.994714324523816
> f = 16.9947147
> 
> //coalesce.cpp:379
> // fcvt.d.s $f0,$f0 #float->double
> // fcmp.sle.d $fcc2,$f0,$f1
> (gdb) i r fa0
> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.994714736938477}
> (gdb) i r fa1
> fa1 {f = 0x0, d = 0x10} {f = -7.68722312e-24, d = 16.994714324523816}
> 
>
> case 2:
>

> //chaitin.cpp:221
> // fcvt.s.d $f0,$f0
> d = 16.996332681816536
> f = 16.9963322
> 
> //coalesce.cpp
> // fcvt.d.s $f0,$f0
> // fcmp.sle.d $fcc2,$f0,$f1
> (gdb) i r fa0
> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.996332168579102}
> (gdb) i r fa1
> fa1 {f = 0x0, d = 0x10} {f = -1.73570044e-14, d = 16.996332681816536}
> 
>
> The above two cases result in different block generation (case 2 can insert new SpillCopyNodes), so the resulting crypto.aes score fluctuates.
>
> This is a patch to fix this problem. Please help review it.
>
> Thanks,
> Sun Guoyun

OK. You can focus on the instructions generated for the function com.sun.crypto.provider.DESCrypt::cipherBlock. Sometimes it is

// result 63.4 ops/m
// B5->B112->B55

   B5: #   out( B112 B6 ) <- in( B4 )  Freq: 0.999996
    BRge   T0, A6, B112 #@branchConIU_reg_reg_short  P=0.000001 C=-1.000000
   ...
   B55: #  out( B56 ) <- in( B112 B53 B54 )  Freq: 3.54712e-05
    st_w    T6, [SP + #40]  # spill 9
    CALL,static #@CallStaticJavaDirect  wrapper for: uncommon_trap(reason='range_check' action='make_not_entrant' debug_id='0')
   ILLTRAP   ;#@ShouldNotReachHere
   ...
  B112: # out( B55 ) <- in( B5 )  Freq: 1.01327e-06
  mov    T6, #0 #@loadConI
  JMP    B55 #@jmpDir_short
and sometimes it is

// result: 41.99 ops/m
// B5->B55

     B5: #   out( B55 B6 ) <- in( B4 )  Freq: 0.999996
     mov    A6, #0 #@loadConI
     BRge   T8, T6, B55 #@branchConIU_reg_reg_short  P=0.000001 C=-1.000000

     B55: #  out( B56 ) <- in( B112 B53 B54 )  Freq: 3.54712e-05
             st_w    A6, [SP + #4]  # spill 9
     CALL,static #@CallStaticJavaDirect  wrapper for: uncommon_trap(reason='range_check' action='make_not_entrant' debug_id='0')
     ILLTRAP   ;#@ShouldNotReachHere
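
The rounding effect in the quoted report can be reproduced in isolation. The sketch below narrows a double to float and widens it again, using the two frequency values from the gdb output above. The helper names `through_float` and `is_high_frequency` are hypothetical, and comparing a frequency against its own rounded copy is an assumption made only to show that the direction of rounding decides the outcome of `>=`.

```cpp
// Hypothetical helper: models storing a double in a float field and
// widening it back to double when it is later compared.
inline double through_float(double value) {
  float narrowed = static_cast<float>(value);  // double -> float, may round
  return static_cast<double>(narrowed);        // float -> double, exact
}

// Hypothetical model of a test shaped like `maxfreq >= high_frequency()`,
// where the threshold has passed through a float field.
inline bool is_high_frequency(double maxfreq, double threshold) {
  return maxfreq >= through_float(threshold);
}
```

For 16.994714324523816 the nearest float lies above the double, so a value equal to the original threshold no longer satisfies `>=`. For 16.996332681816536 the nearest float lies below, so the same kind of comparison succeeds.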

-------------

PR: https://git.openjdk.org/jdk/pull/11685

From duke at openjdk.org Sat Dec 17 01:30:53 2022
From: duke at openjdk.org (SUN Guoyun)
Date: Sat, 17 Dec 2022 01:30:53 GMT
Subject: RFR: JDK-8298813: [C2] Converting double to float cause a loss of precision and resulting crypto.aes scores fluctuate
In-Reply-To:
References:
Message-ID: <-J56d9xbDEuv1VnG1XIP_YOr_EBuJhyNxsSMe0R0tu0=.5e0e93f6-4f71-475a-9bfa-d4167664411d@github.com>

On Thu, 15 Dec 2022 02:52:51 GMT, SUN Guoyun wrote:

> Hi all,
> For C2, converting double to float causes a loss of precision:
>

> ./chaitin.cpp:221
> _high_frequency_lrg = MIN2(double(OPTO_LRG_HIGH_FREQ), _cfg.get_outer_loop_frequency());
> 
>
> Here, the type of _high_frequency_lrg is float, so precision may be lost. It is then used here:
>

> ./coalesce.cpp:379
> if( lrg._maxfreq >= _phc.high_frequency_lrg() ) {
>    ...
> }
> 
> Here, the type of lrg._maxfreq is double, so _high_frequency_lrg is converted back to double. Because of the precision lost in _high_frequency_lrg, the condition may evaluate to either true or false.
>
> There are two cases that I tested with SPECjvm2008 crypto.aes.
> case 1:
>

> //chaitin.cpp:221
> // fcvt.s.d $f0,$f0 #double->float
> d = 16.994714324523816
> f = 16.9947147
> 
> //coalesce.cpp:379
> // fcvt.d.s $f0,$f0 #float->double
> // fcmp.sle.d $fcc2,$f0,$f1
> (gdb) i r fa0
> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.994714736938477}
> (gdb) i r fa1
> fa1 {f = 0x0, d = 0x10} {f = -7.68722312e-24, d = 16.994714324523816}
> 
>
> case 2:
>

> //chaitin.cpp:221
> // fcvt.s.d $f0,$f0
> d = 16.996332681816536
> f = 16.9963322
> 
> //coalesce.cpp
> // fcvt.d.s $f0,$f0
> // fcmp.sle.d $fcc2,$f0,$f1
> (gdb) i r fa0
> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.996332168579102}
> (gdb) i r fa1
> fa1 {f = 0x0, d = 0x10} {f = -1.73570044e-14, d = 16.996332681816536}
> 
>
> The above two cases result in different block generation (case 2 can insert new SpillCopyNodes), so the resulting crypto.aes score fluctuates.
>
> This is a patch to fix this problem. Please help review it.
>
> Thanks,
> Sun Guoyun

Also, to eliminate the score fluctuation caused by C1, it is recommended that you use `-XX:-TieredCompilation`.

-------------

PR: https://git.openjdk.org/jdk/pull/11685

From duke at openjdk.org Sat Dec 17 01:59:50 2022
From: duke at openjdk.org (SUN Guoyun)
Date: Sat, 17 Dec 2022 01:59:50 GMT
Subject: RFR: JDK-8298813: [C2] Converting double to float cause a loss of precision and resulting crypto.aes scores fluctuate
In-Reply-To:
References:
Message-ID:

On Thu, 15 Dec 2022 02:52:51 GMT, SUN Guoyun wrote:

> Hi all,
> For C2, converting double to float causes a loss of precision:
>

> ./chaitin.cpp:221
> _high_frequency_lrg = MIN2(double(OPTO_LRG_HIGH_FREQ), _cfg.get_outer_loop_frequency());
> 
>
> Here, the type of _high_frequency_lrg is float, so precision may be lost. It is then used here:
>

> ./coalesce.cpp:379
> if( lrg._maxfreq >= _phc.high_frequency_lrg() ) {
>    ...
> }
> 
> Here, the type of lrg._maxfreq is double, so _high_frequency_lrg is converted back to double. Because of the precision lost in _high_frequency_lrg, the condition may evaluate to either true or false.
>
> There are two cases that I tested with SPECjvm2008 crypto.aes.
> case 1:
>

> //chaitin.cpp:221
> // fcvt.s.d $f0,$f0 #double->float
> d = 16.994714324523816
> f = 16.9947147
> 
> //coalesce.cpp:379
> // fcvt.d.s $f0,$f0 #float->double
> // fcmp.sle.d $fcc2,$f0,$f1
> (gdb) i r fa0
> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.994714736938477}
> (gdb) i r fa1
> fa1 {f = 0x0, d = 0x10} {f = -7.68722312e-24, d = 16.994714324523816}
> 
>
> case 2:
>

> //chaitin.cpp:221
> // fcvt.s.d $f0,$f0
> d = 16.996332681816536
> f = 16.9963322
> 
> //coalesce.cpp
> // fcvt.d.s $f0,$f0
> // fcmp.sle.d $fcc2,$f0,$f1
> (gdb) i r fa0
> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.996332168579102}
> (gdb) i r fa1
> fa1 {f = 0x0, d = 0x10} {f = -1.73570044e-14, d = 16.996332681816536}
> 
>
> The above two cases result in different block generation (case 2 can insert new SpillCopyNodes), so the resulting crypto.aes score fluctuates.
>
> This is a patch to fix this problem. Please help review it.
>
> Thanks,
> Sun Guoyun

By the way, adding the following change to this patch allows crypto.aes to get a stable high score. But I'm not sure if this change can be committed with this patch.

diff --git a/src/hotspot/share/opto/coalesce.cpp b/src/hotspot/share/opto/coalesce.cpp
index b95987c4b09..82fbc72ad7e 100644                                           
--- a/src/hotspot/share/opto/coalesce.cpp                                       
+++ b/src/hotspot/share/opto/coalesce.cpp                                       
@@ -376,7 +376,7 @@ void PhaseAggressiveCoalesce::insert_copies( Matcher &matcher ) {
             LRG &lrg = lrgs(nidx);                                             
                                                                                
             // If this lrg has a high frequency use/def                        
-            if( lrg._maxfreq >= _phc.high_frequency_lrg() ) {                  
+            if( lrg._maxfreq > _phc.high_frequency_lrg() ) {                   
But I'm not sure if this change can be committed with this patch or if a new patch can be committed for it.

-------------

PR: https://git.openjdk.org/jdk/pull/11685

From dchuyko at openjdk.org Sat Dec 17 02:17:31 2022
From: dchuyko at openjdk.org (Dmitry Chuyko)
Date: Sat, 17 Dec 2022 02:17:31 GMT
Subject: RFR: JDK-8153837: AArch64: Handle special cases for MaxINode & MinINode [v4]
In-Reply-To:
References:
Message-ID:

> This is a return to an older optimization suggestion for integer Math.min() and Math.max(). See original thread at https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2016-April/022367.html
>
> In a case when one of the arguments is a constant 0, 1 or -1, this constant is currently materialized (like zero is moved to a dedicated register). Instead of that we can produce denoted values from the ZR zero register using CSEL, CSINC or CSINV. Thus the register usage and the load instruction are removed.
>
> The implementation adds 3 additional matching rules for min and 3 for max in the aarch64.ad file. As constants currently can be in any MinI/MaxI node input, the ideal transformation for those nodes is changed to put a constant into the first input. This allows a single rule for each value instead of two. The first input is not very natural; it is selected because of the https://github.com/openjdk/jdk/pull/3513 optimization that added right-spline transformation for MinI/MaxI. I think it can be symmetrically changed to left-spline but it was not a subject of this PR. Each match rule generates 2 'instruct' instructions following the pattern introduced in https://github.com/openjdk/jdk/commit/fde854e03779e6809279dbf85b0645eb49d8736a. They are newly added ones with one peculiarity. They try to follow the regular naming scheme but lack one of the operands, the one that corresponds to a constant that is not actually needed. E.g. 'instruct cmovI_reg_immM1_ge(iRegINoSp dst, iRegI src1, rFlagsReg cr)', which doesn't have an 'immI_M1' input.
>
> New TestMinMaxIntrinsics jtreg test compares results produced by intrinsics for generic and specialized versions with the Java implementation of min and max. Intrinsics are used in lambdas that are compiled with the Whitebox API. The cases include -1, 0, 1 and a couple of regular values. This test can be used to check the generated assembly by adding `-XX:+PrintCompilation -XX:+PrintOptoAssembly`. Nano-benchmarks where a specialized version is called from a non-inlined method also show the changed code with `-prof perfasm`.
>
> A typical nano-benchmark with a loop and a Blackhole over an array shows no difference in performance, as the constant is anyway moved out of the loop and usually there are enough registers. However, special nano-benchmarks can be considered, e.g.
>
>
> @Benchmark
> @OperationsPerInvocation(TESTSIZE)
> public int max0_use8_i() {
>     int sum = 0;
>     for(int i = 0; i < TESTSIZE; i++) {
>         use8(0, 1, 2, 3, 4, 5, 6, 7);
>         sum += Math.max(i, 0);
>     }
>     return sum;
> }
>
> @CompilerControl(CompilerControl.Mode.DONT_INLINE)
> public void use8(int p0, int p1, int p2, int p3, int p4, int p5, int p6, int p7) {
> }
>
>
> Saving ~1 L1 icache load makes it ~9% faster, and more than half of the cost is the use8() helper.
>
> The new version passes the new TestMinMaxIntrinsics test on x86 and aarch64 and tier1,2 tests on those platforms (release build).
>
> The zero case is especially interesting. In general, I wonder how it could be possible to allocate ZR as a register for zero constants so the load there could be a no-op and we might get rid of special rules with immI0.
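
The idea in the description above can be checked with plain arithmetic: when the constant is 0, 1 or -1, the comparison can be made against zero and the constant derived from the zero register by the select itself. The sketch below models the three AArch64 conditional selects with ZR as the second source. The function names and the exact conditions chosen are illustrative assumptions, not the rules added to aarch64.ad.

```cpp
#include <cstdint>

// Models of AArch64 conditional selects with the zero register (ZR) as the
// second source operand. Hypothetical helper names.
inline std::int32_t csel_zr(bool cond, std::int32_t src)  { return cond ? src : 0; }   // CSEL:  else ZR
inline std::int32_t csinc_zr(bool cond, std::int32_t src) { return cond ? src : 1; }   // CSINC: else ZR + 1
inline std::int32_t csinv_zr(bool cond, std::int32_t src) { return cond ? src : ~0; }  // CSINV: else ~ZR == -1

// Every condition below compares x with zero, so the constant argument of
// min/max never needs to be loaded into a register.
inline std::int32_t max_with_0(std::int32_t x)  { return csel_zr(x > 0, x); }
inline std::int32_t max_with_1(std::int32_t x)  { return csinc_zr(x > 0, x); }   // x > 0 implies x >= 1
inline std::int32_t max_with_m1(std::int32_t x) { return csinv_zr(x >= 0, x); }  // x < 0 implies x <= -1
inline std::int32_t min_with_0(std::int32_t x)  { return csel_zr(x < 0, x); }
inline std::int32_t min_with_1(std::int32_t x)  { return csinc_zr(x <= 0, x); }  // x > 0 implies x >= 1
inline std::int32_t min_with_m1(std::int32_t x) { return csinv_zr(x < 0, x); }   // x >= 0 implies x > -1
```

Because the arguments are integers, `x > 0` is equivalent to `x >= 1` and `x < 0` to `x <= -1`, which is why a single compare against zero is enough in all six cases.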

Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision:

  392806Register, iRegIorL2I matched, m4 cleanup

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/11570/files
  - new: https://git.openjdk.org/jdk/pull/11570/files/489a2118..101f5a04

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=11570&range=03
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11570&range=02-03

  Stats: 109 lines in 3 files changed: 26 ins; 30 del; 53 mod
  Patch: https://git.openjdk.org/jdk/pull/11570.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/11570/head:pull/11570

PR: https://git.openjdk.org/jdk/pull/11570

From dchuyko at openjdk.org Sat Dec 17 02:18:04 2022
From: dchuyko at openjdk.org (Dmitry Chuyko)
Date: Sat, 17 Dec 2022 02:18:04 GMT
Subject: RFR: JDK-8153837: AArch64: Handle special cases for MaxINode & MinINode [v2]
In-Reply-To: <-Yd26-7XVD93kSYineugXL4nBD-mZk907Gj6JyF-yD0=.3746c63c-5377-4d72-86e2-08fff25e27b3@github.com>
References: <-Yd26-7XVD93kSYineugXL4nBD-mZk907Gj6JyF-yD0=.3746c63c-5377-4d72-86e2-08fff25e27b3@github.com>
Message-ID:

On Fri, 16 Dec 2022 02:37:22 GMT, Hao Sun wrote:

>> Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Reverted Ideal change, moved definitions to m4
>
> src/hotspot/cpu/aarch64/aarch64_ad.m4 line 555:
>
>> 553:
>> 554: ins_encode %{
>> 555: __ $2(as_Register($dst$$reg),
>
> I wonder if it would be better to use `$dst$$Register` here and several other sites in this patch.
> Suggestion:
>
> __ $2($dst$$Register,

Changed as suggested. I also did some cleanup in the .m4 file and fixed minor mismatches between the .ad and .m4 files.

-------------

PR: https://git.openjdk.org/jdk/pull/11570

From dchuyko at openjdk.org Sat Dec 17 02:25:49 2022
From: dchuyko at openjdk.org (Dmitry Chuyko)
Date: Sat, 17 Dec 2022 02:25:49 GMT
Subject: RFR: JDK-8153837: AArch64: Handle special cases for MaxINode & MinINode [v4]
In-Reply-To:
References:
Message-ID:

On Sat, 17 Dec 2022 02:17:31 GMT, Dmitry Chuyko wrote:

>> This is a return to an older optimization suggestion for integer Math.min() and Math.max(). See original thread at https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2016-April/022367.html
>>
>> In a case when one of the arguments is a constant 0, 1 or -1, this constant is currently materialized (like zero is moved to a dedicated register). Instead of that we can produce denoted values from the ZR zero register using CSEL, CSINC or CSINV. Thus the register usage and the load instruction are removed.
>>
>> The implementation adds 3 additional matching rules for min and 3 for max in the aarch64.ad file. As constants currently can be in any MinI/MaxI node input, the ideal transformation for those nodes is changed to put a constant into the first input. This allows a single rule for each value instead of two. The first input is not very natural; it is selected because of the https://github.com/openjdk/jdk/pull/3513 optimization that added right-spline transformation for MinI/MaxI. I think it can be symmetrically changed to left-spline but it was not a subject of this PR. Each match rule generates 2 'instruct' instructions following the pattern introduced in https://github.com/openjdk/jdk/commit/fde854e03779e6809279dbf85b0645eb49d8736a. They are newly added ones with one peculiarity. They try to follow the regular naming scheme but lack one of the operands, the one that corresponds to a constant that is not actually needed. E.g. 'instruct cmovI_reg_immM1_ge(iRegINoSp dst, iRegI src1, rFlagsReg cr)', which doesn't have an 'immI_M1' input.
>>
>> New TestMinMaxIntrinsics jtreg test compares results produced by intrinsics for generic and specialized versions with the Java implementation of min and max. Intrinsics are used in lambdas that are compiled with the Whitebox API. The cases include -1, 0, 1 and a couple of regular values. This test can be used to check the generated assembly by adding `-XX:+PrintCompilation -XX:+PrintOptoAssembly`. Nano-benchmarks where a specialized version is called from a non-inlined method also show the changed code with `-prof perfasm`.
>>
>> A typical nano-benchmark with a loop and a Blackhole over an array shows no difference in performance, as the constant is anyway moved out of the loop and usually there are enough registers. However, special nano-benchmarks can be considered, e.g.
>>
>>
>> @Benchmark
>> @OperationsPerInvocation(TESTSIZE)
>> public int max0_use8_i() {
>>     int sum = 0;
>>     for(int i = 0; i < TESTSIZE; i++) {
>>         use8(0, 1, 2, 3, 4, 5, 6, 7);
>>         sum += Math.max(i, 0);
>>     }
>>     return sum;
>> }
>>
>> @CompilerControl(CompilerControl.Mode.DONT_INLINE)
>> public void use8(int p0, int p1, int p2, int p3, int p4, int p5, int p6, int p7) {
>> }
>>
>>
>> Saving ~1 L1 icache load makes it ~9% faster, and more than half of the cost is the use8() helper.
>>
>> The new version passes the new TestMinMaxIntrinsics test on x86 and aarch64 and tier1,2 tests on those platforms (release build).
>>
>> The zero case is especially interesting. In general, I wonder how it could be possible to allocate ZR as a register for zero constants so the load there could be a no-op and we might get rid of special rules with immI0.
>
> Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision:
>
>   392806Register, iRegIorL2I matched, m4 cleanup

I extended min/max matching to accept iRegIorL2I operands. It is a bit controversial that iRegIorL2I can be passed down to an instruction that operates on iRegI, but it works.
iRegIorL2I can't be used in effect(), which is needed to provide a mask for the instruct. The other solutions would be either to stay with max/min(iRegI, iRegI) only or not to issue split instructs.

-------------

PR: https://git.openjdk.org/jdk/pull/11570

From dcubed at openjdk.org Sat Dec 17 14:46:56 2022
From: dcubed at openjdk.org (Daniel D. Daugherty)
Date: Sat, 17 Dec 2022 14:46:56 GMT
Subject: RFR: 8297235: ZGC: assert(regs[i] != regs[j]) failed: Multiple uses of register: rax [v3]
In-Reply-To: <0fGmbCvEUHokqoFR0ndPwd7yA5I_SS-MlBneCPP3LqY=.c4b67306-e5bb-4af5-8875-adba9ea9d0f1@github.com>
References: <0fGmbCvEUHokqoFR0ndPwd7yA5I_SS-MlBneCPP3LqY=.c4b67306-e5bb-4af5-8875-adba9ea9d0f1@github.com>
Message-ID:

On Thu, 8 Dec 2022 09:01:08 GMT, Axel Boldt-Christmas wrote:

>> Tests java/util/stream/test/org/openjdk/tests/java/util/* with -XX:+UseZGC -Xcomp -XX:-TieredCompilation crash with `assert(regs[i] != regs[j]) failed: Multiple uses of register: rax`. More specifically, the crash happens in the compilation of java.util.concurrent.ForkJoinTask::awaitDone.
>>
>> The reason seems to be that the compare value and the memory input end up sharing a register. (It uses an Unsafe CAS which CASes an object reference into a field of that object, `oldval: rax` and `mem: [rax+offset]`). The Z load barrier stub dispatch implementation requires that the reference and reference address occupy distinct registers. In the loadP nodes this is established by marking all but the memory TEMP, which results in no sharing.
>>
>> This is not possible for the CompareAndSwapP / CompareAndExchangeP nodes as the compare value is an input node.
>>
>> The solution proposed here is less than ideal as it makes the CAS nodes require one extra TEMP register, which in the common case is unused. This puts unnecessary extra strain on the register allocation. The problem is that there is no way currently (that I can find) to express in .ad that a memory input must not share registers with a specific other input.
>> >> There is an alternative solution for this specific crash which does not use a second TEMP register (see commit: cfd5ced4e97e986fc10c5a8721b543cd3101c58a). It accomplish this by using the same trick that the aarch64 Z CAS node uses which is to specify the memory as indirect which results in the address being LEA into a register. However from what I can see this does not guarantee that the address and the reference does not share a register (`oldval: rax` and `mem: [rax]`). So it is theoretically broken, (and so is the aarch64 implementation). >> >> It is unclear to me if there is ever a way for C2 to generation a CAS which compares the address of the field with its content. >> >> I call on anyone with more knowledge about `adlc` and `C2` for feedback. And specifically I want to open up a discussion with these points: >> * Is there some other way of expressing in the .ad file that a memory input should not share some register? >> * If not, is this a worthwhile RFE? As it seems to be a patterned used at least in other places in Z. >> * Will the indirect input ever share a register with oldval and/or are the aarch64/riscv implementations broken because of this? How about ppc? >> >> Testing: linux-x64 zgc tagged tests tier 1-7 and some specific crashing tests with `-XX:+UseZGC -Xcomp -XX:-TieredCompilation` (in: java/util/stream/, java/util/concurrent/) > > Axel Boldt-Christmas has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains five additional commits since the last revision: > > - Remove problem listed tests > - Merge remote-tracking branch 'upstream_jdk/master' into JDK-8297235 > - indirect zXChgP as well > - indirect alternative > - JDK-8297235: ZGC: assert(regs[i] != regs[j]) failed: Multiple uses of register: rax This PR removed all entries from test/jdk/ProblemList-zgc.txt including this entry: jdk/internal/vm/Continuation/Fuzz.java#default 8298058 generic-x64 which has nothing to do with this bug fix. I'll restore that entry with a new bug shortly. ------------- PR: https://git.openjdk.org/jdk/pull/11410 From jrose at openjdk.org Sat Dec 17 18:15:55 2022 From: jrose at openjdk.org (John R Rose) Date: Sat, 17 Dec 2022 18:15:55 GMT Subject: RFR: 8160404: RelocationHolder constructors have bugs [v2] In-Reply-To: References: <7udKKpQ1XY438VTwGKXFuUoiRC4-PRiEQOxLbpEdZV8=.a4b60b9f-90f7-4dde-aaa9-90d3658f1ded@github.com> Message-ID: <7TzvMEFed1Kd4gWh1RL6Jz2xd8ECu317g2kfVSnfRVI=.8162b37c-de13-4616-a751-5189651c1277@github.com> On Thu, 15 Dec 2022 12:33:21 GMT, Jorn Vernee wrote: >> Kim Barrett has updated the pull request incrementally with one additional commit since the last revision: >> >> blank lines in include blocks > > src/hotspot/share/code/relocInfo.hpp line 892: > >> 890: inline RelocationHolder::RelocationHolder() : >> 891: RelocationHolder(Construct(), [&] (void* p) { return ::new (p) Relocation(); }) >> 892: {} > > And this would become > Suggestion: > > inline RelocationHolder::RelocationHolder() : RelocationHolder(Construct()) > {} Nice suggestion. That makes the whole pattern a little more palatable, since it puts the magic new in one place instead of N places. 
------------- PR: https://git.openjdk.org/jdk/pull/11618 From fgao at openjdk.org Mon Dec 19 01:14:53 2022 From: fgao at openjdk.org (Fei Gao) Date: Mon, 19 Dec 2022 01:14:53 GMT Subject: Integrated: 8298244: AArch64: Optimize vector implementation of AddReduction for floating point In-Reply-To: References: Message-ID: <4tH6BwNVFitJspMyU6104K7tN8lHGWj3bX4vWA7hYOY=.4b45b39f-3a0b-4d45-a566-9b5fac9107f7@github.com> On Wed, 14 Dec 2022 07:04:29 GMT, Fei Gao wrote: > The patch optimizes floating-point AddReduction for Vector API on NEON via faddp instructions [1]. > > Take AddReductionVF with 128-bit as an example. > > Here is the assembly code before the patch: > > fadd s18, s17, s16 > mov v19.s[0], v16.s[1] > fadd s18, s18, s19 > mov v19.s[0], v16.s[2] > fadd s18, s18, s19 > mov v19.s[0], v16.s[3] > fadd s18, s18, s19 > > > Here is the assembly code after the patch: > > faddp v19.4s, v16.4s, v16.4s > faddp s18, v19.2s > fadd s18, s18, s17 > > > As we can see, the patch adds all vector elements via faddp instructions and then adds beginning value, which is different from the old code, i.e., adding vector elements sequentially from beginning to end. It helps reduce four instructions for each AddReductionVF. > > But it may concern us that the patch will cause precision loss and generate incorrect results if superword vectorizes these java operations, because Java specifies a clear standard about precision for floating-point add reduction, which requires that we must add vector elements sequentially from beginning to end. Fortunately, we can enjoy the benefit but don't need to pay for the precision loss. Here are the reasons: > > 1. [JDK-8275275](https://bugs.openjdk.org/browse/JDK-8275275) disabled AddReductionVF/D for superword on NEON since no direct NEON instructions support them and, consequently, it's not profitable to auto-vectorize them. So, the vector implementation of these two vector nodes is only used by Vector API. > > 2. 
Vector API relaxes the requirement for floating-point precision of `ADD` [2]. "The result of such an operation is a function both of the input values (vector and mask) as well as the order of the scalar operations applied to combine lane values. In such cases the order is intentionally not defined." "If the platform supports a vector instruction to add or multiply all values in the vector, or if there is some other efficient machine code sequence, then the JVM has the option of generating this machine code." To sum up, Vector API allows us to add all vector elements in an arbitrary order and then add the beginning value, to generate optimal machine code. > > Tier 1~3 passed with no new failures on Linux AArch64 platform. > > Here is the perf data of jmh benchmark [3] for the patch: > > Benchmark size Mode Cnt Before After Units > Double128Vector.addReduction 1024 thrpt 5 2167.146 2717.873 ops/ms > Float128Vector.addReduction 1024 thrpt 5 1706.253 4890.909 ops/ms > Float64Vector.addReduction 1024 thrpt 5 1907.425 2732.577 ops/ms > > [1] https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--scalar---Floating-point-Add-Pair-of-elements--scalar-- > https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--vector---Floating-point-Add-Pairwise--vector-- > [2] https://docs.oracle.com/en/java/javase/19/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#fp_assoc > [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float128Vector.java#L316 > https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float64Vector.java#L316 > https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Double128Vector.java#L316 This pull request 
has now been integrated. Changeset: ba942c24 Author: Fei Gao Committer: Ningsheng Jian URL: https://git.openjdk.org/jdk/commit/ba942c24e8894f4422870fb53253f5946dc4f0d1 Stats: 512 lines in 5 files changed: 44 ins; 16 del; 452 mod 8298244: AArch64: Optimize vector implementation of AddReduction for floating point Reviewed-by: aph, xgong ------------- PR: https://git.openjdk.org/jdk/pull/11663 From pli at openjdk.org Mon Dec 19 02:44:47 2022 From: pli at openjdk.org (Pengfei Li) Date: Mon, 19 Dec 2022 02:44:47 GMT Subject: RFR: 8298632: [TESTBUG] Add IR checks in jtreg vectorization tests [v2] In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 12:48:41 GMT, Daniel Skantz wrote: > If we wantUseAVX or other platform-specific flags in annotations, maybe also need [JDK-8297490](https://bugs.openjdk.org/browse/JDK-8297490) @danielogh Thanks for letting me know this. As I have discussed with kvn, currently we have to keep `vm.flagless` in these tests. Users cannot select non-default UseSSE/UseAVX values without the `export TEST_VM_FLAGLESS=true` workaround. So, I don't think I need to add platform-specific flags in `@applyIf` for now. ------------- PR: https://git.openjdk.org/jdk/pull/11687 From pli at openjdk.org Mon Dec 19 02:48:54 2022 From: pli at openjdk.org (Pengfei Li) Date: Mon, 19 Dec 2022 02:48:54 GMT Subject: RFR: 8298632: [TESTBUG] Add IR checks in jtreg vectorization tests [v2] In-Reply-To: References: Message-ID: <4zT84cDMc31afIgo7iB3OW-V2hpUVnKkzp_-Rk9mWhQ=.04eb5169-c3eb-4eef-acc7-42201e13b7ab@github.com> On Fri, 16 Dec 2022 19:52:46 GMT, Vladimir Kozlov wrote: > Looks good. > > Need second review. Preferable from some one familiar with x86 IR rules. Thanks @vnkozlov for the review. Can I have a second review for x86? 
------------- PR: https://git.openjdk.org/jdk/pull/11687 From thartmann at openjdk.org Mon Dec 19 06:13:49 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 19 Dec 2022 06:13:49 GMT Subject: RFR: 8297724: Loop strip mining prevents some empty loops from being eliminated [v2] In-Reply-To: References: <0h64UphIm15r5MArLZeVhafwWGb6iXMsNreW9lqai3A=.8d4955e7-cfc9-4e6b-a4b4-72bd1f051c9f@github.com> Message-ID: On Fri, 16 Dec 2022 14:07:17 GMT, Roland Westrelin wrote: >> When an empty loop is found, it's removed and as a consequence the >> outer strip mine loop and the safepoint that it contains are also >> removed. A counted loop is empty if it has the minimum number of nodes >> that a well formed counted loop contains. In some cases, the loop has >> extra nodes and the safepoint in the outer loop is the only node that >> keeps those extra nodes alive. If the safepoint was to be removed, >> then the counted loop would have the minimum number of nodes and be >> considered empty. But the safepoint can't be removed until the loop is >> considered empty which only happens if it has the minimum of nodes. As >> a result, these loops are not removed. Note that now that the loop >> strip mining loop nest is constructed even if UseCountedLoopSafepoints >> is false, there's a regression where some loops used to be removed as >> empty before but not anymore. >> >> The fix I propose is to extend IdealLoopTree::do_remove_empty_loop() >> so it handles those cases. If it encounters a loop with no flow >> control in the loop body but a number of nodes greater than the >> minimum number of nodes, it starts from the extra nodes in the loop >> body and follows uses until it finds a side effect, ignoring the >> safepoint of the outer loop. If it finds none, then the extra nodes >> can be removed and the loop is empty. 
This also works if the extra >> nodes are kept alive by the safepoints of 2 different counted loops >> and one can only be proven empty if the other one is as well (and the >> other one proven empty if the first one is) and should work even if >> there are more than 2 nodes involved.. > > Roland Westrelin has updated the pull request incrementally with two additional commits since the last revision: > > - Update test/hotspot/jtreg/compiler/c2/irTests/TestLSMMissedEmptyLoop.java > > Co-authored-by: Tobias Hartmann > - Update src/hotspot/share/opto/loopTransform.cpp > > Co-authored-by: Tobias Hartmann All tests passed. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11699 From thartmann at openjdk.org Mon Dec 19 06:19:49 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 19 Dec 2022 06:19:49 GMT Subject: RFR: 8298824: C2 crash: assert(is_Bool()) failed: invalid node class: ConI In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 09:37:26 GMT, Christian Hagedorn wrote: > [JDK-8292889](https://bugs.openjdk.org/browse/JDK-8292289) added the following optimization to `BoolNode::Ideal()` for patterns that include `CMoveI` nodes: > > https://github.com/openjdk/jdk/blob/fa322e40b68abf0a253040d14414d41f4e01e028/src/hotspot/share/opto/subnode.cpp#L1465-L1472 > > However, we could have a `CMoveI` during IGVN that will later be folded because the `Bool` condition node was replaced by a constant but IGVN has not processed this node, yet: > > > ![Screenshot from 2022-12-16 09-40-28](https://user-images.githubusercontent.com/17833009/208068197-4819b322-604c-412e-8898-3d3546a8a663.png) > > We fail when trying to call `as_Bool()` on `28 ConI`. The fix is straight forward to additionally check if we actually have a `BoolNode`. > > Thanks, > Christian Looks good. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11705 From thartmann at openjdk.org Mon Dec 19 06:21:54 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 19 Dec 2022 06:21:54 GMT Subject: RFR: 8296412: Special case infinite loops with unmerged backedges in IdealLoopTree::check_safepts [v2] In-Reply-To: References: <5-8UtfPNSyYrXf274GWDUs34oVWLaH6JEG_38Dr4eUQ=.501fd079-2e90-4ba7-899c-99cf1808201a@github.com> Message-ID: <-t0PJRJEikyPjojYJ10OcgjhJVDKIqEsFuCS5yQ4u2k=.4c4e4895-570a-44c0-b471-a447478bc262@github.com> On Fri, 16 Dec 2022 12:21:43 GMT, Emanuel Peter wrote: >> **Context** >> During parsing, we insert SafePoints if we jump from higher to lower bci (`maybe_add_safepoint` is called for every if, goto etc). >> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/parse.hpp#L490-L494 >> Generally, this alligns with backedges: the assumption is that the loop-head sits at the smallest bci of all blocks in the loop. So every jump back to the loop-head goes from higher to lower bci, hence we place a SafePoint just before the jump. >> >> Also: the first `build_loop_tree` may not attach an infinite loop to the loop-tree. If during the same loop-opts-phase we go to `beautify_loops` and it requires us rebuilding the loop-tree (eg because some other loop did `merge_many_backedges`), we call `build_loop_tree` again, and this time around we do detect the infinite loop (it now has a NeverBranch exit, so it is attached because of that). >> Afterwards, we call `IdealLoopTree::check_safepts`, which tries to find SafePoints on all backedges. Normally, we have SafePoints on all backedges, just before we go back to the head. >> >> **Problem case** >> My jasm fuzzer produced some infinite loops that have the following form: >> The loop head is not at the smallest bci (bytecode index) of all blocks in the loop. So the SafePoints are placed somewhere in the body of the loop, just before an if branches into the two backedges. 
Because this is an infinite loop, it is only attached to the loop-tree in `build_loop_tree` after `beautify_loops`, so the two backedges were not merged. >> When we call `IdealLoopTree::check_safepts`, we start with the inner loop, where we find the SafePoint above the if. Then we go to the outer loop. We don't find a SafePoint before we find the inner body. Now we decide to skip the inner body (which implies skipping the SafePoint in the body). The code assumes after skipping the inner loop, we are still in the outer loop. This is not true, because inner and outer loop have the same loop head (the backedges were not merged). We trigger an assert that checks that we are still in the outer loop (`nested loop`). >> >> Why did we not find this earlier? >> We have not extensively tested infinite loops before. Also, we have not tested loops with loop-heads that are not at the smallest bci of the loop. However, with my bytecode fuzzer I can find these issues. It is also more likely with irreducible loops: there at least one loop-entry cannot be at the smallest bci. Irreducible loops are not processed by `maybe_add_safepoint`, but once it only has a singe entry, it is not irreducible any more, and so it can happen that a loop-entry becomes loop head that does not have the smallest bci. >> >> **Solution** >> We could fix `maybe_add_safepoint` to not depend on bci, but rather the loop-tree from `ciTypeFlow`. That would be complex, and risky. That is not justified just for infinite loops, and even infinite loops where the loop head is not at the lowest bci. >> >> I decided to simply special case infinite loops. I detect if we have an outer loop with the same head as an inner loop. This should not happen, as we must have merged those backedges. Except if it is an infinite loop: We can break the scan, as we have already reached the loop's head. 
> > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Christian's review suggestion Looks good to me. test/hotspot/jtreg/compiler/loopopts/TestInfiniteLoopWithUnmergedBackedgesMain.java line 43: > 41: } > 42: > 43: Suggestion: ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11706 From thartmann at openjdk.org Mon Dec 19 06:48:51 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 19 Dec 2022 06:48:51 GMT Subject: RFR: JDK-8298813: [C2] Converting double to float cause a loss of precision and resulting crypto.aes scores fluctuate In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 02:52:51 GMT, SUN Guoyun wrote: > Hi all, > For C2, convert double to float cause a loss of precision, > >

> ./chaitin.cpp:221
> _high_frequency_lrg = MIN2(double(OPTO_LRG_HIGH_FREQ), _cfg.get_outer_loop_frequency());
> 
> > Here, _high_frequency_lrg has type float, so the assignment may lose precision. It is later used here: > >

> ./coalesce.cpp:379
> if( lrg._maxfreq >= _phc.high_frequency_lrg() ) {
>    ...
> }
> 
> Here, lrg._maxfreq has type double, so _high_frequency_lrg is converted back to double. Because _high_frequency_lrg has already lost precision, the condition can come out either true or false. > > I tested two cases with SPECjvm2008 crypto.aes. > case 1: >

> //chaitin.cpp:221
> // fcvt.s.d $f0,$f0 #double->float
> d = 16.994714324523816
> f = 16.9947147
> 
> //coalesce.cpp:379
> // fcvt.d.s $f0,$f0 #float->double
> // fcmp.sle.d $fcc2,$f0,$f1
> (gdb) i r fa0
> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.994714736938477}
> (gdb) i r fa1
> fa1 {f = 0x0, d = 0x10} {f = -7.68722312e-24, d = 16.994714324523816}
> 
> > case 2: >

> //chaitin.cpp:221
> // fcvt.s.d $f0,$f0
> d = 16.996332681816536
> f = 16.9963322
> 
> //coalesce.cpp
> // fcvt.d.s $f0,$f0
> // fcmp.sle.d $fcc2,$f0,$f1
> (gdb) i r fa0
> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.996332168579102}
> (gdb) i r fa1
> fa1 {f = 0x0, d = 0x10} {f = -1.73570044e-14, d = 16.996332681816536}
> 
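The two cases above can be reproduced outside the JVM. The following is a minimal Java sketch, not the HotSpot code: the class and method names are hypothetical, and it assumes that the value on the left of the comparison is the original double frequency, as the gdb output for fa1 suggests. It narrows each double to float, widens it back, and evaluates a comparison of the same shape as the one in coalesce.cpp.

```java
public class FloatRoundTrip {
    // Narrow a double to float and widen it back, as happens when a double
    // frequency is stored in a float field and later compared with a double.
    static double roundTrip(double d) {
        return (double) (float) d;
    }

    // Same shape as `lrg._maxfreq >= _phc.high_frequency_lrg()`:
    // the left side stays double, the right side has passed through a float.
    // (Hypothetical helper, for illustration only.)
    static boolean atLeastThreshold(double maxFreq, double storedFreq) {
        return maxFreq >= roundTrip(storedFreq);
    }

    public static void main(String[] args) {
        // Values taken from the two gdb sessions quoted above.
        double case1 = 16.994714324523816;
        double case2 = 16.996332681816536;

        // Case 1: the nearest float lies above the double (rounded up).
        System.out.println(roundTrip(case1) > case1);
        // Case 2: the nearest float lies below the double (rounded down).
        System.out.println(roundTrip(case2) < case2);

        // The same comparison therefore flips between the two cases.
        System.out.println(atLeastThreshold(case1, case1));
        System.out.println(atLeastThreshold(case2, case2));
    }
}
```

In case 1 the float rounds up, so the comparison is false; in case 2 it rounds down, so the comparison is true. Keeping the threshold in a double removes the dependence on rounding direction.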
> > These two cases lead to different block generation (case 2 can insert new SpillCopyNodes), so the resulting crypto.aes score fluctuates. > > This is a patch to fix this problem. Please help review it. > > Thanks, > Sun Guoyun Marked as reviewed by thartmann (Reviewer). Tests passed and performance results look good. I think the other change that you proposed should not be part of this. ------------- PR: https://git.openjdk.org/jdk/pull/11685 From xgong at openjdk.org Mon Dec 19 07:01:49 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 19 Dec 2022 07:01:49 GMT Subject: RFR: JDK-8153837: AArch64: Handle special cases for MaxINode & MinINode [v2] In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 21:07:28 GMT, Dmitry Chuyko wrote: >> This is a return to an older optimization suggestion for integer Math.min() and Math.max(). See original thread at https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2016-April/022367.html >> >> In a case when one of the arguments is a constant 0, 1 or -1 this constant is currently materialized (like zero is moved to a dedicated register). Instead of that we can produce denoted values from ZR zero register using CSEL, CSINC or CSINV. Thus the register usage and the load instruction are removed. >> >> The implementation adds 3 additional matching rules for min and 3 for max in aarch64.ad file. As constants currently can be in any MinI/MaxI node input, the ideal transformation for those nodes is changed to put a constant into the first input. This allows a single rule for each value instead of two. First input is not very natural, it is selected because of https://github.com/openjdk/jdk/pull/3513 optimization that added right-spline transformation for MinI/MaxI. I think it can be symmetrically changed to left-spline but it was not a subject of this PR.
Each match rule generates 2 'instruct' intructions following the pattern introduced in https://github.com/openjdk/jdk/commit/fde854e03779e6809279dbf85b0645eb49d8736a. They are newly added ones with one peculiarity. They try to follow regular naming scheme but lack one of operands that corresponds to a constant that is not actually needed. E.g. 'instruct cmovI_reg_immM1_ge(iRegINoSp dst, iRegI src1, rFlagsReg cr)' that don't have an 'immI_M1' input. >> >> New TestMinMaxIntrinsics jtreg test compares results produced by intrinsics for generic and specialized versions with Java implementation of min and max. Intrinsics are used in lambdas that are compiled with the Whitebox API. The cases include -1, 0, 1 and a couple of regular values. This test can be used to check the generated assembly by adding `-XX:+PrintCompilation -XX:+PrintOptoAssembly`. Nano-benchmarks where a specialized version is called from a not inlined method also show the changed code with `-prof perfasm`. >> >> Typical nano-benchmark with a loop and a Blackhole over array shows no difference in performance as the constant is anyway moved out of the loop and usually there are enough registers. However special nano-benchmarks can be considered, e.g. >> >> >> @Benchmark >> @OperationsPerInvocation(TESTSIZE) >> public int max0_use8_i() { >> int sum = 0; >> for(int i = 0; i < TESTSIZE; i++) { >> use8(0, 1, 2, 3, 4, 5, 6, 7); >> sum += Math.max(i, 0); >> } >> return sum; >> } >> >> @CompilerControl(CompilerControl.Mode.DONT_INLINE) >> public void use8(int p0, int p1, int p2, int p3, int p4, int p5, int p6, int p7) { >> } >> >> >> Saving ~1 L1 icache load makes it ~9% faster, and more than a half of the cost is use8() helper. >> >> New version passes new TestMinMaxIntrinsics test on x86 and aarch64 and tier1,2 tests on that platforms (release build). >> >> Zero case is especially interesting. 
In general, I wonder how it could be possible to allocate ZR as a register for zero constants so the load there could be a no-op and we might get rid of special rules with immI0. > > Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: > > Reverted Ideal change, moved definitions to m4 src/hotspot/cpu/aarch64/aarch64.ad line 13856: > 13854: cmovI_reg_imm0_lt(dst, src, cr); > 13855: %} > 13856: %} It sounds more sense to me if the ideal can make sure the constant input is the right child like what you did in the last commit. So that we only need the first rule like other commutative ops. ------------- PR: https://git.openjdk.org/jdk/pull/11570 From chagedorn at openjdk.org Mon Dec 19 07:13:57 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 19 Dec 2022 07:13:57 GMT Subject: RFR: 8298824: C2 crash: assert(is_Bool()) failed: invalid node class: ConI In-Reply-To: <-ngCoyxo6CAvXwmrHOwYLqqacjwRWt2pF4dh9431YrQ=.a052dfc5-4883-4afe-ba25-21bd062bba93@github.com> References: <-ngCoyxo6CAvXwmrHOwYLqqacjwRWt2pF4dh9431YrQ=.a052dfc5-4883-4afe-ba25-21bd062bba93@github.com> Message-ID: On Fri, 16 Dec 2022 14:15:35 GMT, Roland Westrelin wrote: >> [JDK-8292889](https://bugs.openjdk.org/browse/JDK-8292289) added the following optimization to `BoolNode::Ideal()` for patterns that include `CMoveI` nodes: >> >> https://github.com/openjdk/jdk/blob/fa322e40b68abf0a253040d14414d41f4e01e028/src/hotspot/share/opto/subnode.cpp#L1465-L1472 >> >> However, we could have a `CMoveI` during IGVN that will later be folded because the `Bool` condition node was replaced by a constant but IGVN has not processed this node, yet: >> >> >> ![Screenshot from 2022-12-16 09-40-28](https://user-images.githubusercontent.com/17833009/208068197-4819b322-604c-412e-8898-3d3546a8a663.png) >> >> We fail when trying to call `as_Bool()` on `28 ConI`. The fix is straight forward to additionally check if we actually have a `BoolNode`. 
>> >> Thanks, >> Christian > > Looks good to me. Thanks @rwestrel, @vnkozlov and @TobiHartmann for your reviews! ------------- PR: https://git.openjdk.org/jdk/pull/11705 From chagedorn at openjdk.org Mon Dec 19 07:13:58 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 19 Dec 2022 07:13:58 GMT Subject: Integrated: 8298824: C2 crash: assert(is_Bool()) failed: invalid node class: ConI In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 09:37:26 GMT, Christian Hagedorn wrote: > [JDK-8292889](https://bugs.openjdk.org/browse/JDK-8292289) added the following optimization to `BoolNode::Ideal()` for patterns that include `CMoveI` nodes: > > https://github.com/openjdk/jdk/blob/fa322e40b68abf0a253040d14414d41f4e01e028/src/hotspot/share/opto/subnode.cpp#L1465-L1472 > > However, we could have a `CMoveI` during IGVN that will later be folded because the `Bool` condition node was replaced by a constant but IGVN has not processed this node, yet: > > > ![Screenshot from 2022-12-16 09-40-28](https://user-images.githubusercontent.com/17833009/208068197-4819b322-604c-412e-8898-3d3546a8a663.png) > > We fail when trying to call `as_Bool()` on `28 ConI`. The fix is straight forward to additionally check if we actually have a `BoolNode`. > > Thanks, > Christian This pull request has now been integrated. 
Changeset: 5e678f75 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/5e678f7500e514f04637c546959613d4688f989c Stats: 53 lines in 2 files changed: 52 ins; 0 del; 1 mod 8298824: C2 crash: assert(is_Bool()) failed: invalid node class: ConI Reviewed-by: roland, kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/11705 From duke at openjdk.org Mon Dec 19 07:44:51 2022 From: duke at openjdk.org (SUN Guoyun) Date: Mon, 19 Dec 2022 07:44:51 GMT Subject: RFR: JDK-8298813: [C2] Converting double to float cause a loss of precision and resulting crypto.aes scores fluctuate [v2] In-Reply-To: References: Message-ID: > Hi all, > For C2, convert double to float cause a loss of precision, > >

> ./chaitin.cpp:221
> _high_frequency_lrg = MIN2(double(OPTO_LRG_HIGH_FREQ), _cfg.get_outer_loop_frequency());
> 
> > Here, _high_frequency_lrg has type float, so the assignment may lose precision. It is later used here: > >

> ./coalesce.cpp:379
> if( lrg._maxfreq >= _phc.high_frequency_lrg() ) {
>    ...
> }
> 
> Here, lrg._maxfreq has type double, so _high_frequency_lrg is converted back to double. Because _high_frequency_lrg has already lost precision, the condition can come out either true or false. > > I tested two cases with SPECjvm2008 crypto.aes. > case 1: >

> //chaitin.cpp:221
> // fcvt.s.d $f0,$f0 #double->float
> d = 16.994714324523816
> f = 16.9947147
> 
> //coalesce.cpp:379
> // fcvt.d.s $f0,$f0 #float->double
> // fcmp.sle.d $fcc2,$f0,$f1
> (gdb) i r fa0
> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.994714736938477}
> (gdb) i r fa1
> fa1 {f = 0x0, d = 0x10} {f = -7.68722312e-24, d = 16.994714324523816}
> 
> > case 2: >

> //chaitin.cpp:221
> // fcvt.s.d $f0,$f0
> d = 16.996332681816536
> f = 16.9963322
> 
> //coalesce.cpp
> // fcvt.d.s $f0,$f0
> // fcmp.sle.d $fcc2,$f0,$f1
> (gdb) i r fa0
> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.996332168579102}
> (gdb) i r fa1
> fa1 {f = 0x0, d = 0x10} {f = -1.73570044e-14, d = 16.996332681816536}
> 
> > These two cases lead to different block generation (case 2 can insert new SpillCopyNodes), so the resulting crypto.aes score fluctuates. > > This is a patch to fix this problem. Please help review it. > > Thanks, > Sun Guoyun SUN Guoyun has updated the pull request incrementally with one additional commit since the last revision: 8298813: [C2] Converting double to float cause a loss of precision and resulting crypto.aes scores fluctuate ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11685/files - new: https://git.openjdk.org/jdk/pull/11685/files/f017e49c..c2bf1d76 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11685&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11685&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11685.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11685/head:pull/11685 PR: https://git.openjdk.org/jdk/pull/11685 From duke at openjdk.org Mon Dec 19 07:59:56 2022 From: duke at openjdk.org (SUN Guoyun) Date: Mon, 19 Dec 2022 07:59:56 GMT Subject: Integrated: JDK-8298813: [C2] Converting double to float cause a loss of precision and resulting crypto.aes scores fluctuate In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 02:52:51 GMT, SUN Guoyun wrote: > Hi all, > For C2, converting double to float causes a loss of precision: > >

> ./chaitin.cpp:221
> _high_frequency_lrg = MIN2(double(OPTO_LRG_HIGH_FREQ), _cfg.get_outer_loop_frequency());
> 
> > Here, _high_frequency_lrg has type float, so the assignment may lose precision. It is later used here: > >

> ./coalesce.cpp:379
> if( lrg._maxfreq >= _phc.high_frequency_lrg() ) {
>    ...
> }
> 
> Here, lrg._maxfreq has type double, so _high_frequency_lrg is converted back to double. Because _high_frequency_lrg has already lost precision, the condition can come out either true or false. > > I tested two cases with SPECjvm2008 crypto.aes. > case 1: >

> //chaitin.cpp:221
> // fcvt.s.d $f0,$f0 #double->float
> d = 16.994714324523816
> f = 16.9947147
> 
> //coalesce.cpp:379
> // fcvt.d.s $f0,$f0 #float->double
> // fcmp.sle.d $fcc2,$f0,$f1
> (gdb) i r fa0
> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.994714736938477}
> (gdb) i r fa1
> fa1 {f = 0x0, d = 0x10} {f = -7.68722312e-24, d = 16.994714324523816}
> 
> > case 2: >

> //chaitin.cpp:221
> // fcvt.s.d $f0,$f0
> d = 16.996332681816536
> f = 16.9963322
> 
> //coalesce.cpp
> // fcvt.d.s $f0,$f0
> // fcmp.sle.d $fcc2,$f0,$f1
> (gdb) i r fa0
> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.996332168579102}
> (gdb) i r fa1
> fa1 {f = 0x0, d = 0x10} {f = -1.73570044e-14, d = 16.996332681816536}
> 
> > These two cases lead to different block generation (case 2 can insert new SpillCopyNodes), so the resulting crypto.aes score fluctuates. > > This is a patch to fix this problem. Please help review it. > > Thanks, > Sun Guoyun This pull request has now been integrated. Changeset: 36376605 Author: sunguoyun Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/36376605215ba3380bfc07752eec043af04a5c29 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8298813: [C2] Converting double to float cause a loss of precision and resulting crypto.aes scores fluctuate Reviewed-by: thartmann ------------- PR: https://git.openjdk.org/jdk/pull/11685 From duke at openjdk.org Mon Dec 19 08:12:50 2022 From: duke at openjdk.org (Damon Fenacci) Date: Mon, 19 Dec 2022 08:12:50 GMT Subject: RFR: 8265688: Unused ciMethodType::ptype_at should be removed In-Reply-To: References: <94gNzMRxD3krKE-v_RN8PyaZoyXd0bc9213kjxVsxKY=.ff6c73b1-12f8-4a4a-bf34-550f48c9354a@github.com> Message-ID: <_FPm951T6hKw7kwpPKt-8s9suCRNby7naVwxmlK5NIo=.b5ed97b4-c960-41d1-8abd-9b357c2871d6@github.com> On Fri, 16 Dec 2022 10:39:20 GMT, Tobias Hartmann wrote: >> `ciMethodType::ptype_at` method is not used. >> >> Removing it. > > Looks good and trivial. @TobiHartmann @vnkozlov thanks a lot for reviewing it! ------------- PR: https://git.openjdk.org/jdk/pull/11708 From epeter at openjdk.org Mon Dec 19 08:14:01 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 19 Dec 2022 08:14:01 GMT Subject: RFR: 8296412: Special case infinite loops with unmerged backedges in IdealLoopTree::check_safepts [v3] In-Reply-To: <5-8UtfPNSyYrXf274GWDUs34oVWLaH6JEG_38Dr4eUQ=.501fd079-2e90-4ba7-899c-99cf1808201a@github.com> References: <5-8UtfPNSyYrXf274GWDUs34oVWLaH6JEG_38Dr4eUQ=.501fd079-2e90-4ba7-899c-99cf1808201a@github.com> Message-ID: > **Context** > During parsing, we insert SafePoints if we jump from higher to lower bci (`maybe_add_safepoint` is called for every if, goto etc).
> https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/parse.hpp#L490-L494 > Generally, this aligns with backedges: the assumption is that the loop-head sits at the smallest bci of all blocks in the loop. So every jump back to the loop-head goes from higher to lower bci, hence we place a SafePoint just before the jump. > > Also: the first `build_loop_tree` may not attach an infinite loop to the loop-tree. If during the same loop-opts-phase we go to `beautify_loops` and it requires us to rebuild the loop-tree (e.g. because some other loop did `merge_many_backedges`), we call `build_loop_tree` again, and this time around we do detect the infinite loop (it now has a NeverBranch exit, so it is attached because of that). > Afterwards, we call `IdealLoopTree::check_safepts`, which tries to find SafePoints on all backedges. Normally, we have SafePoints on all backedges, just before we go back to the head. > > **Problem case** > My jasm fuzzer produced some infinite loops that have the following form: > The loop head is not at the smallest bci (bytecode index) of all blocks in the loop. So the SafePoints are placed somewhere in the body of the loop, just before an if branches into the two backedges. Because this is an infinite loop, it is only attached to the loop-tree in `build_loop_tree` after `beautify_loops`, so the two backedges were not merged. > When we call `IdealLoopTree::check_safepts`, we start with the inner loop, where we find the SafePoint above the if. Then we go to the outer loop. We don't find a SafePoint before we find the inner body. Now we decide to skip the inner body (which implies skipping the SafePoint in the body). The code assumes after skipping the inner loop, we are still in the outer loop. This is not true, because inner and outer loop have the same loop head (the backedges were not merged). We trigger an assert that checks that we are still in the outer loop (`nested loop`).
> > Why did we not find this earlier? > We have not extensively tested infinite loops before. Also, we have not tested loops with loop-heads that are not at the smallest bci of the loop. However, with my bytecode fuzzer I can find these issues. It is also more likely with irreducible loops: there at least one loop-entry cannot be at the smallest bci. Irreducible loops are not processed by `maybe_add_safepoint`, but once it only has a single entry, it is not irreducible any more, and so it can happen that a loop-entry becomes a loop head that does not have the smallest bci. > > **Solution** > We could fix `maybe_add_safepoint` to not depend on bci, but rather the loop-tree from `ciTypeFlow`. That would be complex, and risky. That is not justified just for infinite loops, and even infinite loops where the loop head is not at the lowest bci. > > I decided to simply special case infinite loops. I detect if we have an outer loop with the same head as an inner loop. This should not happen, as we must have merged those backedges. Except if it is an infinite loop: We can break the scan, as we have already reached the loop's head.
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/loopopts/TestInfiniteLoopWithUnmergedBackedgesMain.java remove redundant empty line Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11706/files - new: https://git.openjdk.org/jdk/pull/11706/files/2a02b338..17d60703 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11706&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11706&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11706.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11706/head:pull/11706 PR: https://git.openjdk.org/jdk/pull/11706 From dchuyko at openjdk.org Mon Dec 19 08:24:53 2022 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Mon, 19 Dec 2022 08:24:53 GMT Subject: RFR: JDK-8153837: AArch64: Handle special cases for MaxINode & MinINode [v2] In-Reply-To: References: Message-ID: <-Q_7yKRsu0Ei2db8zQSNj2pPEtjEgaKrx6W4bDzd1KM=.1804e645-4cee-4e41-9f0b-5b43a47a178f@github.com> On Mon, 19 Dec 2022 06:58:18 GMT, Xiaohong Gong wrote: > It sounds more sense to me if the ideal can make sure the constant input is the right child like what you did in the last commit. So that we only need the first rule like other commutative ops. To achieve that we will have to change max(max[...]) and max(add[...]) optimizations. For example the latter likely will have to look into both inputs. I.e. if we order Max to the left and Con to the right, Add can appear at any side etc. - any fixed order for 2 of 3 means not guaranteed order for the third one. 
------------- PR: https://git.openjdk.org/jdk/pull/11570 From duke at openjdk.org Mon Dec 19 08:36:57 2022 From: duke at openjdk.org (Damon Fenacci) Date: Mon, 19 Dec 2022 08:36:57 GMT Subject: RFR: 8297801: printnm crashes with invalid address due to null pointer dereference [v2] In-Reply-To: References: Message-ID: > `printnm` crashes if you pass an invalid address. This is caused by a null pointer dereference. > Checking for NULL reference before checking if blob is a method. Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: JDK-8297801: print message when address invalid ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11697/files - new: https://git.openjdk.org/jdk/pull/11697/files/e6c06762..7a93cadd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11697&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11697&range=00-01 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11697.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11697/head:pull/11697 PR: https://git.openjdk.org/jdk/pull/11697 From duke at openjdk.org Mon Dec 19 08:36:58 2022 From: duke at openjdk.org (Damon Fenacci) Date: Mon, 19 Dec 2022 08:36:58 GMT Subject: RFR: 8297801: printnm crashes with invalid address due to null pointer dereference [v2] In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 08:31:10 GMT, Tobias Hartmann wrote: >> Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: >> >> JDK-8297801: print message when address invalid > > Looks good to me. @TobiHartmann @chhagedorn @vnkozlov thanks for the reviews. I added the "Invalid address" print statement. 
------------- PR: https://git.openjdk.org/jdk/pull/11697 From duke at openjdk.org Mon Dec 19 08:37:55 2022 From: duke at openjdk.org (Damon Fenacci) Date: Mon, 19 Dec 2022 08:37:55 GMT Subject: Integrated: 8265688: Unused ciMethodType::ptype_at should be removed In-Reply-To: <94gNzMRxD3krKE-v_RN8PyaZoyXd0bc9213kjxVsxKY=.ff6c73b1-12f8-4a4a-bf34-550f48c9354a@github.com> References: <94gNzMRxD3krKE-v_RN8PyaZoyXd0bc9213kjxVsxKY=.ff6c73b1-12f8-4a4a-bf34-550f48c9354a@github.com> Message-ID: On Fri, 16 Dec 2022 10:28:09 GMT, Damon Fenacci wrote: > `ciMethodType::ptype_at` method is not used. > > Removing it. This pull request has now been integrated. Changeset: 16225630 Author: Damon Fenacci Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/16225630ec3d4943e359f7a8b0f531429bb434c8 Stats: 12 lines in 2 files changed: 0 ins; 9 del; 3 mod 8265688: Unused ciMethodType::ptype_at should be removed Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.org/jdk/pull/11708 From duke at openjdk.org Mon Dec 19 08:42:48 2022 From: duke at openjdk.org (Damon Fenacci) Date: Mon, 19 Dec 2022 08:42:48 GMT Subject: RFR: 8298736: Revisit usages of log10 in compiler code In-Reply-To: <6jZjyPnQWx5MezVvn8n4pIVyrR9_9vqV3iOcXxUVvKM=.261f4514-b2a4-41cf-a99c-98009436e560@github.com> References: <6jZjyPnQWx5MezVvn8n4pIVyrR9_9vqV3iOcXxUVvKM=.261f4514-b2a4-41cf-a99c-98009436e560@github.com> Message-ID: On Thu, 15 Dec 2022 13:14:57 GMT, Tobias Hartmann wrote: >> The use of the Math library `log10` function causes an overloading ambiguity error on SPARC when using it with integer typed parameters. >> >> * adding a `static_cast` to the parameter >> * using the lib `log10` function (with `static_cast`s) instead of a custom one in `src/hotspot/share/opto/node.cpp` > > Looks good to me too. @TobiHartmann @chhagedorn @eme64 thanks for your reviews! 
------------- PR: https://git.openjdk.org/jdk/pull/11686 From duke at openjdk.org Mon Dec 19 08:47:54 2022 From: duke at openjdk.org (Damon Fenacci) Date: Mon, 19 Dec 2022 08:47:54 GMT Subject: RFR: 8295661: CompileTask::compile_id() should be passed as int In-Reply-To: References: Message-ID: <207bzPTXcKPuZjIifJzMdApzShAEHTKP6OH9qS3YCbQ=.8b1f402a-4dc6-42c8-8483-9860c102c9cf@github.com> On Tue, 13 Dec 2022 16:57:29 GMT, Tobias Hartmann wrote: >> Changed return type of `CompileTask::compile_id()` from `int` to `uint`. >> Also modified the type everywhere `compile_id` is used (parameters, local variables, string formatting, constant assignments) *as much as possible*. >> Added *asserts* to check for valid value range where not possible. > > Thanks Tom. The intention of this change was mainly consistency, not avoiding a potential overflow. @TobiHartmann @tkrodriguez @dougxc thanks for your reviews. ------------- PR: https://git.openjdk.org/jdk/pull/11630 From duke at openjdk.org Mon Dec 19 08:49:58 2022 From: duke at openjdk.org (Damon Fenacci) Date: Mon, 19 Dec 2022 08:49:58 GMT Subject: Integrated: 8298736: Revisit usages of log10 in compiler code In-Reply-To: References: Message-ID: <_Gx-dTaKa9o0ubZiys_VMptNlNWpjSsLP8s7YTLCO_g=.26cbe8eb-091c-4454-9905-f7d001f61a57@github.com> On Thu, 15 Dec 2022 07:54:18 GMT, Damon Fenacci wrote: > The use of the Math library `log10` function causes an overloading ambiguity error on SPARC when using it with integer typed parameters. > > * adding a `static_cast` to the parameter > * using the lib `log10` function (with `static_cast`s) instead of a custom one in `src/hotspot/share/opto/node.cpp` This pull request has now been integrated. 
Changeset: ec959914 Author: Damon Fenacci Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/ec95991470a99c917f757614fc6d2cd883bdb39b Stats: 14 lines in 2 files changed: 0 ins; 11 del; 3 mod 8298736: Revisit usages of log10 in compiler code Reviewed-by: thartmann, chagedorn, epeter ------------- PR: https://git.openjdk.org/jdk/pull/11686 From thartmann at openjdk.org Mon Dec 19 08:52:50 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 19 Dec 2022 08:52:50 GMT Subject: RFR: 8297801: printnm crashes with invalid address due to null pointer dereference [v2] In-Reply-To: References: Message-ID: On Mon, 19 Dec 2022 08:36:57 GMT, Damon Fenacci wrote: >> `printnm` crashes if you pass an invalid address. This is caused by a null pointer dereference. >> Checking for NULL reference before checking if blob is a method. > > Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: > > JDK-8297801: print message when address invalid Still looks good. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11697 From duke at openjdk.org Mon Dec 19 08:53:57 2022 From: duke at openjdk.org (Damon Fenacci) Date: Mon, 19 Dec 2022 08:53:57 GMT Subject: Integrated: 8295661: CompileTask::compile_id() should be passed as int In-Reply-To: References: Message-ID: On Mon, 12 Dec 2022 11:29:55 GMT, Damon Fenacci wrote: > Changed return type of `CompileTask::compile_id()` from `int` to `uint`. > Also modified the type everywhere `compile_id` is used (parameters, local variables, string formatting, constant assignments) *as much as possible*. > Added *asserts* to check for valid value range where not possible. This pull request has now been integrated. 
Changeset: 8e49fcdd Author: Damon Fenacci Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/8e49fcdde4fef5a8db36823b35d409ba2c9ec47b Stats: 47 lines in 9 files changed: 0 ins; 2 del; 45 mod 8295661: CompileTask::compile_id() should be passed as int Reviewed-by: thartmann, dnsimon, never ------------- PR: https://git.openjdk.org/jdk/pull/11630 From kbarrett at openjdk.org Mon Dec 19 09:06:44 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Mon, 19 Dec 2022 09:06:44 GMT Subject: RFR: 8298913: Add override qualifiers to Relocation classes Message-ID: <_ylCgIqLjNGlA1r5JKQDqBGGM2IXjkSMjGBCFIgTRrs=.676f706b-499e-450b-a593-5b6109db1b2e@github.com> Please review this change to the Relocation classes, adding `override` qualifiers to all overriding virtual function declarations. Testing: mach5 tier1 Note that building with clang (on macosx) subjected these changes to clang's `-Winconsistent-missing-override` option. Note that there are a lot of member functions missing `const` qualifiers in the Relocation hierarchy. I may look at improving const-correctness later. After doing a little exploration down that path, it looks like it might be a bit messy. 
------------- Commit messages: - add override qualifiers Changes: https://git.openjdk.org/jdk/pull/11716/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11716&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298913 Stats: 58 lines in 1 file changed: 0 ins; 0 del; 58 mod Patch: https://git.openjdk.org/jdk/pull/11716.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11716/head:pull/11716 PR: https://git.openjdk.org/jdk/pull/11716 From chagedorn at openjdk.org Mon Dec 19 10:03:51 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 19 Dec 2022 10:03:51 GMT Subject: RFR: 8297801: printnm crashes with invalid address due to null pointer dereference [v2] In-Reply-To: References: Message-ID: On Mon, 19 Dec 2022 08:36:57 GMT, Damon Fenacci wrote: >> `printnm` crashes if you pass an invalid address. This is caused by a null pointer dereference. >> Checking for NULL reference before checking if blob is a method. > > Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: > > JDK-8297801: print message when address invalid Thanks for adding the message, looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11697 From mbaesken at openjdk.org Mon Dec 19 12:19:44 2022 From: mbaesken at openjdk.org (Matthias Baesken) Date: Mon, 19 Dec 2022 12:19:44 GMT Subject: RFR: JDK-8299022: Linux ppc64le build issues after JDK-8160404 Message-ID: Looks like [JDK-8160404](https://bugs.openjdk.org/browse/JDK-8160404) caused issues in the Linux ppc64le build. 
We now run into ------------- Commit messages: - JDK-8299022 Changes: https://git.openjdk.org/jdk/pull/11719/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11719&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8299022 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11719.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11719/head:pull/11719 PR: https://git.openjdk.org/jdk/pull/11719 From epeter at openjdk.org Mon Dec 19 12:25:00 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 19 Dec 2022 12:25:00 GMT Subject: RFR: 8296412: Special case infinite loops with unmerged backedges in IdealLoopTree::check_safepts [v2] In-Reply-To: <-t0PJRJEikyPjojYJ10OcgjhJVDKIqEsFuCS5yQ4u2k=.4c4e4895-570a-44c0-b471-a447478bc262@github.com> References: <5-8UtfPNSyYrXf274GWDUs34oVWLaH6JEG_38Dr4eUQ=.501fd079-2e90-4ba7-899c-99cf1808201a@github.com> <-t0PJRJEikyPjojYJ10OcgjhJVDKIqEsFuCS5yQ4u2k=.4c4e4895-570a-44c0-b471-a447478bc262@github.com> Message-ID: On Mon, 19 Dec 2022 06:19:52 GMT, Tobias Hartmann wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Christian's review suggestion > > Looks good to me. Thanks @TobiHartmann @chhagedorn @vnkozlov for the discussions and reviews! 
------------- PR: https://git.openjdk.org/jdk/pull/11706 From epeter at openjdk.org Mon Dec 19 12:25:03 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 19 Dec 2022 12:25:03 GMT Subject: Integrated: 8296412: Special case infinite loops with unmerged backedges in IdealLoopTree::check_safepts In-Reply-To: <5-8UtfPNSyYrXf274GWDUs34oVWLaH6JEG_38Dr4eUQ=.501fd079-2e90-4ba7-899c-99cf1808201a@github.com> References: <5-8UtfPNSyYrXf274GWDUs34oVWLaH6JEG_38Dr4eUQ=.501fd079-2e90-4ba7-899c-99cf1808201a@github.com> Message-ID: On Fri, 16 Dec 2022 09:57:35 GMT, Emanuel Peter wrote: > **Context** > During parsing, we insert SafePoints if we jump from higher to lower bci (`maybe_add_safepoint` is called for every if, goto etc). > https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/parse.hpp#L490-L494 > Generally, this aligns with backedges: the assumption is that the loop-head sits at the smallest bci of all blocks in the loop. So every jump back to the loop-head goes from higher to lower bci, hence we place a SafePoint just before the jump. > > Also: the first `build_loop_tree` may not attach an infinite loop to the loop-tree. If during the same loop-opts-phase we go to `beautify_loops` and it requires us to rebuild the loop-tree (e.g. because some other loop did `merge_many_backedges`), we call `build_loop_tree` again, and this time around we do detect the infinite loop (it now has a NeverBranch exit, so it is attached because of that). > Afterwards, we call `IdealLoopTree::check_safepts`, which tries to find SafePoints on all backedges. Normally, we have SafePoints on all backedges, just before we go back to the head. > > **Problem case** > My jasm fuzzer produced some infinite loops that have the following form: > The loop head is not at the smallest bci (bytecode index) of all blocks in the loop. So the SafePoints are placed somewhere in the body of the loop, just before an if branches into the two backedges.
Because this is an infinite loop, it is only attached to the loop-tree in `build_loop_tree` after `beautify_loops`, so the two backedges were not merged. > When we call `IdealLoopTree::check_safepts`, we start with the inner loop, where we find the SafePoint above the if. Then we go to the outer loop. We don't find a SafePoint before we find the inner body. Now we decide to skip the inner body (which implies skipping the SafePoint in the body). The code assumes after skipping the inner loop, we are still in the outer loop. This is not true, because inner and outer loop have the same loop head (the backedges were not merged). We trigger an assert that checks that we are still in the outer loop (`nested loop`). > > Why did we not find this earlier? > We have not extensively tested infinite loops before. Also, we have not tested loops with loop-heads that are not at the smallest bci of the loop. However, with my bytecode fuzzer I can find these issues. It is also more likely with irreducible loops: there at least one loop-entry cannot be at the smallest bci. Irreducible loops are not processed by `maybe_add_safepoint`, but once it only has a single entry, it is not irreducible any more, and so it can happen that a loop-entry becomes a loop head that does not have the smallest bci. > > **Solution** > We could fix `maybe_add_safepoint` to not depend on bci, but rather the loop-tree from `ciTypeFlow`. That would be complex, and risky. That is not justified just for infinite loops, and even infinite loops where the loop head is not at the lowest bci. > > I decided to simply special case infinite loops. I detect if we have an outer loop with the same head as an inner loop. This should not happen, as we must have merged those backedges. Except if it is an infinite loop: We can break the scan, as we have already reached the loop's head. This pull request has now been integrated.
Changeset: da38d43f Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/da38d43fcc640ea9852db6c7c23817dcef7080d5 Stats: 252 lines in 3 files changed: 252 ins; 0 del; 0 mod 8296412: Special case infinite loops with unmerged backedges in IdealLoopTree::check_safepts Reviewed-by: chagedorn, kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/11706 From mbaesken at openjdk.org Mon Dec 19 13:39:07 2022 From: mbaesken at openjdk.org (Matthias Baesken) Date: Mon, 19 Dec 2022 13:39:07 GMT Subject: RFR: JDK-8299022: Linux ppc64le build issues after JDK-8160404 [v2] In-Reply-To: References: Message-ID: > Looks like [JDK-8160404](https://bugs.openjdk.org/browse/JDK-8160404) caused issues in the Linux ppc64le build. > We now run into Matthias Baesken has updated the pull request incrementally with one additional commit since the last revision: linux s390x seems to have the same issues ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11719/files - new: https://git.openjdk.org/jdk/pull/11719/files/9ea4a067..065c1081 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11719&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11719&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11719.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11719/head:pull/11719 PR: https://git.openjdk.org/jdk/pull/11719 From kvn at openjdk.org Mon Dec 19 15:27:51 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 19 Dec 2022 15:27:51 GMT Subject: RFR: 8297801: printnm crashes with invalid address due to null pointer dereference [v2] In-Reply-To: References: Message-ID: On Mon, 19 Dec 2022 08:36:57 GMT, Damon Fenacci wrote: >> `printnm` crashes if you pass an invalid address. This is caused by a null pointer dereference. >> Checking for NULL reference before checking if blob is a method. 
> > Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: > > JDK-8297801: print message when address invalid Good ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11697 From mdoerr at openjdk.org Mon Dec 19 15:37:51 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Mon, 19 Dec 2022 15:37:51 GMT Subject: RFR: JDK-8299022: Linux ppc64le and s390x build issues after JDK-8160404 [v2] In-Reply-To: References: Message-ID: On Mon, 19 Dec 2022 13:39:07 GMT, Matthias Baesken wrote: >> Looks like [JDK-8160404](https://bugs.openjdk.org/browse/JDK-8160404) caused issues in the Linux ppc64le build. >> We now run into > > Matthias Baesken has updated the pull request incrementally with one additional commit since the last revision: > > linux s390x seems to have the same issues LGTM. Thanks for fixing! ------------- Marked as reviewed by mdoerr (Reviewer). PR: https://git.openjdk.org/jdk/pull/11719 From duke at openjdk.org Mon Dec 19 15:41:51 2022 From: duke at openjdk.org (Damon Fenacci) Date: Mon, 19 Dec 2022 15:41:51 GMT Subject: RFR: 8297801: printnm crashes with invalid address due to null pointer dereference [v2] In-Reply-To: References: Message-ID: On Mon, 19 Dec 2022 08:49:50 GMT, Tobias Hartmann wrote: >> Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: >> >> JDK-8297801: print message when address invalid > > Still looks good. @TobiHartmann @chhagedorn @vnkozlov thanks again for the reviews. ------------- PR: https://git.openjdk.org/jdk/pull/11697 From duke at openjdk.org Mon Dec 19 15:47:53 2022 From: duke at openjdk.org (Damon Fenacci) Date: Mon, 19 Dec 2022 15:47:53 GMT Subject: Integrated: 8297801: printnm crashes with invalid address due to null pointer dereference In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 16:10:07 GMT, Damon Fenacci wrote: > `printnm` crashes if you pass an invalid address. 
This is caused by a null pointer dereference. > Checking for NULL reference before checking if blob is a method. This pull request has now been integrated. Changeset: de0ce792 Author: Damon Fenacci Committer: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/de0ce792c1865f80b6bcfce6741681cb74d75cef Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod 8297801: printnm crashes with invalid address due to null pointer dereference Reviewed-by: thartmann, chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/11697 From lucy at openjdk.org Mon Dec 19 16:05:50 2022 From: lucy at openjdk.org (Lutz Schmidt) Date: Mon, 19 Dec 2022 16:05:50 GMT Subject: RFR: JDK-8299022: Linux ppc64le and s390x build issues after JDK-8160404 [v2] In-Reply-To: References: Message-ID: On Mon, 19 Dec 2022 13:39:07 GMT, Matthias Baesken wrote: >> Looks like [JDK-8160404](https://bugs.openjdk.org/browse/JDK-8160404) caused issues in the Linux ppc64le build. >> We now run into > > Matthias Baesken has updated the pull request incrementally with one additional commit since the last revision: > > linux s390x seems to have the same issues LGTM. Verified on s390x. Thanks for fixing. ------------- Marked as reviewed by lucy (Reviewer). PR: https://git.openjdk.org/jdk/pull/11719 From mbaesken at openjdk.org Mon Dec 19 16:21:00 2022 From: mbaesken at openjdk.org (Matthias Baesken) Date: Mon, 19 Dec 2022 16:21:00 GMT Subject: RFR: JDK-8299022: Linux ppc64le and s390x build issues after JDK-8160404 [v2] In-Reply-To: References: Message-ID: <2qtrqEHSXdzHDuaBqXpGQSDqYs_Z67LWBJ40qJ6_52w=.99397b85-fd64-4d95-8050-8874ca40433e@github.com> On Mon, 19 Dec 2022 13:39:07 GMT, Matthias Baesken wrote: >> Looks like [JDK-8160404](https://bugs.openjdk.org/browse/JDK-8160404) caused issues in the Linux ppc64le build. 
>> We now run into > > Matthias Baesken has updated the pull request incrementally with one additional commit since the last revision: > > linux s390x seems to have the same issues Hi Lutz and Martin, thanks for the reviews ! ------------- PR: https://git.openjdk.org/jdk/pull/11719 From mdoerr at openjdk.org Mon Dec 19 16:23:30 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Mon, 19 Dec 2022 16:23:30 GMT Subject: [jdk20] RFR: 8298947: compiler/codecache/MHIntrinsicAllocFailureTest.java fails intermittently Message-ID: The test doesn't work when using -XX:TieredStopAtLevel to switch off C2 compilation. Don't run the test in this case. ------------- Commit messages: - 8298947: compiler/codecache/MHIntrinsicAllocFailureTest.java fails intermittently Changes: https://git.openjdk.org/jdk20/pull/55/files Webrev: https://webrevs.openjdk.org/?repo=jdk20&pr=55&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298947 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk20/pull/55.diff Fetch: git fetch https://git.openjdk.org/jdk20 pull/55/head:pull/55 PR: https://git.openjdk.org/jdk20/pull/55 From mbaesken at openjdk.org Mon Dec 19 16:23:53 2022 From: mbaesken at openjdk.org (Matthias Baesken) Date: Mon, 19 Dec 2022 16:23:53 GMT Subject: Integrated: JDK-8299022: Linux ppc64le and s390x build issues after JDK-8160404 In-Reply-To: References: Message-ID: <90SQ4IdvP99VN_h0rNVzzKf4zIbM47y_yVEmsbaResM=.ce4da59d-596a-4628-8f62-a90a793b2305@github.com> On Mon, 19 Dec 2022 12:12:00 GMT, Matthias Baesken wrote: > Looks like [JDK-8160404](https://bugs.openjdk.org/browse/JDK-8160404) caused issues in the Linux ppc64le build. > We now run into This pull request has now been integrated. 
Changeset: 756a06d4 Author: Matthias Baesken URL: https://git.openjdk.org/jdk/commit/756a06d4c239966ed68bbbe8ee4c6b6d02154c02 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod 8299022: Linux ppc64le and s390x build issues after JDK-8160404 Reviewed-by: mdoerr, lucy ------------- PR: https://git.openjdk.org/jdk/pull/11719 From kvn at openjdk.org Mon Dec 19 22:22:50 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 19 Dec 2022 22:22:50 GMT Subject: RFR: 8298913: Add override qualifiers to Relocation classes In-Reply-To: <_ylCgIqLjNGlA1r5JKQDqBGGM2IXjkSMjGBCFIgTRrs=.676f706b-499e-450b-a593-5b6109db1b2e@github.com> References: <_ylCgIqLjNGlA1r5JKQDqBGGM2IXjkSMjGBCFIgTRrs=.676f706b-499e-450b-a593-5b6109db1b2e@github.com> Message-ID: <5eUW4hZiiDAXhkXae3eekYY8g7077dGfpUniGLDiVbI=.bb068ebb-b6c5-4553-ae5d-edd3dd84d673@github.com> On Mon, 19 Dec 2022 08:58:24 GMT, Kim Barrett wrote: > Please review this change to the Relocation classes, adding `override` > qualifiers to all overriding virtual function declarations. > > Testing: > mach5 tier1 > > Note that building with clang (on macosx) subjected these changes to clang's > `-Winconsistent-missing-override` option. > > Note that there are a lot of member functions missing `const` qualifiers in > the Relocation hierarchy. I may look at improving const-correctness later. > After doing a little exploration down that path, it looks like it might be a > bit messy. Looks good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11716 From kvn at openjdk.org Mon Dec 19 22:39:48 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 19 Dec 2022 22:39:48 GMT Subject: [jdk20] RFR: 8298947: compiler/codecache/MHIntrinsicAllocFailureTest.java fails intermittently In-Reply-To: References: Message-ID: On Mon, 19 Dec 2022 16:17:34 GMT, Martin Doerr wrote: > The test doesn't work when using -XX:TieredStopAtLevel to switch off C2 compilation. Don't run the test in this case. 
I thought to suggest use `@requires vm.compiler2.enabled` to check if C2 is enabled. But checking for `TieredStopAtLevel` will work for all cases. Approved. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk20/pull/55 From duke at openjdk.org Tue Dec 20 01:36:58 2022 From: duke at openjdk.org (SUN Guoyun) Date: Tue, 20 Dec 2022 01:36:58 GMT Subject: RFR: JDK-8298813: [C2] Converting double to float cause a loss of precision and resulting crypto.aes scores fluctuate In-Reply-To: References: Message-ID: On Mon, 19 Dec 2022 06:45:26 GMT, Tobias Hartmann wrote: >> Hi all, >> For C2, convert double to float cause a loss of precision, >> >>

>> ./chaitin.cpp:221
>> _high_frequency_lrg = MIN2(double(OPTO_LRG_HIGH_FREQ), _cfg.get_outer_loop_frequency());
>> 
>> Here, _high_frequency_lrg has type float, so the assignment may lose precision. It is later used here: >> >> 

>> ./coalesce.cpp:379
>> if( lrg._maxfreq >= _phc.high_frequency_lrg() ) {
>>    ...
>> }
>> 
>> Here, lrg._maxfreq has type double, so _high_frequency_lrg is converted back to double. Because _high_frequency_lrg has already lost precision, the condition may evaluate to either true or false. >> >> I tested two cases with SPECjvm2008 crypto.aes. >> case 1: >> 

>> //chaitin.cpp:221
>> // fcvt.s.d $f0,$f0 #double->float
>> d = 16.994714324523816
>> f = 16.9947147
>> 
>> //coalesce.cpp:379
>> // fcvt.d.s $f0,$f0 #float->double
>> // fcmp.sle.d $fcc2,$f0,$f1
>> (gdb) i r fa0
>> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.994714736938477}
>> (gdb) i r fa1
>> fa1 {f = 0x0, d = 0x10} {f = -7.68722312e-24, d = 16.994714324523816}
>> 
>> >> case2: >>

>> //chaitin.cpp:221
>> // fcvt.s.d $f0,$f0
>> d = 16.996332681816536
>> f = 16.9963322
>> 
>> //coalesce.cpp
>> // fcvt.d.s $f0,$f0
>> // fcmp.sle.d $fcc2,$f0,$f1
>> (gdb) i r fa0
>> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.996332168579102}
>> (gdb) i r fa1
>> fa1 {f = 0x0, d = 0x10} {f = -1.73570044e-14, d = 16.996332681816536}
>> 
>> The above two cases result in different block generation (case 2 can insert new SpillCopyNodes), so the resulting crypto.aes score fluctuates. >> >> This is a patch to fix this problem. Please help review it. >> >> Thanks, >> Sun Guoyun > > Tests passed and performance results look good. I think the other change that you proposed should not be part of this. @TobiHartmann Thank you for your review. I have one more question for you: how did you test SPECjvm2008 performance? Did you take the maximum or the average of multiple test results? ------------- PR: https://git.openjdk.org/jdk/pull/11685 From xgong at openjdk.org Tue Dec 20 02:09:53 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 20 Dec 2022 02:09:53 GMT Subject: RFR: JDK-8153837: AArch64: Handle special cases for MaxINode & MinINode [v2] In-Reply-To: <-Q_7yKRsu0Ei2db8zQSNj2pPEtjEgaKrx6W4bDzd1KM=.1804e645-4cee-4e41-9f0b-5b43a47a178f@github.com> References: <-Q_7yKRsu0Ei2db8zQSNj2pPEtjEgaKrx6W4bDzd1KM=.1804e645-4cee-4e41-9f0b-5b43a47a178f@github.com> Message-ID: <6tGCsf_5We8CRyMKwbY3Q6KT9Bifx2L4DU6vynPgBjY=.ca826521-4825-4f4d-9a4b-54b978b78370@github.com> On Mon, 19 Dec 2022 08:22:20 GMT, Dmitry Chuyko wrote: >> src/hotspot/cpu/aarch64/aarch64.ad line 13856: >> >>> 13854: cmovI_reg_imm0_lt(dst, src, cr); >>> 13855: %} >>> 13856: %} >> >> It sounds more sense to me if the ideal can make sure the constant input is the right child like what you did in the last commit. So that we only need the first rule like other commutative ops. > >> It sounds more sense to me if the ideal can make sure the constant input is the right child like what you did in the last commit. So that we only need the first rule like other commutative ops. > > To achieve that we will have to change max(max[...]) and max(add[...]) optimizations. For example the latter likely will have to look into both inputs. I.e. if we order Max to the left and Con to the right, Add can appear at any side etc.
- any fixed order for 2 of 3 means not guaranteed order for the third one. So how about doing the constant swap at the start of the `ideal`, before all other optimizations (i.e. `max(max[...])` and `max(add[...])`). I think it can also benefit other optimizations. E.g. for this optimization: MaxI1(MaxI2(a, b), c) ==> MaxI1(a, MaxI2(b, c)) If "`b`" is a constant in `MaxI2(a, b)`, and `c` is a constant in `MaxI1(...)`, After the above optimization, `MaxI2(b, c)` can be constant folding into a constant, which is good. But without the constant swap, (i.e. `b` and `c` may not be a constant), the constant folding will be missed for some cases. ------------- PR: https://git.openjdk.org/jdk/pull/11570 From xliu at openjdk.org Tue Dec 20 07:25:47 2022 From: xliu at openjdk.org (Xin Liu) Date: Tue, 20 Dec 2022 07:25:47 GMT Subject: RFR: 8299061: Using lambda to optimize GraphKit::compute_stack_effects() Message-ID: Some bytecodes donot need rsize in GraphKit::compute_stack_effects(). This change defines rsize as a lambda and avoid computation if it's not in use. We utilize CTW($openjdk/test/hotspot/jtreg/testlibrary/ctw/dist) to test this patch. We compile 2 builtin modules: java.base and jdk.compiler(javac). JAVA_OPTIONS="-XX:+CITime -XX:-TieredCompilation" ./ctw.sh modules:java.base JAVA_OPTIONS="-XX:+CITime -XX:-TieredCompilation" ./ctw.sh modules:jdk.compiler For java.base module, the average compilation speed increases from 12011 bytes/s to 12301 bytes/s, or +2.41%. 
For jdk.compiler module, the average compilation speed is almost same(+0.16%) ------------- Commit messages: - 8299061: Using lambda to optimize GraphKit::compute_stack_effects() Changes: https://git.openjdk.org/jdk/pull/11737/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11737&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8299061 Stats: 18 lines in 1 file changed: 8 ins; 3 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/11737.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11737/head:pull/11737 PR: https://git.openjdk.org/jdk/pull/11737 From aboldtch at openjdk.org Tue Dec 20 08:52:48 2022 From: aboldtch at openjdk.org (Axel Boldt-Christmas) Date: Tue, 20 Dec 2022 08:52:48 GMT Subject: RFR: 8298913: Add override qualifiers to Relocation classes In-Reply-To: <_ylCgIqLjNGlA1r5JKQDqBGGM2IXjkSMjGBCFIgTRrs=.676f706b-499e-450b-a593-5b6109db1b2e@github.com> References: <_ylCgIqLjNGlA1r5JKQDqBGGM2IXjkSMjGBCFIgTRrs=.676f706b-499e-450b-a593-5b6109db1b2e@github.com> Message-ID: <89rqQA5-kLnb3QDTsLiaQHt6_FE98xKXR7yHY3JXSCM=.7b623538-219d-4716-b574-77bd34c90455@github.com> On Mon, 19 Dec 2022 08:58:24 GMT, Kim Barrett wrote: > Please review this change to the Relocation classes, adding `override` > qualifiers to all overriding virtual function declarations. > > Testing: > mach5 tier1 > > Note that building with clang (on macosx) subjected these changes to clang's > `-Winconsistent-missing-override` option. > > Note that there are a lot of member functions missing `const` qualifiers in > the Relocation hierarchy. I may look at improving const-correctness later. > After doing a little exploration down that path, it looks like it might be a > bit messy. lgtm. ------------- Marked as reviewed by aboldtch (Committer). 
PR: https://git.openjdk.org/jdk/pull/11716 From chagedorn at openjdk.org Tue Dec 20 09:38:56 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 20 Dec 2022 09:38:56 GMT Subject: [jdk20] RFR: 8298947: compiler/codecache/MHIntrinsicAllocFailureTest.java fails intermittently In-Reply-To: References: Message-ID: On Mon, 19 Dec 2022 16:17:34 GMT, Martin Doerr wrote: > The test doesn't work when using -XX:TieredStopAtLevel to switch off C2 compilation. Don't run the test in this case. Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk20/pull/55 From kbarrett at openjdk.org Tue Dec 20 09:43:49 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 20 Dec 2022 09:43:49 GMT Subject: RFR: 8299061: Using lambda to optimize GraphKit::compute_stack_effects() In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 07:18:05 GMT, Xin Liu wrote: > Some bytecodes donot need rsize in GraphKit::compute_stack_effects(). > This change defines rsize as a lambda and avoid computation if it's not in use. > > We utilize CTW($openjdk/test/hotspot/jtreg/testlibrary/ctw/dist) to test this patch. > We compile 2 builtin modules: java.base and jdk.compiler(javac). > > > JAVA_OPTIONS="-XX:+CITime -XX:-TieredCompilation" ./ctw.sh modules:java.base > JAVA_OPTIONS="-XX:+CITime -XX:-TieredCompilation" ./ctw.sh modules:jdk.compiler > > > For java.base module, the average compilation speed increases from 12011 bytes/s to 12301 bytes/s, or +2.41%. 
> For jdk.compiler module, the average compilation speed is almost same(+0.16%) > > > | java.base | before | after |diff | > |---------------------------|-----------|-----------|-------| > | C2 Compile Time(s) | 333.767 | 325.939 | | > | Parse Time(s) | 123.076 | 120.976 | | > | ratio | 36.87% | 37.12% | | > | throughput(bytes/s)* | 12073.418 | 12368.028 | 2.44% | > | Average compilation speed | 12011 | 12301 | 2.41% | > | Total compiled methods | 58151 | 58145 | | > > > | jdk.compiler | before | after |diff | > |---------------------------|-----------|-----------|-------| > | C2 Compile Time(s) | 66.313 | 66.196 | | > | Parse Time(s) | 21.681 | 21.445 | | > | ratio | 32.69% | 32.40% | | > | throughput(bytes/s) | 14350.804 | 14399.76 | 0.34% | > | Average compilation speed | 13255 | 13276 | 0.16% | > | Total compiled methods | 13729 | 13733 | | > > > *Throughtput is reported from Tier4 row. It's very close to 'Average compilation speed' but not exactly same. eg. here is from java.base module. > > > before: > Tier4 {speed: 12073.418 bytes/s; standard: 332.362 s, 4004535 bytes, 58121 methods; osr: 0.327 s, 12154 bytes, 30 methods; nmethods_size: 52104768 bytes; nmethods_code_size: 32117912 bytes} > after: > Tier4 {speed: 12368.028 bytes/s; standard: 324.500 s, 4004111 bytes, 58113 methods; osr: 0.344 s, 13565 bytes, 32 methods; nmethods_size: 52110000 bytes; nmethods_code_size: 32118984 bytes} Changes requested by kbarrett (Reviewer). src/hotspot/share/opto/graphKit.cpp line 1027: > 1025: > 1026: auto rsize = [&]() { > 1027: BasicType rtype = T_ILLEGAL; Rather than initializing `rtype` to a dummy value and then almost immediately assigning `rtype`, just initialize `rtype` properly to begin with. src/hotspot/share/opto/graphKit.cpp line 1032: > 1030: rtype = Bytecodes::result_type(code); // checkcast=P, athrow=V > 1031: if (rtype < T_CONFLICT) > 1032: sz = type2size[rtype]; [pre-existing] Missing braces around the consequent. 
Maybe write this instead as if (rtype < T_CONFLICT) { return type2size[rtype]; } else { return 0; } avoiding the introduction of the `sz` variable. ------------- PR: https://git.openjdk.org/jdk/pull/11737 From mdoerr at openjdk.org Tue Dec 20 10:02:47 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 20 Dec 2022 10:02:47 GMT Subject: [jdk20] RFR: 8298947: compiler/codecache/MHIntrinsicAllocFailureTest.java fails intermittently In-Reply-To: References: Message-ID: On Mon, 19 Dec 2022 16:17:34 GMT, Martin Doerr wrote: > The test doesn't work when using -XX:TieredStopAtLevel to switch off C2 compilation. Don't run the test in this case. Thanks for the reviews! ------------- PR: https://git.openjdk.org/jdk20/pull/55 From dchuyko at openjdk.org Tue Dec 20 11:25:49 2022 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Tue, 20 Dec 2022 11:25:49 GMT Subject: RFR: JDK-8153837: AArch64: Handle special cases for MaxINode & MinINode [v2] In-Reply-To: <6tGCsf_5We8CRyMKwbY3Q6KT9Bifx2L4DU6vynPgBjY=.ca826521-4825-4f4d-9a4b-54b978b78370@github.com> References: <-Q_7yKRsu0Ei2db8zQSNj2pPEtjEgaKrx6W4bDzd1KM=.1804e645-4cee-4e41-9f0b-5b43a47a178f@github.com> <6tGCsf_5We8CRyMKwbY3Q6KT9Bifx2L4DU6vynPgBjY=.ca826521-4825-4f4d-9a4b-54b978b78370@github.com> Message-ID: On Tue, 20 Dec 2022 02:07:23 GMT, Xiaohong Gong wrote: >>> It sounds more sense to me if the ideal can make sure the constant input is the right child like what you did in the last commit. So that we only need the first rule like other commutative ops. >> >> To achieve that we will have to change max(max[...]) and max(add[...]) optimizations. For example the latter likely will have to look into both inputs. I.e. if we order Max to the left and Con to the right, Add can appear at any side etc. - any fixed order for 2 of 3 means not guaranteed order for the third one. > > So how about doing the constant swap at the start of the `ideal`, before all other optimizations (i.e. `max(max[...])` and `max(add[...])`). 
I think it can also benefit other optimizations. E.g. for this optimization: > > MaxI1(MaxI2(a, b), c) ==> MaxI1(a, MaxI2(b, c)) > > If "`b`" is a constant in `MaxI2(a, b)`, and `c` is a constant in `MaxI1(...)`, after the above optimization, `MaxI2(b, c)` can be constant-folded into a constant, which is good. But without the constant swap, (i.e. `b` and `c` may not be a constant), the constant folding will be missed for some cases. Other optimizations up in the tree look into already finished nodes. As I said above we can't always order all of three Min/Max, Add, Con so they will have to duplicate search patterns. I would prefer to order Con to the right and maybe Min/Max to the left as well, but that feels like a separate improvement, after that reverted rules can be easily excluded from generated code. ------------- PR: https://git.openjdk.org/jdk/pull/11570 From bulasevich at openjdk.org Tue Dec 20 13:09:59 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Tue, 20 Dec 2022 13:09:59 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section In-Reply-To: References: Message-ID: <25w5VgZQna7ak8j2UUy4Bx-1xgcydRsPU-JCw0j7esk=.ce414c18-8c12-496e-a8fd-8df781b39e19@github.com> On Mon, 3 Oct 2022 20:46:07 GMT, Dean Long wrote: >>> What is the performance impact of making several of the methods virtual? >> >> Good question! My experiments show that in the worst case, the performance of the debug write thread is reduced from 424 to 113 MB/s with virtual functions. Compared to compile time, this is negligible: compilation takes 1000ms per method, while generation of 300 bytes of scopes data with virtual functions (worst case) takes 3ms. And I do not see any regression with benchmarks. > >> > What is the performance impact of making several of the methods virtual? >> >> Good question! My experiments show that in the worst case, the performance of the debug write thread is reduced from 424 to 113 MB/s with virtual functions.
Compared to compile time, this is negligible: compilation takes 1000ms per method, while generation of 300 bytes of scopes data with virtual functions (worst case) takes 3ms. And I do not see any regression with benchmarks. > > I was wondering more about read performance. I would expect that the debuginfo could be read many more times than it is written. Also, from 424 to 113 seems like a very large slowdown. @dean-long @vnkozlov I eliminated virtual methods, changed the implementation internals and added tests. Let me once again ask you to review. Thanks ------------- PR: https://git.openjdk.org/jdk/pull/10025 From roland at openjdk.org Tue Dec 20 18:54:54 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 20 Dec 2022 18:54:54 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears Message-ID: As described in https://github.com/openjdk/jdk20/pull/22, the bug is caused by the iv phi of a post loop that becomes top, but because the post loop is guarded by an opaque node, the control flow remains alive. The fix I propose is based on this comment Vladimir made: https://github.com/openjdk/jdk20/pull/22#issuecomment-1349570615 When CmpINode::Value() encounters a (CmpI (OpaqueZeroTripGuard in1) in2), it runs Value() on (CmpI in1 in2) and if it constant folds so that the loop is not taken, returns that result. Translating "loop not taken" into an actual CmpI type depends on whether the loop goes up or down. To make the check above possible, OpaqueZeroTripGuard includes the BoolTest::mask that causes the loop to be executed at the zero trip guard. The new logic in CmpINode::Value() is executed for both the main and post loop zero trip guards (while the bug was only seen AFAIK with the post loop) because I expect the same bug to exist with the main loop.
For the main loop, this works because initially the loop should be executed and, as optimizations proceed and adjust the zero trip guard, the range of iterations executed in the loop should narrow (and never widen). We may then end up with no iterations executed in the loop. No further optimizations would make the main loop executable again. It's then fine to fold the zero trip guard as we're done with optimizations.

This works for the post loop because the compiler has no way to tell whether it's executed or not as long as there's a main loop: the zero trip guard then takes as input a phi that merges the pre and main loop ivs. For the case of a loop going up, the zero trip guard should initially test whether [init, limit] (the type of phi) is strictly less than limit. The compiler can't decide what the result of that test is. As optimizations proceed, the [init, limit] range could become narrower, as I understand it, and there's no risk that the compiler reports the post loop as not taken.

I still believe it's risky to simply drop the OpaqueZeroTripGuard for the post loop even if it can't constant fold, at least because we wouldn't want the zero trip guard to split thru phi.
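
For readers less familiar with the terminology, the pre/main/post loop shape and the zero trip guards discussed above can be pictured with a hand-written Java analogue. This is only a sketch of the general shape, not C2's actual output: the class and method names are hypothetical, the one-iteration pre-loop and the unroll factor of 2 are arbitrary choices for illustration, and the opaque nodes themselves have no source-level equivalent.

```java
// Hypothetical illustration only; not code from the patch.
final class ZeroTripGuardSketch {
    // Reference version: an ordinary counted loop.
    static long sumPlain(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    // Hand-split version: pre-loop, main loop (unrolled by 2), post loop.
    static long sumSplit(int[] a) {
        final int limit = a.length;
        long s = 0;
        int i = 0;
        // Pre-loop: here a single guarded iteration.
        if (i < limit) {
            s += a[i];
            i++;
        }
        // Zero trip guard of the main loop: enter only if two iterations fit.
        if (i < limit - 1) {
            do {
                s += a[i];
                s += a[i + 1];
                i += 2;
            } while (i < limit - 1);
        }
        // Zero trip guard of the post loop: i is whatever the pre and main
        // loops left behind, so this test cannot be decided up front.
        if (i < limit) {
            do {
                s += a[i];
                i++;
            } while (i < limit);
        }
        return s;
    }
}
```

In this picture, the value of `i` tested by the post-loop guard plays the role of the phi that merges the pre and main loop ivs.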
------------- Commit messages: - fix & test Changes: https://git.openjdk.org/jdk20/pull/65/files Webrev: https://webrevs.openjdk.org/?repo=jdk20&pr=65&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298176 Stats: 191 lines in 7 files changed: 188 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk20/pull/65.diff Fetch: git fetch https://git.openjdk.org/jdk20 pull/65/head:pull/65 PR: https://git.openjdk.org/jdk20/pull/65 From roland at openjdk.org Tue Dec 20 19:10:50 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 20 Dec 2022 19:10:50 GMT Subject: RFR: 8298848: C2: clone all of (CmpP (LoadKlass (AddP down at split if In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 10:53:37 GMT, Roland Westrelin wrote: > As suggested by Vladimir in: > https://github.com/openjdk/jdk/pull/11666 > > Thus extract one for the fixes as a separate PR. The bug as described > in the above PR is: > > The crash occurs because a` (If (Bool (CmpP (LoadKlass ..))))` > only has a single projection. It lost the other projection because of > a `CheckCastPP` that becomes `top`. Initially the pattern is, in pseudo > code: > > > if (obj.klass == some_class) { > obj = CheckCastPP#1(obj); > } > > > `obj` itself is a `CheckCastPP` that's pinned at a dominating if. That > dominating if goes through split through phi. The `LoadKlass` for the > pseudo code above also has control set to the dominating if being > transformed. This result in: > > > if (phi1 == some_class) { > obj = CheckCastPP#1(phi2); > } > > > with` phi1 = (Phi (LoadKlass obj) (LoadKlass obj))` and phi2 = (Phi obj obj) > with `obj = (CheckCastPP#2 obj')` > > `PhiNode::Ideal()` transforms `phi2` into a new `CheckCastPP`: > `(CheckCastPP#3 obj' obj') `with control set to the region right above > the if in the pseudo code above. There happens to be another > `CheckCastPP` at the same control which casts obj' to a narrower > type. 
So the new `CheckCastPP#3` is replaced by that one (because of > `ConstraintCastNode::dominating_cast()`) and pseudo code becomes: > > > if (phi1 == some_class) { > obj = CheckCastPP#1(CheckCastPP#4(obj')); > } > > > and then: > > > if (phi1 == some_class) { > obj = top; > } > > > because the types of the 2 `CheckCastPP`s conflict. That would be ok if: > > `phi1 == some_class` > > would constant fold. It would if the test was: > > `if (CheckCastPP#4(obj').klass == some_klass) { > ` > but because of split if, the `(CmpP (LoadKlass ..))` and the > `CheckCastPP#1` ended up with 2 different object inputs that then were > transformed differently. The fix I propose is to have split if clone the entire: > > `(Bool (CmpP (LoadKlass (AddP ..))))` > > down the same way `(Bool (CmpP ..))` is cloned down. After split if, the > pseudo code becomes: > > > if (phi.klass == some_class) { > obj = CheckCastPP#1(phi); > } > > > The bug can't occur because the `CheckCastPP` and` (CmpP (LoadKlass ..))` > operate on the same phi input. The change in split_if.cpp implements > that. Anyone else for this one? (#11666 depends on it and I prepared #11673 on top of #11666 to avoid merge conflicts so it indirectly depends on it too). ------------- PR: https://git.openjdk.org/jdk/pull/11689 From xliu at openjdk.org Tue Dec 20 19:55:31 2022 From: xliu at openjdk.org (Xin Liu) Date: Tue, 20 Dec 2022 19:55:31 GMT Subject: RFR: 8299061: Using lambda to optimize GraphKit::compute_stack_effects() [v2] In-Reply-To: References: Message-ID: > Some bytecodes donot need rsize in GraphKit::compute_stack_effects(). > This change defines rsize as a lambda and avoid computation if it's not in use. > > We utilize CTW($openjdk/test/hotspot/jtreg/testlibrary/ctw/dist) to test this patch. > We compile 2 builtin modules: java.base and jdk.compiler(javac). 
> > > JAVA_OPTIONS="-XX:+CITime -XX:-TieredCompilation" ./ctw.sh modules:java.base > JAVA_OPTIONS="-XX:+CITime -XX:-TieredCompilation" ./ctw.sh modules:jdk.compiler > > > For java.base module, the average compilation speed increases from 12011 bytes/s to 12301 bytes/s, or +2.41%. > For jdk.compiler module, the average compilation speed is almost same(+0.16%) > > > | java.base | before | after |diff | > |---------------------------|-----------|-----------|-------| > | C2 Compile Time(s) | 333.767 | 325.939 | | > | Parse Time(s) | 123.076 | 120.976 | | > | ratio | 36.87% | 37.12% | | > | throughput(bytes/s)* | 12073.418 | 12368.028 | 2.44% | > | Average compilation speed | 12011 | 12301 | 2.41% | > | Total compiled methods | 58151 | 58145 | | > > > | jdk.compiler | before | after |diff | > |---------------------------|-----------|-----------|-------| > | C2 Compile Time(s) | 66.313 | 66.196 | | > | Parse Time(s) | 21.681 | 21.445 | | > | ratio | 32.69% | 32.40% | | > | throughput(bytes/s) | 14350.804 | 14399.76 | 0.34% | > | Average compilation speed | 13255 | 13276 | 0.16% | > | Total compiled methods | 13729 | 13733 | | > > > *Throughtput is reported from Tier4 row. It's very close to 'Average compilation speed' but not exactly same. eg. here is from java.base module. > > > before: > Tier4 {speed: 12073.418 bytes/s; standard: 332.362 s, 4004535 bytes, 58121 methods; osr: 0.327 s, 12154 bytes, 30 methods; nmethods_size: 52104768 bytes; nmethods_code_size: 32117912 bytes} > after: > Tier4 {speed: 12368.028 bytes/s; standard: 324.500 s, 4004111 bytes, 58113 methods; osr: 0.344 s, 13565 bytes, 32 methods; nmethods_size: 52110000 bytes; nmethods_code_size: 32118984 bytes} Xin Liu has updated the pull request incrementally with one additional commit since the last revision: further simplify the lambda based on the feedbacks of reviewers. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/11737/files - new: https://git.openjdk.org/jdk/pull/11737/files/77c09f3f..6928d36e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11737&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11737&range=00-01 Stats: 8 lines in 1 file changed: 0 ins; 5 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/11737.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11737/head:pull/11737 PR: https://git.openjdk.org/jdk/pull/11737 From mdoerr at openjdk.org Tue Dec 20 22:06:52 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 20 Dec 2022 22:06:52 GMT Subject: [jdk20] Integrated: 8298947: compiler/codecache/MHIntrinsicAllocFailureTest.java fails intermittently In-Reply-To: References: Message-ID: On Mon, 19 Dec 2022 16:17:34 GMT, Martin Doerr wrote: > The test doesn't work when using -XX:TieredStopAtLevel to switch off C2 compilation. Don't run the test in this case. This pull request has now been integrated. Changeset: 3d4d9fd6 Author: Martin Doerr URL: https://git.openjdk.org/jdk20/commit/3d4d9fd6e6de037950f94482d4e33f178eb15daa Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8298947: compiler/codecache/MHIntrinsicAllocFailureTest.java fails intermittently Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.org/jdk20/pull/55 From xliu at openjdk.org Tue Dec 20 23:33:51 2022 From: xliu at openjdk.org (Xin Liu) Date: Tue, 20 Dec 2022 23:33:51 GMT Subject: RFR: 8299061: Using lambda to optimize GraphKit::compute_stack_effects() [v2] In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 09:35:17 GMT, Kim Barrett wrote: >> Xin Liu has updated the pull request incrementally with one additional commit since the last revision: >> >> further simplify the lambda based on the feedbacks of reviewers. 
> > src/hotspot/share/opto/graphKit.cpp line 1032: > >> 1030: rtype = Bytecodes::result_type(code); // checkcast=P, athrow=V >> 1031: if (rtype < T_CONFLICT) >> 1032: sz = type2size[rtype]; > > [pre-existing] Missing braces around the consequent. Maybe write this instead as > > if (rtype < T_CONFLICT) { > return type2size[rtype]; > } else { > return 0; > } > > avoiding the introduction of the `sz` variable. make sense. I should be an expression. ------------- PR: https://git.openjdk.org/jdk/pull/11737 From kbarrett at openjdk.org Wed Dec 21 00:58:49 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 21 Dec 2022 00:58:49 GMT Subject: RFR: 8299061: Using lambda to optimize GraphKit::compute_stack_effects() [v2] In-Reply-To: References: Message-ID: <90PQahaxmtt780D9D9_xEF8fxBGcawM5vWgWv2kuTGI=.82bd73c3-c52f-4597-a177-6103c777b3fd@github.com> On Tue, 20 Dec 2022 19:55:31 GMT, Xin Liu wrote: >> Some bytecodes donot need rsize in GraphKit::compute_stack_effects(). >> This change defines rsize as a lambda and avoid computation if it's not in use. >> >> We utilize CTW($openjdk/test/hotspot/jtreg/testlibrary/ctw/dist) to test this patch. >> We compile 2 builtin modules: java.base and jdk.compiler(javac). >> >> >> JAVA_OPTIONS="-XX:+CITime -XX:-TieredCompilation" ./ctw.sh modules:java.base >> JAVA_OPTIONS="-XX:+CITime -XX:-TieredCompilation" ./ctw.sh modules:jdk.compiler >> >> For jdk.compiler module, the average compilation speed increases from 13236 bytes/s to 13303 bytes/s, or +0.51%. 
>> For java.base module, the average compilation speed is almost same(+0.3%) >> >> >> | java.base | before | after | diff | >> |---------------------------|-----------|-----------|-------| >> | C2 Compile Time(s) | 341.218 | 339.75 | | >> | Parse Time(s) | 125.403 | 125.034 | | >> | ratio of parse | 36.75% | 36.80% | | >> | throughput(bytes/s) | 11841.628* | 11876.831 | 0.30% | >> | Average compilation speed | 11779 | 11816 | 0.31% | >> | Total compiled methods | 58148 | 58158 | | >> | | | | | >> >> >> | jdk.compiler | before | after | | >> |---------------------------|-----------|-----------|-------| >> | C2 Compile Time(s) | 66.486 | 66.087 | | >> | Parse Time(s) | 21.877 | 21.731 | | >> | ratio of parse | 32.90% | 32.88% | | >> | throughput(bytes/s) | 14342.883 | 14436.494 | 0.65% | >> | Average compilation speed | 13236 | 13303 | 0.51% | >> | Total compiled methods | 13734 | 13730 | | >> | | | | | >> >> >> *Throughtput is reported from Tier4 row. It's very close to 'Average compilation speed' but not exactly same. eg. here is from java.base module. >> >> >> before: >> Tier4 {speed: 11841.628 bytes/s; standard: 339.754 s, 4014210 bytes, 58115 methods; osr: 0.345 s, 13109 bytes, 33 methods; nmethods_size: 52147560 bytes; nmethods_code_size: 32146360 bytes} >> after: >> Tier4 {speed: 11876.831 bytes/s; standard: 338.350 s, 4009949 bytes, 58126 methods; osr: 0.368 s, 12951 bytes, 32 methods; nmethods_size: 52122744 bytes; nmethods_code_size: 32124952 bytes} > > Xin Liu has updated the pull request incrementally with one additional commit since the last revision: > > further simplify the lambda based on the feedbacks of reviewers. Looks good. ------------- Marked as reviewed by kbarrett (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11737 From kbarrett at openjdk.org Wed Dec 21 01:26:16 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 21 Dec 2022 01:26:16 GMT Subject: Integrated: 8298913: Add override qualifiers to Relocation classes In-Reply-To: <_ylCgIqLjNGlA1r5JKQDqBGGM2IXjkSMjGBCFIgTRrs=.676f706b-499e-450b-a593-5b6109db1b2e@github.com> References: <_ylCgIqLjNGlA1r5JKQDqBGGM2IXjkSMjGBCFIgTRrs=.676f706b-499e-450b-a593-5b6109db1b2e@github.com> Message-ID: On Mon, 19 Dec 2022 08:58:24 GMT, Kim Barrett wrote: > Please review this change to the Relocation classes, adding `override` > qualifiers to all overriding virtual function declarations. > > Testing: > mach5 tier1 > > Note that building with clang (on macosx) subjected these changes to clang's > `-Winconsistent-missing-override` option. > > Note that there are a lot of member functions missing `const` qualifiers in > the Relocation hierarchy. I may look at improving const-correctness later. > After doing a little exploration down that path, it looks like it might be a > bit messy. This pull request has now been integrated. Changeset: 396a9bff Author: Kim Barrett URL: https://git.openjdk.org/jdk/commit/396a9bff68cd25331ff88927264eae51c583bf48 Stats: 58 lines in 1 file changed: 0 ins; 0 del; 58 mod 8298913: Add override qualifiers to Relocation classes Reviewed-by: kvn, aboldtch ------------- PR: https://git.openjdk.org/jdk/pull/11716 From kbarrett at openjdk.org Wed Dec 21 01:26:16 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 21 Dec 2022 01:26:16 GMT Subject: RFR: 8298913: Add override qualifiers to Relocation classes [v2] In-Reply-To: <_ylCgIqLjNGlA1r5JKQDqBGGM2IXjkSMjGBCFIgTRrs=.676f706b-499e-450b-a593-5b6109db1b2e@github.com> References: <_ylCgIqLjNGlA1r5JKQDqBGGM2IXjkSMjGBCFIgTRrs=.676f706b-499e-450b-a593-5b6109db1b2e@github.com> Message-ID: > Please review this change to the Relocation classes, adding `override` > qualifiers to all overriding virtual function declarations. 
> > Testing: > mach5 tier1 > > Note that building with clang (on macosx) subjected these changes to clang's > `-Winconsistent-missing-override` option. > > Note that there are a lot of member functions missing `const` qualifiers in > the Relocation hierarchy. I may look at improving const-correctness later. > After doing a little exploration down that path, it looks like it might be a > bit messy. Kim Barrett has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'master' into reloc-override - add override qualifiers ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11716/files - new: https://git.openjdk.org/jdk/pull/11716/files/236514d5..26897139 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11716&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11716&range=00-01 Stats: 3667 lines in 194 files changed: 2118 ins; 618 del; 931 mod Patch: https://git.openjdk.org/jdk/pull/11716.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11716/head:pull/11716 PR: https://git.openjdk.org/jdk/pull/11716 From kbarrett at openjdk.org Wed Dec 21 01:26:16 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 21 Dec 2022 01:26:16 GMT Subject: RFR: 8298913: Add override qualifiers to Relocation classes [v2] In-Reply-To: <5eUW4hZiiDAXhkXae3eekYY8g7077dGfpUniGLDiVbI=.bb068ebb-b6c5-4553-ae5d-edd3dd84d673@github.com> References: <_ylCgIqLjNGlA1r5JKQDqBGGM2IXjkSMjGBCFIgTRrs=.676f706b-499e-450b-a593-5b6109db1b2e@github.com> <5eUW4hZiiDAXhkXae3eekYY8g7077dGfpUniGLDiVbI=.bb068ebb-b6c5-4553-ae5d-edd3dd84d673@github.com> Message-ID: On Mon, 19 Dec 2022 22:20:01 GMT, Vladimir Kozlov wrote: >> Kim Barrett has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: >> >> - Merge branch 'master' into reloc-override >> - add override qualifiers > > Looks good. Thanks for reviews @vnkozlov and @xmas92 . ------------- PR: https://git.openjdk.org/jdk/pull/11716 From haosun at openjdk.org Wed Dec 21 01:26:49 2022 From: haosun at openjdk.org (Hao Sun) Date: Wed, 21 Dec 2022 01:26:49 GMT Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long [v2] In-Reply-To: References: Message-ID: <_c9HdvooTAkAfYKzZlIF1R6UNu5CQKyx3Mw8Xc3E5A0=.7bd54818-7209-4006-849d-36135a18c955@github.com> On Mon, 28 Nov 2022 09:58:31 GMT, Andrew Dinn wrote: >> Hao Sun has updated the pull request incrementally with one additional commit since the last revision: >> >> immIAddSub is always positive >> >> As commented by aph, "immIAddSub" is always positive and we needn't >> check the signedness. >> >> Besides, more "comparing reg with imm" test cases are added. > > Looks good. Would you mind taking another look at the latest commit as suggested by aph? Thanks. @adinn ------------- PR: https://git.openjdk.org/jdk/pull/11383 From xgong at openjdk.org Wed Dec 21 01:55:48 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 21 Dec 2022 01:55:48 GMT Subject: RFR: JDK-8153837: AArch64: Handle special cases for MaxINode & MinINode [v2] In-Reply-To: References: <-Q_7yKRsu0Ei2db8zQSNj2pPEtjEgaKrx6W4bDzd1KM=.1804e645-4cee-4e41-9f0b-5b43a47a178f@github.com> <6tGCsf_5We8CRyMKwbY3Q6KT9Bifx2L4DU6vynPgBjY=.ca826521-4825-4f4d-9a4b-54b978b78370@github.com> Message-ID: On Tue, 20 Dec 2022 11:23:33 GMT, Dmitry Chuyko wrote: >> So how about doing the constant swap at the start of the `ideal`, before all other optimizations (i.e. `max(max[...])` and `max(add[...])`). I think it can also benefit other optimizations. E.g. 
for this optimization: >> >> MaxI1(MaxI2(a, b), c) ==> MaxI1(a, MaxI2(b, c)) >> >> If "`b`" is a constant in `MaxI2(a, b)`, and `c` is a constant in `MaxI1(...)`, after the above optimization, `MaxI2(b, c)` can be constant-folded into a constant, which is good. But without the constant swap, (i.e. `b` and `c` may not be a constant), the constant folding will be missed for some cases. > > Other optimizations up in the tree look into already finished nodes. As I said above we can't always order all of three Min/Max, Add, Con so they will have to duplicate search patterns. I would prefer to order Con to the right and maybe Min/Max to the left as well, but that feels like a separate improvement, after that reverted rules can be easily excluded from generated code. OK, makes sense to me. We can have a separate PR for this improvement. Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/11570

From duke at openjdk.org Wed Dec 21 02:00:35 2022
From: duke at openjdk.org (Yi-Fan Tsai)
Date: Wed, 21 Dec 2022 02:00:35 GMT
Subject: RFR: JDK-8299158: Improve MD5 intrinsic on AArch64
Message-ID:

There are two optimizations to reduce the length of the data path.

1) Replace

    __ eorw(rscratch3, rscratch3, r4);
    __ addw(rscratch3, rscratch3, rscratch1);
    __ addw(rscratch3, rscratch3, rscratch4);

with

    __ eorw(rscratch3, rscratch3, r4);
    __ addw(rscratch4, rscratch4, rscratch1);
    __ addw(rscratch3, rscratch3, rscratch4);

The eorw and the first addw can be computed in parallel.

2) Replace

    __ eorw(rscratch2, r2, r3);
    __ andw(rscratch3, rscratch2, r4);
    __ eorw(rscratch3, rscratch3, r3);

with

    __ andw(rscratch3, r2, r4);
    __ bicw(rscratch4, r3, r4);
    __ orrw(rscratch3, rscratch3, rscratch4);

The transformation is based on the equation `((r2 ^ r3) & r4) ^ r3 == (r2 & r4) | (r3 & ~r4)`, where `bicw` computes the `r3 & ~r4` term. The two subexpressions on the RHS can be computed in parallel.
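
The identity behind the second transformation can also be checked mechanically. The following is a minimal sketch in plain Java, not the stub generator code: the class and method names are hypothetical, the parameters `r2`, `r3`, `r4` only mirror the register names used above, and the complement term is read as `r3 & ~r4` (bitwise NOT), which is what `bicw` computes.

```java
// Hypothetical illustration only; not code from the patch.
final class Md5SelectIdentity {
    // Original form: eorw, andw, eorw (each step depends on the previous one).
    static int viaXor(int r2, int r3, int r4) {
        return ((r2 ^ r3) & r4) ^ r3;
    }

    // Rewritten form: andw and bicw are independent, orrw combines them.
    static int viaAndBic(int r2, int r3, int r4) {
        return (r2 & r4) | (r3 & ~r4);
    }
}
```

Because both forms are purely bitwise, checking the eight single-bit input combinations is enough to establish equality for all 32-bit values.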
Correctness proof

    r2  r3  r4  (r2 ^ r3)  ((r2 ^ r3) & r4)  LHS  (r2 & r4)  (r3 & ~r4)  RHS
    0   0   0   0          0                 0    0          0           0
    0   0   1   0          0                 0    0          0           0
    0   1   0   1          0                 1    0          1           1
    0   1   1   1          1                 0    0          0           0
    1   0   0   1          0                 0    0          0           0
    1   0   1   1          1                 1    1          0           1
    1   1   0   0          0                 1    0          1           1
    1   1   1   0          0                 1    1          0           1

The change has been tested by TestMD5Intrinsics and TestMD5MultiBlockIntrinsics.

The performance was measured on an EC2 m6g instance (Graviton2) and shows an 18-25% improvement.

Baseline

    Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error   Units
    MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  2989.149 ± 54.895  ops/ms
    MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    24.927 ±  0.002  ops/ms
    MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  2433.184 ± 74.616  ops/ms
    MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    24.736 ±  0.002  ops/ms

Optimized

    Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error   Units
    MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  3719.214 ± 23.087  ops/ms
    MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    31.280 ±  0.003  ops/ms
    MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  2874.308 ± 88.455  ops/ms
    MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    31.014 ±  0.060  ops/ms

------------- Commit messages: - transform GG - Reduce the length of data path Changes: https://git.openjdk.org/jdk/pull/11748/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11748&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8299158 Stats: 8 lines in 1 file changed: 1 ins; 1 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/11748.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11748/head:pull/11748 PR: https://git.openjdk.org/jdk/pull/11748
From xlinzheng at openjdk.org Wed Dec 21 06:19:30 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Wed, 21 Dec 2022 06:19:30 GMT Subject: RFR: 8299172: RISC-V: [TESTBUG] Fix stack alignment logic in jvmci RISCV64TestAssembler.java Message-ID:

We observed a failure in JVMCI tests when running `./test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java` with `-ea -esa` turned on.

The failure is at the line [1].

    java.lang.AssertionError
        at jdk.vm.ci.code.test.riscv64.RISCV64TestAssembler.emitGrowStack(RISCV64TestAssembler.java:203)
        at jdk.vm.ci.code.test.riscv64.RISCV64TestAssembler.emitCallPrologue(RISCV64TestAssembler.java:239)
        ...
        ...

The failure output has been attached to the JBS issue link.

In short, the stack should be aligned to `16`, and we can follow the logic in AArch64 [2] and x86_64 [3]. The x86_64 one is part of a recent change.

Tested along with other patches, and the failing test now passes.
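For readers unfamiliar with the alignment requirement, here is a minimal sketch of rounding a stack adjustment up to a multiple of 16. It is an illustration only and is not the code in RISCV64TestAssembler or in the referenced AArch64/x86_64 assemblers; the class and method names are hypothetical.

```java
// Hypothetical helper, not taken from the JVMCI test assemblers: rounds a
// non-negative byte count up to the next multiple of 16.
class StackAlignSketch {
    static final int STACK_ALIGNMENT = 16; // alignment named in the message above

    // Adding (alignment - 1) and masking off the low bits rounds up;
    // this relies on the alignment being a power of two.
    static int alignUp(int size) {
        return (size + STACK_ALIGNMENT - 1) & ~(STACK_ALIGNMENT - 1);
    }

    public static void main(String[] args) {
        System.out.println(alignUp(0));  // prints 0
        System.out.println(alignUp(8));  // prints 16
        System.out.println(alignUp(16)); // prints 16
        System.out.println(alignUp(24)); // prints 32
    }
}
```

Growing the stack only by such rounded amounts keeps the stack pointer 16-byte aligned, provided it was aligned to begin with.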
Thanks, Xiaolin [1] https://github.com/openjdk/jdk/blob/f56285c3613bb127e22f544bd4b461a0584e9d2a/test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/riscv64/RISCV64TestAssembler.java#L193 [2] https://github.com/openjdk/jdk/blob/f56285c3613bb127e22f544bd4b461a0584e9d2a/test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/aarch64/AArch64TestAssembler.java#L273-L285 [3] https://github.com/openjdk/jdk/commit/277f0c24a2e186166bfe70fc93ba79aec10585aa ------------- Commit messages: - Fix compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java Changes: https://git.openjdk.org/jdk/pull/11751/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11751&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8299172 Stats: 4 lines in 1 file changed: 0 ins; 2 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11751.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11751/head:pull/11751 PR: https://git.openjdk.org/jdk/pull/11751 From thartmann at openjdk.org Wed Dec 21 06:50:48 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 21 Dec 2022 06:50:48 GMT Subject: RFR: 8299061: Using lambda to optimize GraphKit::compute_stack_effects() [v2] In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 19:55:31 GMT, Xin Liu wrote: >> Some bytecodes donot need rsize in GraphKit::compute_stack_effects(). >> This change defines rsize as a lambda and avoid computation if it's not in use. >> >> We utilize CTW($openjdk/test/hotspot/jtreg/testlibrary/ctw/dist) to test this patch. >> We compile 2 builtin modules: java.base and jdk.compiler(javac). >> >> >> JAVA_OPTIONS="-XX:+CITime -XX:-TieredCompilation" ./ctw.sh modules:java.base >> JAVA_OPTIONS="-XX:+CITime -XX:-TieredCompilation" ./ctw.sh modules:jdk.compiler >> >> For jdk.compiler module, the average compilation speed increases from 13236 bytes/s to 13303 bytes/s, or +0.51%. 
>> For java.base module, the average compilation speed is almost the same (+0.3%).
>>
>> | java.base                 | before     | after     | diff  |
>> |---------------------------|------------|-----------|-------|
>> | C2 Compile Time(s)        | 341.218    | 339.75    |       |
>> | Parse Time(s)             | 125.403    | 125.034   |       |
>> | ratio of parse            | 36.75%     | 36.80%    |       |
>> | throughput(bytes/s)       | 11841.628* | 11876.831 | 0.30% |
>> | Average compilation speed | 11779      | 11816     | 0.31% |
>> | Total compiled methods    | 58148      | 58158     |       |
>>
>> | jdk.compiler              | before     | after     | diff  |
>> |---------------------------|------------|-----------|-------|
>> | C2 Compile Time(s)        | 66.486     | 66.087    |       |
>> | Parse Time(s)             | 21.877     | 21.731    |       |
>> | ratio of parse            | 32.90%     | 32.88%    |       |
>> | throughput(bytes/s)       | 14342.883  | 14436.494 | 0.65% |
>> | Average compilation speed | 13236      | 13303     | 0.51% |
>> | Total compiled methods    | 13734      | 13730     |       |
>>
>> *Throughput is reported from the Tier4 row. It's very close to 'Average compilation speed' but not exactly the same. E.g. here it is for the java.base module.
>>
>> before:
>> Tier4 {speed: 11841.628 bytes/s; standard: 339.754 s, 4014210 bytes, 58115 methods; osr: 0.345 s, 13109 bytes, 33 methods; nmethods_size: 52147560 bytes; nmethods_code_size: 32146360 bytes}
>> after:
>> Tier4 {speed: 11876.831 bytes/s; standard: 338.350 s, 4009949 bytes, 58126 methods; osr: 0.368 s, 12951 bytes, 32 methods; nmethods_size: 52122744 bytes; nmethods_code_size: 32124952 bytes}
>
> Xin Liu has updated the pull request incrementally with one additional commit since the last revision:
>
> further simplify the lambda based on the feedback of reviewers.

Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer).
PR: https://git.openjdk.org/jdk/pull/11737 From fyang at openjdk.org Wed Dec 21 08:54:50 2022 From: fyang at openjdk.org (Fei Yang) Date: Wed, 21 Dec 2022 08:54:50 GMT Subject: RFR: 8299172: RISC-V: [TESTBUG] Fix stack alignment logic in jvmci RISCV64TestAssembler.java In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 06:13:04 GMT, Xiaolin Zheng wrote: > We observed a failure in JVMCI tests after `-ea -esa` turned on when running `./test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java`. > > Failure at the line [1]. > > > java.lang.AssertionError > at jdk.vm.ci.code.test.riscv64.RISCV64TestAssembler.emitGrowStack(RISCV64TestAssembler.java:203) > at jdk.vm.ci.code.test.riscv64.RISCV64TestAssembler.emitCallPrologue(RISCV64TestAssembler.java:239) > ... > ... > > > The failure output has been attached to the JBS issue link. > > To be short, the stack alignment should align with `16`, and we can align with the logic in AArch64 [2] and x86_64 [3]. The x86_64 one is inside a recent change. > > Tested along with other patches, and the failed test passed. > > Thanks, > Xiaolin > > [1] https://github.com/openjdk/jdk/blob/f56285c3613bb127e22f544bd4b461a0584e9d2a/test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/riscv64/RISCV64TestAssembler.java#L193 > [2] https://github.com/openjdk/jdk/blob/f56285c3613bb127e22f544bd4b461a0584e9d2a/test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/aarch64/AArch64TestAssembler.java#L273-L285 > [3] https://github.com/openjdk/jdk/commit/277f0c24a2e186166bfe70fc93ba79aec10585aa Looks reasonable to me. ------------- Marked as reviewed by fyang (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11751 From adinn at openjdk.org Wed Dec 21 09:28:55 2022 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 21 Dec 2022 09:28:55 GMT Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long [v2] In-Reply-To: References: Message-ID: <6nc9y3UFALGmeJVntUOjeEXWIjA9hCYzRlrBoy29Z24=.eb78b987-21d3-4884-818c-af8901aa5596@github.com> On Mon, 5 Dec 2022 12:06:24 GMT, Hao Sun wrote: >> x86 implemented the intrinsics for compareUnsigned() method in Integer and Long. See JDK-8283726. We add the corresponding AArch64 backend support in this patch. >> >> Note-1: minor style issues are fixed for CmpL3 related rules. >> >> Note-2: Jtreg case TestCompareUnsigned.java is updated to cover the matching rules for "comparing reg with imm" case. >> >> Testing: tier1~3 passed on Linux/AArch64 platform with no new failures. >> >> Following is the performance data for the JMH case: >> >> >> Before After >> Benchmark (size) Mode Cnt Score Error Score Error Units >> Integers.compareUnsignedDirect 500 avgt 5 0.994 ? 0.001 0.872 ? 0.015 us/op >> Integers.compareUnsignedIndirect 500 avgt 5 0.991 ? 0.001 0.833 ? 0.055 us/op >> Longs.compareUnsignedDirect 500 avgt 5 1.052 ? 0.001 0.974 ? 0.057 us/op >> Longs.compareUnsignedIndirect 500 avgt 5 1.053 ? 0.001 0.916 ? 0.038 us/op > > Hao Sun has updated the pull request incrementally with one additional commit since the last revision: > > immIAddSub is always positive > > As commented by aph, "immIAddSub" is always positive and we needn't > check the signedness. > > Besides, more "comparing reg with imm" test cases are added. New changes are fine ------------- Marked as reviewed by adinn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11383 From kbarrett at openjdk.org Wed Dec 21 10:24:30 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 21 Dec 2022 10:24:30 GMT Subject: RFR: 8299191: Unnecessarily global friend functions for relocInfo Message-ID: Please review this small cleanup around the relocInfo class. It declares a couple of global functions as friends, so they have access to private constructors and helper functions. But there is no reason for these functions to be at global scope. It is more natural for them to be static factory functions in relocInfo. Testing: mach5 tier1 ------------- Commit messages: - make friend functions instead be static members Changes: https://git.openjdk.org/jdk/pull/11753/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11753&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8299191 Stats: 23 lines in 3 files changed: 5 ins; 12 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/11753.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11753/head:pull/11753 PR: https://git.openjdk.org/jdk/pull/11753 From chagedorn at openjdk.org Wed Dec 21 11:35:51 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 21 Dec 2022 11:35:51 GMT Subject: RFR: 8299191: Unnecessarily global friend functions for relocInfo In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 10:17:12 GMT, Kim Barrett wrote: > Please review this small cleanup around the relocInfo class. It declares a > couple of global functions as friends, so they have access to private > constructors and helper functions. But there is no reason for these functions > to be at global scope. It is more natural for them to be static factory > functions in relocInfo. > > Testing: > mach5 tier1 Looks good! 
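The design point in the cleanup above (construction helpers living inside the class rather than at global scope) can be sketched as follows. This is only an analogy: the actual change is in C++, where the helpers were global `friend` functions, and Java has no `friend`. The sketch shows the end state, a private constructor reachable only through static factory methods. All names and type codes here are hypothetical and are not HotSpot's.

```java
// Hypothetical sketch; names and type codes are illustrative, not HotSpot's.
final class RelocEntrySketch {
    private static final int TYPE_FILLER = 0; // arbitrary values for this sketch
    private static final int TYPE_PREFIX = 1;

    private final int type;
    private final int dataLen;

    // Private: callers outside the class cannot build arbitrary entries.
    private RelocEntrySketch(int type, int dataLen) {
        this.type = type;
        this.dataLen = dataLen;
    }

    // Static factories are members, so they can use the private constructor
    // without any special access grant.
    static RelocEntrySketch filler() {
        return new RelocEntrySketch(TYPE_FILLER, 0);
    }

    static RelocEntrySketch prefix(int dataLen) {
        if (dataLen < 0) {
            throw new IllegalArgumentException("negative data length");
        }
        return new RelocEntrySketch(TYPE_PREFIX, dataLen);
    }

    boolean isPrefix() {
        return type == TYPE_PREFIX;
    }

    int dataLen() {
        return dataLen;
    }

    public static void main(String[] args) {
        RelocEntrySketch p = RelocEntrySketch.prefix(3);
        System.out.println(p.isPrefix() + " " + p.dataLen()); // prints "true 3"
    }
}
```

Keeping the factories as members scopes their names to the class and removes the need for a separate access declaration.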
src/hotspot/share/code/relocInfo.cpp line 89:

> 87: }
> 88: // cannot compact, so just update the count and return the limit pointer
> 89: (*this) = prefix_info(plen); // write new datalen

Just a minor thing: Is there a specific reason for these additional whitespaces? ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11753
From luhenry at openjdk.org Wed Dec 21 12:45:50 2022 From: luhenry at openjdk.org (Ludovic Henry) Date: Wed, 21 Dec 2022 12:45:50 GMT Subject: RFR: JDK-8299158: Improve MD5 intrinsic on AArch64 In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 01:52:32 GMT, Yi-Fan Tsai wrote:

> There are two optimizations to reduce the length of the data path.
> 1) Replace
>
>     __ eorw(rscratch3, rscratch3, r4);
>     __ addw(rscratch3, rscratch3, rscratch1);
>     __ addw(rscratch3, rscratch3, rscratch4);
>
> with
>
>     __ eorw(rscratch3, rscratch3, r4);
>     __ addw(rscratch4, rscratch4, rscratch1);
>     __ addw(rscratch3, rscratch3, rscratch4);
>
> The eorw and the first addw can be computed in parallel.
>
> 2) Replace
>
>     __ eorw(rscratch2, r2, r3);
>     __ andw(rscratch3, rscratch2, r4);
>     __ eorw(rscratch3, rscratch3, r3);
>
> with
>
>     __ andw(rscratch3, r2, r4);
>     __ bicw(rscratch4, r3, r4);
>     __ orrw(rscratch3, rscratch3, rscratch4);
>
> The transformation is based on the equation `((r2 ^ r3) & r4) ^ r3 == (r2 & r4) | (r3 & ~r4)`.
> The two subexpressions on RHS can be computed in parallel.
>
> Correctness proof
>
>     r2  r3  r4  (r2 ^ r3)  ((r2 ^ r3) & r4)  LHS  (r2 & r4)  (r3 & ~r4)  RHS
>     0   0   0   0          0                 0    0          0           0
>     0   0   1   0          0                 0    0          0           0
>     0   1   0   1          0                 1    0          1           1
>     0   1   1   1          1                 0    0          0           0
>     1   0   0   1          0                 0    0          0           0
>     1   0   1   1          1                 1    1          0           1
>     1   1   0   0          0                 1    0          1           1
>     1   1   1   0          0                 1    1          0           1
>
> The change has been tested by TestMD5Intrinsics and TestMD5MultiBlockIntrinsics.
>
> The performance is measured on EC2 m6g instance (Graviton2) and shows 18-25% improvement.
> Baseline
>
>     Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error   Units
>     MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  2989.149 ± 54.895  ops/ms
>     MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    24.927 ±  0.002  ops/ms
>     MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  2433.184 ± 74.616  ops/ms
>     MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    24.736 ±  0.002  ops/ms
>
> Optimized
>
>     Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error   Units
>     MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  3719.214 ± 23.087  ops/ms
>     MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    31.280 ±  0.003  ops/ms
>     MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  2874.308 ± 88.455  ops/ms
>     MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    31.014 ±  0.060  ops/ms

Marked as reviewed by luhenry (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/11748
From roland at openjdk.org Wed Dec 21 14:50:03 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 21 Dec 2022 14:50:03 GMT Subject: RFR: 8297724: Loop strip mining prevents some empty loops from being eliminated [v2] In-Reply-To: References: <0h64UphIm15r5MArLZeVhafwWGb6iXMsNreW9lqai3A=.8d4955e7-cfc9-4e6b-a4b4-72bd1f051c9f@github.com> Message-ID: On Thu, 15 Dec 2022 20:52:22 GMT, Vladimir Kozlov wrote:

>> Roland Westrelin has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - Update test/hotspot/jtreg/compiler/c2/irTests/TestLSMMissedEmptyLoop.java
>>
>> Co-authored-by: Tobias Hartmann
>> - Update src/hotspot/share/opto/loopTransform.cpp
>>
>> Co-authored-by: Tobias Hartmann
>
> Looks good to me. Thank you for fixing it.

@vnkozlov @TobiHartmann thanks for the reviews (and testing).
------------- PR: https://git.openjdk.org/jdk/pull/11699 From roland at openjdk.org Wed Dec 21 14:50:06 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 21 Dec 2022 14:50:06 GMT Subject: Integrated: 8297724: Loop strip mining prevents some empty loops from being eliminated In-Reply-To: <0h64UphIm15r5MArLZeVhafwWGb6iXMsNreW9lqai3A=.8d4955e7-cfc9-4e6b-a4b4-72bd1f051c9f@github.com> References: <0h64UphIm15r5MArLZeVhafwWGb6iXMsNreW9lqai3A=.8d4955e7-cfc9-4e6b-a4b4-72bd1f051c9f@github.com> Message-ID: On Thu, 15 Dec 2022 16:43:07 GMT, Roland Westrelin wrote: > When an empty loop is found, it's removed and as a consequence the > outer strip mine loop and the safepoint that it contains are also > removed. A counted loop is empty if it has the minimum number of nodes > that a well formed counted loop contains. In some cases, the loop has > extra nodes and the safepoint in the outer loop is the only node that > keeps those extra nodes alive. If the safepoint was to be removed, > then the counted loop would have the minimum number of nodes and be > considered empty. But the safepoint can't be removed until the loop is > considered empty which only happens if it has the minimum of nodes. As > a result, these loops are not removed. Note that now that the loop > strip mining loop nest is constructed even if UseCountedLoopSafepoints > is false, there's a regression where some loops used to be removed as > empty before but not anymore. > > The fix I propose is to extend IdealLoopTree::do_remove_empty_loop() > so it handles those cases. If it encounters a loop with no flow > control in the loop body but a number of nodes greater than the > minimum number of nodes, it starts from the extra nodes in the loop > body and follows uses until it finds a side effect, ignoring the > safepoint of the outer loop. If it finds none, then the extra nodes > can be removed and the loop is empty. 
This also works if the extra > nodes are kept alive by the safepoints of 2 different counted loops > and one can only be proven empty if the other one is as well (and the > other one proven empty if the first one is) and should work even if > there are more than 2 nodes involved.. This pull request has now been integrated. Changeset: 88bfe4d3 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/88bfe4d3bff5504bb6061d1484325dd6a55f06a2 Stats: 304 lines in 3 files changed: 296 ins; 4 del; 4 mod 8297724: Loop strip mining prevents some empty loops from being eliminated Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/11699 From thartmann at openjdk.org Wed Dec 21 15:04:51 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 21 Dec 2022 15:04:51 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 18:47:26 GMT, Roland Westrelin wrote: > As described in https://github.com/openjdk/jdk20/pull/22, the bug is > caused by the iv phi of a post loop that becomes top but because the > post loop is guarded by an opaque node, the control flow remains > alive. > > The fix I propose is based on this comment Vladimir made: > https://github.com/openjdk/jdk20/pull/22#issuecomment-1349570615 > > When CmpINode::Value() encounters a (CmpI (OpaqueZeroTripGuard in1) > in2), it runs Value() on (CmpI in1 in2) and if it constant folds so > that the loop is not taken, returns that result. Translating "loop not > taken" into an actual CmpI type depends on whether the loop goes up or > down. To make the check above possible, OpaqueZeroTripGuard includes > the BoolTest::mask that causes the loop to be executed at the zero > trip guard. 
> > The new logic in CmpINode::Value() is executed for both the main and > post loop zero trip guards (while the bug was only seen AFAIK with the > post loop) because I expect the same bug to exist with the main loop. > > For the main loop, this works because initially the loop should be > executed and as optimizations proceed and adjust the zero trip guard, > the range of iterations executed in the loop should narrow (and never > widen). We may then end up with no iterations executed in the loop. No > further optimizations would make the main loop executable again. It's > then fine to fold the zero trip guard as we're done with > optimizations. > > This works for the post loop because the compiler has no way to tell > whether it's executed or not as long as there's a main loop: the zero > trip guard then takes as input a phi that merges the pre and main loop > ivs. For the case of a loop going up, the zero trip guard should > initially test whether [init, limit] (the type of phi) is strictly less > than limit. The compiler can't decide what the result of that test > is. As optimizations proceed, the [init, limit] range could become > narrower as I understand and there's no risk for the compiler to > report the post loop as not taken. > > I still believe it's risky to simply drop the OpaqueZeroTripGuard for > the post loop even if it can't constant fold at least because we > wouldn't want the zero trip guard to split thru phi. Looks good to me. Thanks for the detailed explanation and comments. Tests are running, I'll report back once they pass. > I still believe it's risky to simply drop the OpaqueZeroTripGuard for the post loop even if it can't constant fold at least because we wouldn't want the zero trip guard to split thru phi. Should we investigate this for JDK 21? ------------- Marked as reviewed by thartmann (Reviewer).
PR: https://git.openjdk.org/jdk20/pull/65 From roland at openjdk.org Wed Dec 21 15:18:23 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 21 Dec 2022 15:18:23 GMT Subject: RFR: 8298848: C2: clone all of (CmpP (LoadKlass (AddP down at split if [v2] In-Reply-To: References: Message-ID: > As suggested by Vladimir in: > https://github.com/openjdk/jdk/pull/11666 > > Thus extract one for the fixes as a separate PR. The bug as described > in the above PR is: > > The crash occurs because a` (If (Bool (CmpP (LoadKlass ..))))` > only has a single projection. It lost the other projection because of > a `CheckCastPP` that becomes `top`. Initially the pattern is, in pseudo > code: > > > if (obj.klass == some_class) { > obj = CheckCastPP#1(obj); > } > > > `obj` itself is a `CheckCastPP` that's pinned at a dominating if. That > dominating if goes through split through phi. The `LoadKlass` for the > pseudo code above also has control set to the dominating if being > transformed. This result in: > > > if (phi1 == some_class) { > obj = CheckCastPP#1(phi2); > } > > > with` phi1 = (Phi (LoadKlass obj) (LoadKlass obj))` and phi2 = (Phi obj obj) > with `obj = (CheckCastPP#2 obj')` > > `PhiNode::Ideal()` transforms `phi2` into a new `CheckCastPP`: > `(CheckCastPP#3 obj' obj') `with control set to the region right above > the if in the pseudo code above. There happens to be another > `CheckCastPP` at the same control which casts obj' to a narrower > type. So the new `CheckCastPP#3` is replaced by that one (because of > `ConstraintCastNode::dominating_cast()`) and pseudo code becomes: > > > if (phi1 == some_class) { > obj = CheckCastPP#1(CheckCastPP#4(obj')); > } > > > and then: > > > if (phi1 == some_class) { > obj = top; > } > > > because the types of the 2 `CheckCastPP`s conflict. That would be ok if: > > `phi1 == some_class` > > would constant fold. 
It would if the test was: > > `if (CheckCastPP#4(obj').klass == some_klass) { > ` > but because of split if, the `(CmpP (LoadKlass ..))` and the > `CheckCastPP#1` ended up with 2 different object inputs that then were > transformed differently. The fix I propose is to have split if clone the entire: > > `(Bool (CmpP (LoadKlass (AddP ..))))` > > down the same way `(Bool (CmpP ..))` is cloned down. After split if, the > pseudo code becomes: > > > if (phi.klass == some_class) { > obj = CheckCastPP#1(phi); > } > > > The bug can't occur because the `CheckCastPP` and` (CmpP (LoadKlass ..))` > operate on the same phi input. The change in split_if.cpp implements > that. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - fix test on x86 32 bits - Merge branch 'master' into JDK-8298848 - test & fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11689/files - new: https://git.openjdk.org/jdk/pull/11689/files/e4b4c299..31fc40f7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11689&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11689&range=00-01 Stats: 7942 lines in 396 files changed: 4137 ins; 1760 del; 2045 mod Patch: https://git.openjdk.org/jdk/pull/11689.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11689/head:pull/11689 PR: https://git.openjdk.org/jdk/pull/11689 From chagedorn at openjdk.org Wed Dec 21 15:42:57 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 21 Dec 2022 15:42:57 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 18:47:26 GMT, Roland Westrelin wrote: > As described in https://github.com/openjdk/jdk20/pull/22, the bug is > caused by the iv phi of a post loop that becomes top but because 
the > post loop is guarded by an opaque node, the control flow remains > alive. > > The fix I propose is based on this comment Vladimir made: > https://github.com/openjdk/jdk20/pull/22#issuecomment-1349570615 > > When CmpINode::Value() encounters a (CmpI (OpaqueZeroTripGuard in1) > in2), it runs Value() on (CmpI in1 in2) and if it constant folds so > that the loop is not taken, returns that result. Translating "loop not > taken" into an actual CmpI type depends on whether the loop goes up or > down. To make the check above possible, OpaqueZeroTripGuard includes > the BoolTest::mask that causes the loop to be executed at the zero > trip guard. > > The new logic in CmpINode::Value() is executed for both the main and > post loop zero trip guards (while the bug was only seen AFAIK with the > post loop) because I expect the same bug to exist with the main loop. > > For the main loop, this works because initially the loop should be > executed and as optimizations proceed and adjust the zero trip guard, > the range of iterations executed in the loop should narrow (and never > widen). We may then end up with no iterations executed in the loop. No > further optimizations would make the main loop executable again. It's > then fine to fold the zero trip guard as we're done with > optimizations. > > This works for the post loop because the compiler has no way to tell > whether it's executed or not as long as there's a main loop: the zero > trip guard then takes as input a phi that merges the pre and main loop > ivs. For the case of a loop going up, the zero trip guard should > initially test whether [init, limit] (the type of phi) is strictly less > than limit. The compiler can't decide what the result of that test > is. As optimizations proceed, the [init, limit] range could become > narrower as I understand and there's no risk for the compiler to > report the post loop as not taken.
> > I still believe it's risky to simply drop the OpaqueZeroTripGuard for > the post loop even if it can't constant fold at least because we > wouldn't want the zero trip guard to split thru phi. That looks reasonable to me. You should change the bug title as we are now also removing the OpaqueZeroTripGuard node for the main loop and we did not introduce a main and post loop specific opaque node. > > I still believe it's risky to simply drop the OpaqueZeroTripGuard for > > the post loop even if it can't constant fold at least because we > > wouldn't want the zero trip guard to split thru phi. > > Should we investigate this for JDK 21? I think that would be a good idea. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk20/pull/65 From xliu at openjdk.org Wed Dec 21 16:51:57 2022 From: xliu at openjdk.org (Xin Liu) Date: Wed, 21 Dec 2022 16:51:57 GMT Subject: Integrated: 8299061: Using lambda to optimize GraphKit::compute_stack_effects() In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 07:18:05 GMT, Xin Liu wrote: > Some bytecodes donot need rsize in GraphKit::compute_stack_effects(). > This change defines rsize as a lambda and avoid computation if it's not in use. > > We utilize CTW($openjdk/test/hotspot/jtreg/testlibrary/ctw/dist) to test this patch. > We compile 2 builtin modules: java.base and jdk.compiler(javac). > > > JAVA_OPTIONS="-XX:+CITime -XX:-TieredCompilation" ./ctw.sh modules:java.base > JAVA_OPTIONS="-XX:+CITime -XX:-TieredCompilation" ./ctw.sh modules:jdk.compiler > > For jdk.compiler module, the average compilation speed increases from 13236 bytes/s to 13303 bytes/s, or +0.51%. 
> For java.base module, the average compilation speed is almost the same (+0.3%).
>
> | java.base                 | before     | after     | diff  |
> |---------------------------|------------|-----------|-------|
> | C2 Compile Time(s)        | 341.218    | 339.75    |       |
> | Parse Time(s)             | 125.403    | 125.034   |       |
> | ratio of parse            | 36.75%     | 36.80%    |       |
> | throughput(bytes/s)       | 11841.628* | 11876.831 | 0.30% |
> | Average compilation speed | 11779      | 11816     | 0.31% |
> | Total compiled methods    | 58148      | 58158     |       |
>
> | jdk.compiler              | before     | after     | diff  |
> |---------------------------|------------|-----------|-------|
> | C2 Compile Time(s)        | 66.486     | 66.087    |       |
> | Parse Time(s)             | 21.877     | 21.731    |       |
> | ratio of parse            | 32.90%     | 32.88%    |       |
> | throughput(bytes/s)       | 14342.883  | 14436.494 | 0.65% |
> | Average compilation speed | 13236      | 13303     | 0.51% |
> | Total compiled methods    | 13734      | 13730     |       |
>
> *Throughput is reported from the Tier4 row. It's very close to 'Average compilation speed' but not exactly the same. E.g. here it is for the java.base module.
>
> before:
> Tier4 {speed: 11841.628 bytes/s; standard: 339.754 s, 4014210 bytes, 58115 methods; osr: 0.345 s, 13109 bytes, 33 methods; nmethods_size: 52147560 bytes; nmethods_code_size: 32146360 bytes}
> after:
> Tier4 {speed: 11876.831 bytes/s; standard: 338.350 s, 4009949 bytes, 58126 methods; osr: 0.368 s, 12951 bytes, 32 methods; nmethods_size: 52122744 bytes; nmethods_code_size: 32124952 bytes}

This pull request has now been integrated.
Changeset: 10d62fa2 Author: Xin Liu URL: https://git.openjdk.org/jdk/commit/10d62fa2183c0ed252ad0a9a743ae6a7710f9a95 Stats: 17 lines in 1 file changed: 6 ins; 6 del; 5 mod 8299061: Using lambda to optimize GraphKit::compute_stack_effects() Reviewed-by: kbarrett, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/11737 From kvn at openjdk.org Wed Dec 21 18:41:50 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 21 Dec 2022 18:41:50 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 18:47:26 GMT, Roland Westrelin wrote: > As described in https://github.com/openjdk/jdk20/pull/22, the bug is > caused by the iv phi of a post loop that becomes top but because the > post loop is guarded by an opaque node, the control flow remains > alive. > > The fix I propose is based on this comment Vladimir made: > https://github.com/openjdk/jdk20/pull/22#issuecomment-1349570615 > > When CmpINode::Value() encounters a (CmpI (OpaqueZeroTripGuard in1) > in2), it runs Value() on (CmpI in1 in2) and if it constant folds so > that the loop is not taken, returns that result. Translating "loop not > taken" into an actual CmpI type depends on whether the loop goes up or > down. To make the check above possible, OpaqueZeroTripGuard includes > the BoolTest::mask that causes the loop to be executed at the zero > trip guard. > > The new logic in CmpINode::Value() is executed for both the main and > post loop zero trip guards (while the bug was only seen AFAIK with the > post loop) because I expect the same bug to exist with the main loop. > > For the main loop, this works because initially the loop should be > executed and as optimizations proceed and adjust the zero trip guard, > the range of iterations executed in the loop should narrow (and never > widen). We may then end up with no iterations executed in the loop. 
No > further optimizations would make the main loop executable again. It's > then fine to fold the zero trip guard as we're done with > optimizations. > > This works for the post loop because the compiler has no way to tell > whether it's executed or not as long as there's a main loop: the zero > trip guard then takes as input a phi that merges the pre and main loop > ivs. For the case of a loop going up, the zero trip guard should > initially test whether [init, limit] (the type of phi) is strictly less > than limit. The compiler can't decide what the result of that test > is. As optimizations proceed, the [init, limit] range could become > narrower as I understand and there's no risk for the compiler to > report the post loop as not taken. > > I still believe it's risky to simply drop the OpaqueZeroTripGuard for > the post loop even if it can't constant fold at least because we > wouldn't want the zero trip guard to split thru phi. Nice. ------------- Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.org/jdk/pull/11753 From kbarrett at openjdk.org Thu Dec 22 02:32:48 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Thu, 22 Dec 2022 02:32:48 GMT Subject: RFR: 8299191: Unnecessarily global friend functions for relocInfo In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 11:31:19 GMT, Christian Hagedorn wrote: >> Please review this small cleanup around the relocInfo class. It declares a >> couple of global functions as friends, so they have access to private >> constructors and helper functions. But there is no reason for these functions >> to be at global scope. It is more natural for them to be static factory >> functions in relocInfo. >> >> Testing: >> mach5 tier1 > > src/hotspot/share/code/relocInfo.cpp line 89: > >> 87: } >> 88: // cannot compact, so just update the count and return the limit pointer >> 89: (*this) = prefix_info(plen); // write new datalen > > Just a minor thing: Is there a specific reason for these additional whitespaces? No, I've no idea where those spaces came from. I thought maybe I'd hit `meta-;` (emacs indent for comment), but nope, that puts a different number of spaces there. And it's not manually lined up with other comments in the function. ------------- PR: https://git.openjdk.org/jdk/pull/11753 From thartmann at openjdk.org Thu Dec 22 06:03:57 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 22 Dec 2022 06:03:57 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: References: Message-ID: <42RnkJNJuhewmZ6PJQw9LiVF4O_XRGjDmzHkyHPj98c=.f0bd094b-1e0c-4e8c-ac7e-b7c883f1641e@github.com> On Tue, 20 Dec 2022 18:47:26 GMT, Roland Westrelin wrote: > As described in https://github.com/openjdk/jdk20/pull/22, the bug is > caused by the iv phi of a post loop that becomes top but because the > post loop is guarded by an opaque node, the control flow remains > alive. 
> [...]
> > I still believe it's risky to simply drop the OpaqueZeroTripGuard for > the post loop even if it can't constant fold at least because we > wouldn't want the zero trip guard to split thru phi. All tests passed. ------------- PR: https://git.openjdk.org/jdk20/pull/65 From roland at openjdk.org Thu Dec 22 08:05:52 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 22 Dec 2022 08:05:52 GMT Subject: RFR: 8298848: C2: clone all of (CmpP (LoadKlass (AddP down at split if [v2] In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 15:18:23 GMT, Roland Westrelin wrote: >> As suggested by Vladimir in: >> https://github.com/openjdk/jdk/pull/11666 >> >> This extracts one of the fixes as a separate PR. The bug as described >> in the above PR is: >> >> The crash occurs because a `(If (Bool (CmpP (LoadKlass ..))))` >> only has a single projection. It lost the other projection because of >> a `CheckCastPP` that becomes `top`. Initially the pattern is, in pseudo >> code: >> >> >> if (obj.klass == some_class) { >> obj = CheckCastPP#1(obj); >> } >> >> >> `obj` itself is a `CheckCastPP` that's pinned at a dominating if. That >> dominating if goes through split through phi. The `LoadKlass` for the >> pseudo code above also has control set to the dominating if being >> transformed. This results in: >> >> >> if (phi1 == some_class) { >> obj = CheckCastPP#1(phi2); >> } >> >> >> with `phi1 = (Phi (LoadKlass obj) (LoadKlass obj))` and `phi2 = (Phi obj obj)` >> with `obj = (CheckCastPP#2 obj')` >> >> `PhiNode::Ideal()` transforms `phi2` into a new `CheckCastPP`: >> `(CheckCastPP#3 obj' obj')` with control set to the region right above >> the if in the pseudo code above. There happens to be another >> `CheckCastPP` at the same control which casts obj' to a narrower >> type.
So the new `CheckCastPP#3` is replaced by that one (because of >> `ConstraintCastNode::dominating_cast()`) and pseudo code becomes: >> >> >> if (phi1 == some_class) { >> obj = CheckCastPP#1(CheckCastPP#4(obj')); >> } >> >> >> and then: >> >> >> if (phi1 == some_class) { >> obj = top; >> } >> >> >> because the types of the 2 `CheckCastPP`s conflict. That would be ok if: >> >> `phi1 == some_class` >> >> would constant fold. It would if the test was: >> >> `if (CheckCastPP#4(obj').klass == some_class) {` >> >> but because of split if, the `(CmpP (LoadKlass ..))` and the >> `CheckCastPP#1` ended up with 2 different object inputs that then were >> transformed differently. The fix I propose is to have split if clone the entire: >> >> `(Bool (CmpP (LoadKlass (AddP ..))))` >> >> down the same way `(Bool (CmpP ..))` is cloned down. After split if, the >> pseudo code becomes: >> >> >> if (phi.klass == some_class) { >> obj = CheckCastPP#1(phi); >> } >> >> >> The bug can't occur because the `CheckCastPP` and `(CmpP (LoadKlass ..))` >> operate on the same phi input. The change in split_if.cpp implements >> that. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: > > - fix test on x86 32 bits > - Merge branch 'master' into JDK-8298848 > - test & fix I added a `-XX:+IgnoreUnrecognizedVMOptions` to the test case so it runs on 32 bits (compressed oops not supported there) ------------- PR: https://git.openjdk.org/jdk/pull/11689 From roland at openjdk.org Thu Dec 22 08:58:58 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 22 Dec 2022 08:58:58 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: <42RnkJNJuhewmZ6PJQw9LiVF4O_XRGjDmzHkyHPj98c=.f0bd094b-1e0c-4e8c-ac7e-b7c883f1641e@github.com> References: <42RnkJNJuhewmZ6PJQw9LiVF4O_XRGjDmzHkyHPj98c=.f0bd094b-1e0c-4e8c-ac7e-b7c883f1641e@github.com> Message-ID: <7-vHszZKDM64eN_CnPsmGHcmYwIfGiNRXAqw6kKsyG4=.e9ac232b-e4db-49f5-b380-7745745dc772@github.com> On Thu, 22 Dec 2022 06:01:24 GMT, Tobias Hartmann wrote: >> As described in https://github.com/openjdk/jdk20/pull/22, the bug is >> caused by the iv phi of a post loop that becomes top but because the >> post loop is guarded by an opaque node, the control flow remains >> alive. >> >> The fix I propose is based on this comment Vladimir made: >> https://github.com/openjdk/jdk20/pull/22#issuecomment-1349570615 >> >> When CmpINode::Value() encounters a (CmpI (OpaqueZeroTripGuard in1) >> in2), it runs Value() on (CmpI in1 in2) and if it constant folds so >> that the loop is not taken, returns that result. Translating "loop not >> taken" into an actual CmpI type depends on whether the loop goes up or >> down. To make the check above possible, OpaqueZeroTripGuard includes >> the BoolTest::mask that causes the loop to be executed at the zero >> trip guard. >> >> The new logic in CmpINode::Value() is executed for both the main and >> post loop zero trip guards (while the bug was only seen AFAIK with the >> post loop) because I expect the same bug to exist with the main loop. 
>> >> For the main loop, this works because initially the loop should be >> executed and as optimizations proceed and adjust the zero trip guard, >> the range of iterations executed in the loop should narrow (and never >> widen). We may then end up with no iterations executed in the loop. No >> further optimizations would make the main loop executable again. It's >> then fine to fold the zero trip guard as we're done with >> optimizations. >> >> This works for the post loop because the compiler has no way to tell >> whether it's executed or not as long as there's a main loop: the zero >> trip guard then takes as input a phi that merges the pre and main loop >> ivs. For the case of a loop going up, the zero trip guard should >> initially test whether [init, limit] (the type of phi) is strictly less >> than limit. The compiler can't decide what the result of that test >> is. As optimizations proceed, the [init, limit] range could become >> narrower as I understand and there's no risk for the compiler to >> report the post loop as not taken. >> >> I still believe it's risky to simply drop the OpaqueZeroTripGuard for >> the post loop even if it can't constant fold at least because we >> wouldn't want the zero trip guard to split thru phi. > > All tests passed. @TobiHartmann @chhagedorn @vnkozlov thanks for the reviews. ------------- PR: https://git.openjdk.org/jdk20/pull/65 From roland at openjdk.org Thu Dec 22 08:59:00 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 22 Dec 2022 08:59:00 GMT Subject: [jdk20] Integrated: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 18:47:26 GMT, Roland Westrelin wrote: > As described in https://github.com/openjdk/jdk20/pull/22, the bug is > caused by the iv phi of a post loop that becomes top but because the > post loop is guarded by an opaque node, the control flow remains > alive.
> [...]
> > I still believe it's risky to simply drop the OpaqueZeroTripGuard for > the post loop even if it can't constant fold at least because we > wouldn't want the zero trip guard to split thru phi. This pull request has now been integrated. Changeset: a0a09d56 Author: Roland Westrelin URL: https://git.openjdk.org/jdk20/commit/a0a09d56ba4fc6133b423ad29d86fc99dd6dc19b Stats: 191 lines in 7 files changed: 188 ins; 0 del; 3 mod 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears Reviewed-by: thartmann, chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk20/pull/65 From roland at openjdk.org Thu Dec 22 09:07:55 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 22 Dec 2022 09:07:55 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: References: Message-ID: <1eGJHIDdmnfPKx19116PjHmNKXajvlvhLMPN0NXsRCQ=.e1499bfe-c5f4-465f-a461-71f3a36019e7@github.com> On Wed, 21 Dec 2022 15:40:12 GMT, Christian Hagedorn wrote: > You should change the bug title as we are now also removing the OpaqueZeroTripGuard node for the main loop and we did not introduce a main and post loop specific opaque node. Oops. I integrated without addressing this comment. I don't think there's anything that can be done now. Sorry about that. ------------- PR: https://git.openjdk.org/jdk20/pull/65 From chagedorn at openjdk.org Thu Dec 22 09:21:02 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 22 Dec 2022 09:21:02 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 18:47:26 GMT, Roland Westrelin wrote: > As described in https://github.com/openjdk/jdk20/pull/22, the bug is > caused by the iv phi of a post loop that becomes top but because the > post loop is guarded by an opaque node, the control flow remains > alive. 
> [...]
> > I still believe it's risky to simply drop the OpaqueZeroTripGuard for > the post loop even if it can't constant fold at least because we > wouldn't want the zero trip guard to split thru phi. No worries :-) ------------- PR: https://git.openjdk.org/jdk20/pull/65 From roland at openjdk.org Thu Dec 22 10:49:05 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 22 Dec 2022 10:49:05 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 15:40:12 GMT, Christian Hagedorn wrote: > > I still believe it's risky to simply drop the OpaqueZeroTripGuard for > > the post loop even if it can't constant fold at least because we > > wouldn't want the zero trip guard to split thru phi. > > Should we investigate this for JDK 21? http://cr.openjdk.java.net/~roland/TestCountedLoop.java is a test case that fails with the opaque node of the zero trip guard for the post loop removed (incorrect execution). It only fails with a 32 bit build because split if is prevented on 64 bits by a CastII (see merge_point_safe()). What happens is that right after PeelMainPost, split if pushes the post loop zero trip guard through the region that dominates. The copy of the zero trip guard that's now right after the main loop tests the same condition as the main loop exit and so is removed as redundant. The post loop becomes reachable only from the zero trip guard above the main loop (but that one folds after loop opts is done and then there's no more post loop). Unrolling then proceeds but there's no post loop to execute the remaining iterations once the main loop is done.
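To make the failure mode above concrete, here is a hand-written, source-level analogy in Java. It is not C2 IR and it is not the linked test case. The names `PreMainPostSketch`, `sumSimple`, `sumSplit` and the `keepPostLoop` flag are made up for illustration; the flag only mimics the post loop becoming unreachable after the main loop and does not model how split if gets there.

```java
final class PreMainPostSketch {
    /** Reference: the original counted loop. */
    static long sumSimple(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    /**
     * Same loop written as pre / main / post loops, each behind its own
     * zero trip guard. keepPostLoop is a hypothetical switch that mimics
     * the post loop no longer being reachable once the main loop is done.
     */
    static long sumSplit(int[] a, boolean keepPostLoop) {
        long s = 0;
        int i = 0;
        int limit = a.length;

        // Pre loop: a single iteration, if the loop runs at all.
        if (i < limit) {
            s += a[i];
            i++;
        }

        // Zero trip guard, then main loop unrolled by 4.
        if (i + 3 < limit) {
            do {
                s += a[i];
                s += a[i + 1];
                s += a[i + 2];
                s += a[i + 3];
                i += 4;
            } while (i + 3 < limit);
        }

        // Zero trip guard, then post loop for the leftover iterations.
        if (keepPostLoop && i < limit) {
            do {
                s += a[i];
                i++;
            } while (i < limit);
        }
        return s;
    }

    public static void main(String[] args) {
        int[] a = new int[10];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
        }
        System.out.println(sumSimple(a));       // 45
        System.out.println(sumSplit(a, true));  // 45
        System.out.println(sumSplit(a, false)); // 36: the last element is never visited
    }
}
```

With ten elements the unrolled main loop stops at `i == 9`, so without the post loop one iteration is silently dropped, which is the kind of incorrect execution described above.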
------------- PR: https://git.openjdk.org/jdk20/pull/65 From thartmann at openjdk.org Thu Dec 22 11:26:51 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 22 Dec 2022 11:26:51 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 18:47:26 GMT, Roland Westrelin wrote: > As described in https://github.com/openjdk/jdk20/pull/22, the bug is > caused by the iv phi of a post loop that becomes top but because the > post loop is guarded by an opaque node, the control flow remains > alive. > > The fix I propose is based on this comment Vladimir made: > https://github.com/openjdk/jdk20/pull/22#issuecomment-1349570615 > > When CmpINode::Value() encounters a (CmpI (OpaqueZeroTripGuard in1) > in2), it runs Value() on (CmpI in1 in2) and if it constant folds so > that the loop is not taken, returns that result. Translating "loop not > taken" into an actual CmpI type depends on whether the loop goes up or > down. To make the check above possible, OpaqueZeroTripGuard includes > the BoolTest::mask that causes the loop to be executed at the zero > trip guard. > > The new logic in CmpINode::Value() is executed for both the main and > post loop zero trip guards (while the bug was only seen AFAIK with the > post loop) because I expect the same bug to exist with the main loop. > > For the main loop, this works because initially the loop should be > executed and as optimizations proceed and adjust the zero trip guard, > the range of iterations executed in the loop should narrow (and never > widen). We may then end up with no iterations executed in the loop. No > further optimizations would make the main loop executable again. It's > then fine to fold the zero trip guard as we're done with > optimizations. 
> > This works for the post loop because the compiler has no way to tell > whether it's executed or not as long as there's a main loop: the zero > trip guard then takes as input a phi that merges the pre and main loop > ivs. For the case of a loop going up, the zero trip guard should > initially test whether [init, limit] (the type of phi) is strictly less > than limit. The compiler can't decide what the result of that test > is. As optimizations proceed, the [init, limit] range could become > narrower as I understand and there's no risk for the compiler to > report the post loop as not taken. > > I still believe it's risky to simply drop the OpaqueZeroTripGuard for > the post loop even if it can't constant fold at least because we > wouldn't want the zero trip guard to split thru phi. Thanks for investigating, Roland. If the sole purpose of the OpaqueZeroTripGuard is to prevent split thru phi, I'm wondering if it wouldn't be better to special case the zero trip guard there and prevent it from being split through? ------------- PR: https://git.openjdk.org/jdk20/pull/65 From roland at openjdk.org Thu Dec 22 11:39:58 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 22 Dec 2022 11:39:58 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: References: Message-ID: <1QtaLx5AEnQEq1DraQ8Ep4HlhsEkuLOlBjOACehjgRw=.7f1ae3a2-5a16-4841-bf3d-ca69b0509588@github.com> On Thu, 22 Dec 2022 11:24:11 GMT, Tobias Hartmann wrote: > Thanks for investigating, Roland. If the sole purpose of the OpaqueZeroTripGuard is to prevent split thru phi, I'm wondering if it wouldn't be better to special case the zero trip guard there and prevent it from being split through? How do we do that though? How can we tell at split if that the (If (Bool (CmpI ..))) is a zero trip guard?
------------- PR: https://git.openjdk.org/jdk20/pull/65 From thartmann at openjdk.org Thu Dec 22 11:49:03 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 22 Dec 2022 11:49:03 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 18:47:26 GMT, Roland Westrelin wrote: > As described in https://github.com/openjdk/jdk20/pull/22, the bug is > caused by the iv phi of a post loop that becomes top but because the > post loop is guarded by an opaque node, the control flow remains > alive. > > The fix I propose is based on this comment Vladimir made: > https://github.com/openjdk/jdk20/pull/22#issuecomment-1349570615 > > When CmpINode::Value() encounters a (CmpI (OpaqueZeroTripGuard in1) > in2), it runs Value() on (CmpI in1 in2) and if it constant folds so > that the loop is not taken, returns that result. Translating "loop not > taken" into an actual CmpI type depends on whether the loop goes up or > down. To make the check above possible, OpaqueZeroTripGuard includes > the BoolTest::mask that causes the loop to be executed at the zero > trip guard. > > The new logic in CmpINode::Value() is executed for both the main and > post loop zero trip guards (while the bug was only seen AFAIK with the > post loop) because I expect the same bug to exist with the main loop. > > For the main loop, this works because initially the loop should be > executed and as optimizations proceed and adjust the zero trip guard, > the range of iterations executed in the loop should narrow (and never > widen). We may then end up with no iterations executed in the loop. No > further optimizations would make the main loop executable again. It's > then fine to fold the zero trip guard as we're done with > optimizations. 
> > This works for the post loop because the compiler has no way to tell > whether it's executed or not as long as there's a main loop: the zero > trip guard then takes as input a phi that merges the pre and main loop > ivs. For the case of a loop going up, the zero trip guard should > initially test whether [init, limit] (the type of phi) is strictly less > than limit. The compiler can't decide what the result of that test > is. As optimizations proceed, the [init, limit] range could become > narrower as I understand and there's no risk for the compiler to > report the post loop as not taken. > > I still believe it's risky to simply drop the OpaqueZeroTripGuard for > the post loop even if it can't constant fold at least because we > wouldn't want the zero trip guard to split thru phi. Right, that's the difficult part. I was hoping that we could tag the if with a flag or something. But maybe it's not worth it. ------------- PR: https://git.openjdk.org/jdk20/pull/65 From epeter at openjdk.org Thu Dec 22 11:59:54 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 22 Dec 2022 11:59:54 GMT Subject: RFR: 8280126: C2: detect and remove dead irreducible loops Message-ID: <4_BdxWOo_kg-JwxP_qaTj4-JIF_plwvH5QIK3UrBG7A=.ecaaa2ec-38c4-4fd0-baba-5bece223df58@github.com> **Context** If a `LoopNode` loses entry control, we remove it, to prevent having a dead-loop (backedge would be only input to LoopNode): https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/cfgnode.cpp#L541-L544 We must remove such dead code, otherwise all sorts of bad graph patterns can be created, including self-referring Add nodes etc, and that would either hit asserts, or crash the VM.
Also `PhiNode` does some checks to avoid creating a dead-loop: https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/cfgnode.cpp#L2004-L2019 However, all of this logic assumes that we have properly canonicalized reducible loops: every loop-head must be a `LoopNode`, where we have a loop-entry-control, and a backedge-control. Once the loop-entry-control dies, we know the loop is dead-code. **Problem** This dead-loop removal logic does not work for irreducible loops. I found many JASM examples, and even was able to produce a Java reproducer. I have seen these asserts triggered: Self-referencing data node: https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/phaseX.cpp#L943 We find dead-code CFG nodes: https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/loopnode.cpp#L5293 Dead loop of phi's without any input: https://github.com/openjdk/jdk/blob/8c472e481676ed0ef475c4989477d5714880c59e/src/hotspot/share/opto/cfgnode.cpp#L2539 We must remove the loop once it loses its last entry-control. The problem is that an irreducible loop has multiple entries, by definition. Irreducible loops have no `LoopNodes`, they are simply `RegionNodes` with multiple inputs. If one of the controls is lost, it is a priori not clear if this was the last entry-control for the whole irreducible loop.
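The problem can be illustrated on a toy control-flow graph. The following is only a sketch in Java, not C2 code: `IrreducibleLoopSketch` and its methods are hypothetical names, nodes are plain integers, and the reachability walk is the brute-force check, shown here just to make the point that a node inside an irreducible loop can keep an input while being dead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy CFG (not C2's node graph): nodes are ints, edges are successor lists. */
final class IrreducibleLoopSketch {
    private final List<List<Integer>> succ = new ArrayList<>();

    IrreducibleLoopSketch(int nodes) {
        for (int i = 0; i < nodes; i++) {
            succ.add(new ArrayList<>());
        }
    }

    void addEdge(int from, int to) {
        succ.get(from).add(to);
    }

    void removeEdge(int from, int to) {
        succ.get(from).remove(Integer.valueOf(to));
    }

    /** Depth-first walk from root; true if target can still be reached. */
    boolean reachable(int root, int target) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> work = new ArrayDeque<>();
        work.push(root);
        while (!work.isEmpty()) {
            int n = work.pop();
            if (!seen.add(n)) {
                continue; // already visited
            }
            if (n == target) {
                return true;
            }
            for (int s : succ.get(n)) {
                work.push(s);
            }
        }
        return false;
    }

    /** Number of incoming edges of node, dead or alive. */
    int inDegree(int node) {
        int count = 0;
        for (List<Integer> outs : succ) {
            for (int s : outs) {
                if (s == node) {
                    count++;
                }
            }
        }
        return count;
    }

    /** 0 = root, 1 = branch; 2 and 3 form a cycle that is entered at both 2 and 3. */
    static IrreducibleLoopSketch example() {
        IrreducibleLoopSketch g = new IrreducibleLoopSketch(4);
        g.addEdge(0, 1);
        g.addEdge(1, 2); // first entry into the cycle
        g.addEdge(1, 3); // second entry into the cycle
        g.addEdge(2, 3);
        g.addEdge(3, 2);
        return g;
    }

    public static void main(String[] args) {
        IrreducibleLoopSketch g = example();
        g.removeEdge(1, 2);
        System.out.println(g.reachable(0, 2)); // true: still entered through node 3
        g.removeEdge(1, 3);
        System.out.println(g.reachable(0, 2)); // false: the whole cycle is dead
        System.out.println(g.inDegree(2));     // 1: node 2 still has an input from node 3
    }
}
```

After the first entry is removed, node 2 looks like it lost a control input but is still alive. After the second entry is removed it has exactly the same number of inputs, yet the cycle is unreachable; counting inputs alone cannot tell the two situations apart.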
**Solution Summary** We mark every `RegionNode` with one of these three: https://github.com/openjdk/jdk/blob/e2cc4229dd6e696847ebfecb19ab5d4b8621e31d/src/hotspot/share/opto/cfgnode.hpp#L103-L112 We remove irreducible loops like this: https://github.com/openjdk/jdk/blob/e2cc4229dd6e696847ebfecb19ab5d4b8621e31d/src/hotspot/share/opto/cfgnode.cpp#L591-L603 And during `PhaseIdealLoop::build_loop_tree` we verify that all irreducible loop-entries are marked as `MaybeIrreducibleEntry`: https://github.com/openjdk/jdk/blob/e2cc4229dd6e696847ebfecb19ab5d4b8621e31d/src/hotspot/share/opto/loopnode.cpp#L5161-L5162 Additionally, we can verify that no irreducible loop contains regions marked as `Reducible`: https://github.com/openjdk/jdk/blob/e2cc4229dd6e696847ebfecb19ab5d4b8621e31d/src/hotspot/share/opto/loopnode.cpp#L5277-L5279 **Solution Details** During parsing, we call `ciTypeFlow::Block::copy_irreducible_status_to(RegionNode* region, const JVMState* jvms)` for every `RegionNode` created for block-merges. This checks `ciTypeFlow::Block::is_in_irreducible_loop()`, to see if the relevant block of this region is in an irreducible loop (see "Alternative Solutions" below, why I mark all regions inside irreducible loops with `MaybeIrreducibleEntry`). For this to work, I had to slightly improve the irreducible loop detection in `ciTypeFlow`. One could improve the marking after every `PhaseIdealLoop::build_loop_tree`, but that comes at the cost of more compile-time, so I did not implement it. In some cases, new regions are created that need to be marked with `MaybeIrreducibleEntry`. For example `IdealLoopTree::split_fall_in` can split an irreducible loop head, after which both are irreducible loop entries: https://github.com/openjdk/jdk/blob/e2cc4229dd6e696847ebfecb19ab5d4b8621e31d/src/hotspot/share/opto/loopnode.cpp#L3143-L3144 Regions for which we do not explicitly set `MaybeIrreducibleEntry` are marked as `NeverIrreducibleEntry`.
We do this for any new regions that are added, for example by the `GraphKit`. They all seem to be safe, as those regions can never become irreducible loop entries. I tried to mark as many regions as possible with `Reducible`, so that we can do stronger asserts. If no enclosing loop of a region is irreducible, we mark it `Reducible`, and assert that it will never be inside an irreducible loop. However, checking for an outer irreducible loop turns out to be tricky when we have inlining: regions in the inner method need to check if they are in an irreducible loop of the outer method. I tried to implement this, but found it to be too difficult (we would need to find the block in the outer method where the inlining of the inner method happens. An additional complication is that the method only stores the non-OSR ciTypeFlow, even if the outer method is OSR compiled - thus the irreducible status can be inaccurate). So I just mark the nodes as `NeverIrreducibleEntry` if they are not in an irreducible loop in the current loop, and there is an outer loop. This is safe (they can never be irreducible entries): the region would have to merge a "backedge" and an "entry", both separately entering the inlined method, but there is only a single entry point to the inlined method. Also `PhiNodes` have to be handled more carefully, hence I block phi's in irreducible loops from acting on `TOP` inputs until the `Region` has a chance to react: https://github.com/openjdk/jdk/blob/e2cc4229dd6e696847ebfecb19ab5d4b8621e31d/src/hotspot/share/opto/cfgnode.cpp#L2048-L2054 When we detect a subgraph is a dead-loop, we remove it with https://github.com/openjdk/jdk/blob/e2cc4229dd6e696847ebfecb19ab5d4b8621e31d/src/hotspot/share/opto/cfgnode.cpp#L797 I had to refactor it a bit to be more aggressive (first gather all dead nodes, only then remove them), which led to the discovery of a few optimizations that could not deal with TOP inputs.
It has been known that `split_flow_path` can cause new irreducible loops to appear [JDK-6742111](https://bugs.openjdk.org/browse/JDK-6742111). With my marking of regions and the verification of it, we cannot allow new irreducible loops to appear. However, from the testing and fuzzing I have performed, it seems that this can only happen when there is already another irreducible loop in the graph. Thus it seems sufficient to disable the optimization if there are already irreducible loops present: https://github.com/openjdk/jdk/blob/e2cc4229dd6e696847ebfecb19ab5d4b8621e31d/src/hotspot/share/opto/cfgnode.cpp#L1841-L1853 **Alternative Solutions** It is the long-term goal to remove irreducible loops from the graph, either by node-splitting or the dispatcher approach. So for now, we want a fix that works and is not too complex. If this fix turns out to be too slow, especially because of additional reachability traversals, then we may need to revisit alternatives. A first and most brute-force approach would have been to simply do a reachability check for all regions, once an input control is lost. Or only do it if there is an irreducible loop anywhere in the graph. But that would clearly lead to some slowdown for OSR compilation, as they often have some irreducible loop in the graph. It is thus better to limit the number of nodes that need to check reachability. _Why did I not exclusively mark regions that are irreducible loop entries?_ An entry of an irreducible loop can lose all internal edges ("backedges"), collapse, and float outside the loop. The entry is now further down the CFG from the old entry, possibly through a series of if/region. One could attempt to move the marking to the new entries, but that would be a complex task. An example can be found in regression test `test_009`. _Can we identify the smallest set of nodes that would ever be irreducible entries?_ This is tricky. `build_loop_tree` finds loop-heads, which would certainly all have to be marked.
However, finding all secondary entries is something one would have to do. Currently, when we find a second entry, we stop there, but to determine all secondary entries one would probably have to traverse further into the loop again. One might be able to mark fewer nodes in some cases where we have reducible loops inside irreducible loops, where the loop-head of the (inner) reducible loop is not an entry of the (outer) irreducible loop. It is not yet clear to me how to do that, and whether it really speeds things up enough to justify the added complexity.

**Testing**

This bug was reported with a modified classfile; as far as we know, the bytecode must have been modified/fuzzed. I then wrote my own bytecode fuzzer that produces JASM code with irreducible loops, and quickly found all sorts of similar failures. I am already working on porting this fuzzer to Java [JDK-8299214](https://bugs.openjdk.org/browse/JDK-8299214), and hope to integrate it into testing. It did not just find issues with irreducible loops, but also with infinite loops. I added many tests, some reduced down from that fuzzer, some hand-crafted. So far, this change passes stress-testing and tier1-tier4. I will test up to tier7 soon.

**TODO**: once we have agreement on the patch, only run `verify_regions_in_irreducible_loops` during the loopopts verification phase. For now I leave it in to improve testing capability, at the cost of extra runtime.
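
The reachability check and the "first gather all dead nodes, only then remove them" idea from the description above can be illustrated on a toy control-flow graph. This is only a sketch under loud assumptions: the class `DeadLoopSketch`, its methods, and the integer node ids are all hypothetical and have nothing to do with the actual C2 data structures or the code in `cfgnode.cpp`. It models a loop `{2, 3}` whose entry edges have already been cut, so each loop node still has an input from the other one, yet neither is reachable from the root.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model, NOT C2 code: a CFG as a map from node id to successor ids.
final class DeadLoopSketch {

    // Collect every node reachable from 'root' by following successor edges.
    static Set<Integer> reachableFrom(int root, Map<Integer, List<Integer>> succ) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> work = new ArrayDeque<>();
        work.push(root);
        while (!work.isEmpty()) {
            int n = work.pop();
            if (!seen.add(n)) {
                continue; // already visited
            }
            for (int s : succ.getOrDefault(n, List.of())) {
                work.push(s);
            }
        }
        return seen;
    }

    // Two-phase removal: first gather all dead nodes, only then remove them.
    static Set<Integer> removeDead(int root, Map<Integer, List<Integer>> succ) {
        Set<Integer> live = reachableFrom(root, succ);
        // Phase 1: gather (no mutation of the graph yet).
        Set<Integer> dead = new TreeSet<>(succ.keySet());
        dead.removeAll(live);
        // Phase 2: remove. Live nodes cannot have dead successors,
        // so dropping the dead nodes' own entries is sufficient here.
        for (int d : dead) {
            succ.remove(d);
        }
        return dead;
    }

    // Root 0 -> 1 -> 4. The former loop {2, 3} lost both entry edges from node 1.
    static Map<Integer, List<Integer>> exampleGraph() {
        Map<Integer, List<Integer>> succ = new HashMap<>();
        succ.put(0, new ArrayList<>(List.of(1)));
        succ.put(1, new ArrayList<>(List.of(4)));
        succ.put(2, new ArrayList<>(List.of(3)));
        succ.put(3, new ArrayList<>(List.of(2, 4)));
        succ.put(4, new ArrayList<>());
        return succ;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> g = exampleGraph();
        System.out.println("dead nodes: " + removeDead(0, g));
        System.out.println("remaining:  " + new TreeSet<>(g.keySet()));
    }
}
```

A local check such as "does this region still have an input?" would not find the dead loop in this toy graph, because nodes 2 and 3 keep each other alive; only a traversal from the root does. That is also why limiting the set of regions that trigger such a traversal matters for compile time.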
------------- Commit messages: - 8280126: C2: detect and remove dead irreducible loops Changes: https://git.openjdk.org/jdk/pull/11764/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11764&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8280126 Stats: 2138 lines in 12 files changed: 2069 ins; 46 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/11764.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11764/head:pull/11764 PR: https://git.openjdk.org/jdk/pull/11764 From chagedorn at openjdk.org Thu Dec 22 15:18:49 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 22 Dec 2022 15:18:49 GMT Subject: RFR: 8299191: Unnecessarily global friend functions for relocInfo In-Reply-To: References: Message-ID: On Thu, 22 Dec 2022 02:29:56 GMT, Kim Barrett wrote: >> src/hotspot/share/code/relocInfo.cpp line 89: >> >>> 87: } >>> 88: // cannot compact, so just update the count and return the limit pointer >>> 89: (*this) = prefix_info(plen); // write new datalen >> >> Just a minor thing: Is there a specific reason for these additional whitespaces? > > No, I've no idea where those spaces came from. I thought maybe I'd hit `meta-;` (emacs indent for comment), but nope, that puts a different number of spaces there. And it's not manually lined up with other comments in the function. Okay :-) ------------- PR: https://git.openjdk.org/jdk/pull/11753 From kbarrett at openjdk.org Thu Dec 22 17:35:16 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Thu, 22 Dec 2022 17:35:16 GMT Subject: RFR: 8299191: Unnecessarily global friend functions for relocInfo [v2] In-Reply-To: References: Message-ID: > Please review this small cleanup around the relocInfo class. It declares a > couple of global functions as friends, so they have access to private > constructors and helper functions. But there is no reason for these functions > to be at global scope. It is more natural for them to be static factory > functions in relocInfo. 
> > Testing: > mach5 tier1 Kim Barrett has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'master' into relocinfo-friends - make friend functions instead be static members ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11753/files - new: https://git.openjdk.org/jdk/pull/11753/files/90893679..86f991f3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11753&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11753&range=00-01 Stats: 4233 lines in 239 files changed: 2810 ins; 757 del; 666 mod Patch: https://git.openjdk.org/jdk/pull/11753.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11753/head:pull/11753 PR: https://git.openjdk.org/jdk/pull/11753 From kbarrett at openjdk.org Thu Dec 22 17:35:16 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Thu, 22 Dec 2022 17:35:16 GMT Subject: RFR: 8299191: Unnecessarily global friend functions for relocInfo [v2] In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 18:47:19 GMT, Vladimir Kozlov wrote: >> Kim Barrett has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: >> >> - Merge branch 'master' into relocinfo-friends >> - make friend functions instead be static members > > Good. 
Thanks for reviews @vnkozlov and @chhagedorn ------------- PR: https://git.openjdk.org/jdk/pull/11753 From kbarrett at openjdk.org Thu Dec 22 17:35:17 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Thu, 22 Dec 2022 17:35:17 GMT Subject: Integrated: 8299191: Unnecessarily global friend functions for relocInfo In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 10:17:12 GMT, Kim Barrett wrote: > Please review this small cleanup around the relocInfo class. It declares a > couple of global functions as friends, so they have access to private > constructors and helper functions. But there is no reason for these functions > to be at global scope. It is more natural for them to be static factory > functions in relocInfo. > > Testing: > mach5 tier1 This pull request has now been integrated. Changeset: 62a033ec Author: Kim Barrett URL: https://git.openjdk.org/jdk/commit/62a033ecd7058f4a4354ebdcd667b3d7991e1f3d Stats: 23 lines in 3 files changed: 5 ins; 12 del; 6 mod 8299191: Unnecessarily global friend functions for relocInfo Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/11753 From jrose at openjdk.org Thu Dec 22 20:20:57 2022 From: jrose at openjdk.org (John R Rose) Date: Thu, 22 Dec 2022 20:20:57 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v18] In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 13:51:45 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > cleanup, rename and some testing I am sorry, but we need to pump the brakes here. 
This proposal is not acceptable in its current form. There's a classic tradeoff between writing specialized encodings for each kind of data, versus using shared encodings. The specialized encodings might be more compact and/or faster to read or write because they are tuned for a particular kind of data. The shared encodings might be less compact, etc., but because they are shared the work of learning, documenting, analyzing, and testing them is done once. That's why people use LEB128 as a varint in so many places: It is not the best varint for most purposes, but it is easy to learn and well-understood. You know what you are getting when you use it. These varint formats are rather easy to create. Here we have one that has good statistics for a particular kind of data, in the cases available today we have tested. But if we accept it into our source base, maintainers will have to learn to deal with it in the future. If the statistics of the data change (this will surely happen at some point) then what is a nice local optimization will become an analysis problem, a black box that maintainers will not want to touch. To address such maintenance problems, we have picked off-the-shelf algorithms, documented and implemented them carefully, and reused them in our code base. In HotSpot, we are using UNSIGNED5 (from Pack200 in the Java ecosystem) as an internal varint, yet another one. It is slightly better than LEB128. (We also use UTF8, of course, for similar purposes.) We have factored out the algorithm so that it can be shared in more places in HotSpot. For example, there is a line of work where field-info structures will be compressed with var-ints. And there may be more to come; the reloc-info streams come to my mind as candidates for re-encoding. Having a centralized varint mechanism is the smart call, because it can be maintained on behalf of multiple use cases. And if it needs improvement or tuning, such changes will benefit all users. 
By contrast, using a local compression scheme for a particular kind of data, as proposed here, creates a separate account of technical debt in each place. So, let's not.

Instead, if we notice that we have zero-rich data, let's first try to do something that composes with the existing varint scheme (UNSIGNED5) and works on top of it, or underneath it. That is, stack a zero-reduction scheme with our varint scheme, instead of simultaneously inventing a zero-reduction scheme and a new varint scheme.

As a very simple example of this, the first N (N=16, say) code points `J=[0..N-1]` of UNSIGNED5 can be special-cased to mean "there are J zero values here". A byte J in that range means that there are J+1 zeroes to decode. A byte beyond that range is first decoded as X0 and then adjusted as `X=X0-N+1`, so that it is a full-range 32-bit integer. The last N-1 values in the range, `[1-N..-1]`, will decode from 5-byte encodings, using 32-bit overflow, which the UNSIGNED5 algorithm is tolerant of.

As another simple example, probably overkill but more similar to the present proposal, the first `N=2^K` code points could be reserved as above (say, `K=4,N=16`), and the decoder state would have not just a count of extra zero values, but rather a bitmask of them. In that case, the bitmask for some byte `J<2^K` would always be initialized to `2*J+1` so that there is always at least one bit set.

And as a different suggestion, the most attractive off-the-shelf zero-reduction scheme I am aware of is in Capn Proto [1]. [1]: It is similar to the zero-reduction scheme proposed here, but is more battle-tested and probably better thought through. I suggest we investigate its use, as a standard compression technique, to compress these streams, preferably as-is, or perhaps adapted; preferably in conjunction with our existing varint scheme, or stand-alone. I suspect (though I am not certain) that Capn Proto's packing scheme would work fine as a byte source for UNSIGNED5.
It is organized as a word-wise algorithm, which allows it to use fully-loaded registers as temps. Stacking the two together would involve streaming through packed words, sticking those words into a small buffer (one cache line is enough), and decoding from there as UNSIGNED5. Again, doing something like this (or either of the two suggestions above), would compose known-good techniques, would reduce maintenance burden, would solve the problem at hand (probably), and would avoid the technical debt of having multiple overlapping varint schemes in our code base. ------------- Changes requested by jrose (Reviewer). PR: https://git.openjdk.org/jdk/pull/10025 From jrose at openjdk.org Thu Dec 22 20:24:56 2022 From: jrose at openjdk.org (John R Rose) Date: Thu, 22 Dec 2022 20:24:56 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v18] In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 13:51:45 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > cleanup, rename and some testing P.S. One reason I know about the Capn Proto packing is as a candidate for fast streaming (de)compression of heap snapshots. We don't have that feature today, but may in the future for CDS and/or Leyden, and all of my arguments about using off-the-shelf techniques will apply there as well. 
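
The first suggestion in the review above (reserving the first N code points of the varint as zero-run markers) can be sketched roughly as follows. This is an illustration only, not HotSpot code: the class `ZeroRunVarint` and all of its methods are hypothetical, the varint layer is a simplified variable-length coding modeled on Pack200's (B=5, H=64) parameters rather than HotSpot's actual UNSIGNED5 implementation, and `N = 16` is just the example value from the text. The sketch accepts only non-negative `int` values and deliberately omits the 32-bit overflow trick for the range `[1-N..-1]`.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: zero-run codes stacked on top of a Pack200-style varint.
final class ZeroRunVarint {
    // Assumed varint parameters: 192 "low" (terminal) bytes, 64 "high" bytes, max 5 bytes.
    static final int L = 192, H = 64, LG_H = 6, MAX_I = 4;
    // Assumed number of code points reserved for zero runs (N = 16 in the text).
    static final int N = 16;

    static void writeVarint(ByteArrayOutputStream out, long value) {
        long sum = value;
        for (int i = 0; ; i++) {
            if (sum < L || i == MAX_I) {
                out.write((int) sum);          // low code, or the fifth byte
                return;
            }
            sum -= L;
            out.write((int) (L + (sum % H)));  // high code carrying 6 bits
            sum >>= LG_H;
        }
    }

    static long readVarint(byte[] buf, int[] pos) {
        long sum = 0;
        for (int i = 0; ; i++) {
            int b = buf[pos[0]++] & 0xFF;
            sum += (long) b << (LG_H * i);     // sum += b[i] * 64^i
            if (b < L || i == MAX_I) {
                return sum;
            }
        }
    }

    static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < values.length) {
            if (values[i] == 0) {
                int run = 1;                   // count up to N consecutive zeros
                while (run < N && i + run < values.length && values[i + run] == 0) {
                    run++;
                }
                writeVarint(out, run - 1);     // code point J means J+1 zeros
                i += run;
            } else {
                if (values[i] < 0) {
                    throw new IllegalArgumentException("sketch handles non-negative values only");
                }
                writeVarint(out, (long) values[i] + N - 1);  // X0 = X + N - 1
                i++;
            }
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] buf) {
        List<Integer> out = new ArrayList<>();
        int[] pos = {0};
        while (pos[0] < buf.length) {
            long x0 = readVarint(buf, pos);
            if (x0 < N) {
                for (long k = 0; k <= x0; k++) {
                    out.add(0);                // J+1 zeros
                }
            } else {
                out.add((int) (x0 - N + 1));   // X = X0 - N + 1
            }
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        int[] data = {5, 0, 0, 0, 7, 0, 300};
        byte[] packed = encode(data);
        System.out.println(data.length + " values -> " + packed.length + " bytes");
        System.out.println(java.util.Arrays.toString(decode(packed)));
    }
}
```

The point of the layering is that the zero-run logic only inspects decoded code points, so the varint reader and writer stay untouched and could in principle be shared with other streams. Whether this achieves a size reduction comparable to the proposal under review would have to be measured on real scopes data.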
------------- PR: https://git.openjdk.org/jdk/pull/10025 From kvn at openjdk.org Fri Dec 23 02:16:02 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 23 Dec 2022 02:16:02 GMT Subject: [jdk20] RFR: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears In-Reply-To: References: Message-ID: On Thu, 22 Dec 2022 11:45:44 GMT, Tobias Hartmann wrote: > Right, that's the difficult part. I was hoping that we could tag the if with a flag or something. But maybe it's not worth it. We can create new `class ZeroTripCheckNode : public IfNode` similar to `RangeCheckNode`. ------------- PR: https://git.openjdk.org/jdk20/pull/65 From xlinzheng at openjdk.org Fri Dec 23 03:57:49 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Fri, 23 Dec 2022 03:57:49 GMT Subject: RFR: 8299172: RISC-V: [TESTBUG] Fix stack alignment logic in jvmci RISCV64TestAssembler.java In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 06:13:04 GMT, Xiaolin Zheng wrote: > We observed a failure in JVMCI tests after `-ea -esa` turned on when running `./test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java`. > > Failure at the line [1]. > > > java.lang.AssertionError > at jdk.vm.ci.code.test.riscv64.RISCV64TestAssembler.emitGrowStack(RISCV64TestAssembler.java:203) > at jdk.vm.ci.code.test.riscv64.RISCV64TestAssembler.emitCallPrologue(RISCV64TestAssembler.java:239) > ... > ... > > > The failure output has been attached to the JBS issue link. > > To be short, the stack alignment should align with `16`, and we can align with the logic in AArch64 [2] and x86_64 [3]. The x86_64 one is inside a recent change. > > Tested along with other patches, and the failed test passed. 
> > Thanks, > Xiaolin > > [1] https://github.com/openjdk/jdk/blob/f56285c3613bb127e22f544bd4b461a0584e9d2a/test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/riscv64/RISCV64TestAssembler.java#L193 > [2] https://github.com/openjdk/jdk/blob/f56285c3613bb127e22f544bd4b461a0584e9d2a/test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/aarch64/AArch64TestAssembler.java#L273-L285 > [3] https://github.com/openjdk/jdk/commit/277f0c24a2e186166bfe70fc93ba79aec10585aa Thanks for reviewing the trivial fix! ------------- PR: https://git.openjdk.org/jdk/pull/11751 From yyang at openjdk.org Fri Dec 23 06:25:01 2022 From: yyang at openjdk.org (Yi Yang) Date: Fri, 23 Dec 2022 06:25:01 GMT Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v2] In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 08:53:10 GMT, Tobias Hartmann wrote: >> Yi Yang has updated the pull request incrementally with one additional commit since the last revision: >> >> fix > > src/hotspot/share/opto/memnode.cpp line 230: > >> 228: ->cast_to_size(t_oop->is_aryptr()->size()) >> 229: ->with_offset(t_oop->is_aryptr()->offset()) >> 230: ->is_aryptr(); > > Do we need `cast_to_stable` as well here? I think we need this even if it does not appear in this case ------------- PR: https://git.openjdk.org/jdk/pull/9777 From yyang at openjdk.org Fri Dec 23 06:32:35 2022 From: yyang at openjdk.org (Yi Yang) Date: Fri, 23 Dec 2022 06:32:35 GMT Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v4] In-Reply-To: References: Message-ID: > Hi can I have a review for this fix? LoadBNode::Ideal crashes after performing GVN right after EA. 
The bad IR is as follows: > > ![image](https://user-images.githubusercontent.com/5010047/183106710-3a518e5e-0b59-4c3c-aba4-8b6fcade3519.png) > > The memory input of Load#971 is Phi#1109 and the address input of Load#971 is AddP whose object base is CheckCastPP#335: > > The type of Phi#1109 is `byte[int:>=0]:exact+any *` while `byte[int:8]:NotNull:exact+any *,iid=177` is the type of CheckCastPP#335 due to EA, they have different alias index, that's why we hit the assertion at L226: > > https://github.com/openjdk/jdk/blob/b17a745d7f55941f02b0bdde83866aa5d32cce07/src/hotspot/share/opto/memnode.cpp#L207-L226 > (t is `byte[int:>=0]:exact+any *`, t_adr is `byte[int:8]:NotNull:exact+any *,iid=177`). > > There is a long story. In the beginning, LoadB#971 is generated at array_copy_forward, and GVN transformed it iteratively: > > 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 
(line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1109 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > ... > > In this case, we get alias index 5 from address input AddP#969, and step it through MergeMem#1046, we found Phi#1109 then, that's why LoadB->in(Mem) is changed from MergeMem#1046 to Phi#1109 (Which finally leads to crash). > > 1046 MergeMem === _ 1 160 389 389 1109 1 1 389 1 1 1 1 1 1 1 1 1 1 1 1 1 709 709 709 709 882 888 894 190 190 912 191 [[ 1025 1021 1017 1013 1009 1005 1002 1001 998 996 991 986 981 976 971 966 962 961 960 121 122 123 124 1027 ]] > > > After applying this patch, some related nodes are pushed into the GVN worklist, before stepping through MergeMem#1046, the address input is already changed to AddP#473. i.e., we get alias index 32 from address input AddP#473, and step it through MergeMem#1046, we found StoreB#191 then,LoadB->in(Mem) is changed from MergeMem#1046 to StoreB#191. 
> > 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1046 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) 
StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 
Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 390 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > ... > > The well-formed IR looks like this: > ![image](https://user-images.githubusercontent.com/5010047/183239456-7096ea66-6fca-4c84-8f46-8c42d10b686a.png) > > Thanks for your patience. Yi Yang has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains four commits: - review from tobias - Merge branch 'master' into gvn_crash - fix - 8288204 GVN Crash: assert() failed: correct memory chain ------------- Changes: https://git.openjdk.org/jdk/pull/9777/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9777&range=03 Stats: 96 lines in 4 files changed: 82 ins; 8 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/9777.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9777/head:pull/9777 PR: https://git.openjdk.org/jdk/pull/9777 From thartmann at openjdk.org Fri Dec 23 07:29:48 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 23 Dec 2022 07:29:48 GMT Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v4] In-Reply-To: References: Message-ID: On Fri, 23 Dec 2022 06:32:35 GMT, Yi Yang wrote: >> Hi can I have a review for this fix? LoadBNode::Ideal crashes after performing GVN right after EA. The bad IR is as follows: >> >> ![image](https://user-images.githubusercontent.com/5010047/183106710-3a518e5e-0b59-4c3c-aba4-8b6fcade3519.png) >> >> The memory input of Load#971 is Phi#1109 and the address input of Load#971 is AddP whose object base is CheckCastPP#335: >> >> The type of Phi#1109 is `byte[int:>=0]:exact+any *` while `byte[int:8]:NotNull:exact+any *,iid=177` is the type of CheckCastPP#335 due to EA, they have different alias index, that's why we hit the assertion at L226: >> >> https://github.com/openjdk/jdk/blob/b17a745d7f55941f02b0bdde83866aa5d32cce07/src/hotspot/share/opto/memnode.cpp#L207-L226 >> (t is `byte[int:>=0]:exact+any *`, t_adr is `byte[int:8]:NotNull:exact+any *,iid=177`). >> >> There is a long story. 
In the beginning, LoadB#971 is generated at array_copy_forward, and GVN transformed it iteratively: >> >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> ... 
>> >> In this case, we get alias index 5 from address input AddP#969, and step it through MergeMem#1046, we found Phi#1109 then, that's why LoadB->in(Mem) is changed from MergeMem#1046 to Phi#1109 (Which finally leads to crash). >> >> 1046 MergeMem === _ 1 160 389 389 1109 1 1 389 1 1 1 1 1 1 1 1 1 1 1 1 1 709 709 709 709 882 888 894 190 190 912 191 [[ 1025 1021 1017 1013 1009 1005 1002 1001 998 996 991 986 981 976 971 966 962 961 960 121 122 123 124 1027 ]] >> >> >> After applying this patch, some related nodes are pushed into the GVN worklist, before stepping through MergeMem#1046, the address input is already changed to AddP#473. i.e., we get alias index 32 from address input AddP#473, and step it through MergeMem#1046, we found StoreB#191 then,LoadB->in(Mem) is changed from MergeMem#1046 to StoreB#191. >> >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1046 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) 
DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 
191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 390 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> ... >> >> The well-formed IR looks like this: >> ![image](https://user-images.githubusercontent.com/5010047/183239456-7096ea66-6fca-4c84-8f46-8c42d10b686a.png) >> >> Thanks for your patience. 
> > Yi Yang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: > > - review from tobias > - Merge branch 'master' into gvn_crash > - fix > - 8288204 GVN Crash: assert() failed: correct memory chain Thanks for making these changes. Several tests (for example, compiler/arraycopy/TestArrayCopyAsLoadsStores.java) are now failing with: # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (workspace/open/src/hotspot/share/opto/phaseX.cpp:843), pid=3983761, tid=3983777 # assert(i->_idx >= k->_idx) failed: Idealize should return new nodes, use Identity to return old nodes # # JRE version: Java(TM) SE Runtime Environment (21.0) (fastdebug build 21-internal-LTS-2022-12-23-0641545.tobias.hartmann.jdk2) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 21-internal-LTS-2022-12-23-0641545.tobias.hartmann.jdk2, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) # Problematic frame: # V [libjvm.so+0x179ad0c] PhaseGVN::transform_no_reclaim(Node*)+0xec Current CompileTask: C2: 2222 478 b 4 compiler.arraycopy.TestArrayCopyAsLoadsStores::m14 (9 bytes) Stack: [0x00007f86dc7f5000,0x00007f86dc8f6000], sp=0x00007f86dc8f20e0, free space=1012k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x179ad0c] PhaseGVN::transform_no_reclaim(Node*)+0xec (phaseX.cpp:843) V [libjvm.so+0x141be0f] LibraryCallKit::inline_arraycopy()+0x71f (library_call.cpp:5289) V [libjvm.so+0x1438712] LibraryIntrinsic::generate(JVMState*)+0x302 (library_call.cpp:115) V [libjvm.so+0xcbfbe9] Parse::do_call()+0x389 (doCall.cpp:662) V [libjvm.so+0x176c5f8] Parse::do_one_bytecode()+0x638 (parse2.cpp:2704) V [libjvm.so+0x175a734] Parse::do_one_block()+0x844 (parse1.cpp:1555) V [libjvm.so+0x175b697] Parse::do_all_blocks()+0x137 (parse1.cpp:707) V [libjvm.so+0x176021d] Parse::Parse(JVMState*, ciMethod*, float)+0xb3d (parse1.cpp:614) V 
[libjvm.so+0x918c40] ParseGenerator::generate(JVMState*)+0x110 (callGenerator.cpp:99) V [libjvm.so+0xb0275d] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x168d (compile.cpp:760) V [libjvm.so+0x916857] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x4e7 (c2compiler.cpp:113) V [libjvm.so+0xb0fa2c] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xa7c (compileBroker.cpp:2237) V [libjvm.so+0xb107e8] CompileBroker::compiler_thread_loop()+0x5d8 (compileBroker.cpp:1916) V [libjvm.so+0x107d066] JavaThread::thread_main_inner()+0x206 (javaThread.cpp:709) V [libjvm.so+0x1a723c0] Thread::call_run()+0x100 (thread.cpp:224) V [libjvm.so+0x1712553] thread_native_entry(Thread*)+0x103 (os_linux.cpp:739) ------------- Changes requested by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9777 From yyang at openjdk.org Fri Dec 23 08:31:36 2022 From: yyang at openjdk.org (Yi Yang) Date: Fri, 23 Dec 2022 08:31:36 GMT Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v5] In-Reply-To: References: Message-ID: > Hi can I have a review for this fix? LoadBNode::Ideal crashes after performing GVN right after EA. The bad IR is as follows: > > ![image](https://user-images.githubusercontent.com/5010047/183106710-3a518e5e-0b59-4c3c-aba4-8b6fcade3519.png) > > The memory input of Load#971 is Phi#1109 and the address input of Load#971 is AddP whose object base is CheckCastPP#335: > > The type of Phi#1109 is `byte[int:>=0]:exact+any *` while `byte[int:8]:NotNull:exact+any *,iid=177` is the type of CheckCastPP#335 due to EA, they have different alias index, that's why we hit the assertion at L226: > > https://github.com/openjdk/jdk/blob/b17a745d7f55941f02b0bdde83866aa5d32cce07/src/hotspot/share/opto/memnode.cpp#L207-L226 > (t is `byte[int:>=0]:exact+any *`, t_adr is `byte[int:8]:NotNull:exact+any *,iid=177`). > > There is a long story. 
In the beginning, LoadB#971 is generated at array_copy_forward, and GVN transformed it iteratively: > > 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1109 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > ... 
> > In this case, we get alias index 5 from address input AddP#969, and step it through MergeMem#1046, we found Phi#1109 then, that's why LoadB->in(Mem) is changed from MergeMem#1046 to Phi#1109 (Which finally leads to crash). > > 1046 MergeMem === _ 1 160 389 389 1109 1 1 389 1 1 1 1 1 1 1 1 1 1 1 1 1 709 709 709 709 882 888 894 190 190 912 191 [[ 1025 1021 1017 1013 1009 1005 1002 1001 998 996 991 986 981 976 971 966 962 961 960 121 122 123 124 1027 ]] > > > After applying this patch, some related nodes are pushed into the GVN worklist, before stepping through MergeMem#1046, the address input is already changed to AddP#473. i.e., we get alias index 32 from address input AddP#473, and step it through MergeMem#1046, we found StoreB#191 then,LoadB->in(Mem) is changed from MergeMem#1046 to StoreB#191. > > 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1046 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) 
DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 
[[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 390 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > ... > > The well-formed IR looks like this: > ![image](https://user-images.githubusercontent.com/5010047/183239456-7096ea66-6fca-4c84-8f46-8c42d10b686a.png) > > Thanks for your patience. 
Yi Yang has updated the pull request incrementally with one additional commit since the last revision: revert changes in array_copy_forward ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9777/files - new: https://git.openjdk.org/jdk/pull/9777/files/9bd483d7..04c082b8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9777&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9777&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9777.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9777/head:pull/9777 PR: https://git.openjdk.org/jdk/pull/9777 From thartmann at openjdk.org Fri Dec 23 08:32:00 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 23 Dec 2022 08:32:00 GMT Subject: RFR: JDK-8298813: [C2] Converting double to float cause a loss of precision and resulting crypto.aes scores fluctuate In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 01:33:35 GMT, SUN Guoyun wrote: >> Tests passed and performance results looks good. I think the other change that you proposed should not be part of this. > > @TobiHartmann Thank you for your review. I have one more question for you, How did you test SPECjvm2008 performance? take the maximum or average value of multiple test results? @sunny868 We take the average value of multiple test runs. ------------- PR: https://git.openjdk.org/jdk/pull/11685 From yyang at openjdk.org Fri Dec 23 08:33:50 2022 From: yyang at openjdk.org (Yi Yang) Date: Fri, 23 Dec 2022 08:33:50 GMT Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v4] In-Reply-To: References: Message-ID: On Fri, 23 Dec 2022 07:26:39 GMT, Tobias Hartmann wrote: > Thanks for making these changes. 
> > Several tests (for example, compiler/arraycopy/TestArrayCopyAsLoadsStores.java) are now failing with: > > ``` > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (workspace/open/src/hotspot/share/opto/phaseX.cpp:843), pid=3983761, tid=3983777 > # assert(i->_idx >= k->_idx) failed: Idealize should return new nodes, use Identity to return old nodes > # > # JRE version: Java(TM) SE Runtime Environment (21.0) (fastdebug build 21-internal-LTS-2022-12-23-0641545.tobias.hartmann.jdk2) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 21-internal-LTS-2022-12-23-0641545.tobias.hartmann.jdk2, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) > # Problematic frame: > # V [libjvm.so+0x179ad0c] PhaseGVN::transform_no_reclaim(Node*)+0xec > > Current CompileTask: > C2: 2222 478 b 4 compiler.arraycopy.TestArrayCopyAsLoadsStores::m14 (9 bytes) > > Stack: [0x00007f86dc7f5000,0x00007f86dc8f6000], sp=0x00007f86dc8f20e0, free space=1012k > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) > V [libjvm.so+0x179ad0c] PhaseGVN::transform_no_reclaim(Node*)+0xec (phaseX.cpp:843) > V [libjvm.so+0x141be0f] LibraryCallKit::inline_arraycopy()+0x71f (library_call.cpp:5289) > V [libjvm.so+0x1438712] LibraryIntrinsic::generate(JVMState*)+0x302 (library_call.cpp:115) > V [libjvm.so+0xcbfbe9] Parse::do_call()+0x389 (doCall.cpp:662) > V [libjvm.so+0x176c5f8] Parse::do_one_bytecode()+0x638 (parse2.cpp:2704) > V [libjvm.so+0x175a734] Parse::do_one_block()+0x844 (parse1.cpp:1555) > V [libjvm.so+0x175b697] Parse::do_all_blocks()+0x137 (parse1.cpp:707) > V [libjvm.so+0x176021d] Parse::Parse(JVMState*, ciMethod*, float)+0xb3d (parse1.cpp:614) > V [libjvm.so+0x918c40] ParseGenerator::generate(JVMState*)+0x110 (callGenerator.cpp:99) > V [libjvm.so+0xb0275d] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x168d (compile.cpp:760) > V [libjvm.so+0x916857] 
C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x4e7 (c2compiler.cpp:113) > V [libjvm.so+0xb0fa2c] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xa7c (compileBroker.cpp:2237) > V [libjvm.so+0xb107e8] CompileBroker::compiler_thread_loop()+0x5d8 (compileBroker.cpp:1916) > V [libjvm.so+0x107d066] JavaThread::thread_main_inner()+0x206 (javaThread.cpp:709) > V [libjvm.so+0x1a723c0] Thread::call_run()+0x100 (thread.cpp:224) > V [libjvm.so+0x1712553] thread_native_entry(Thread*)+0x103 (os_linux.cpp:739) > ``` Commenting out transformation in array_copy_forward works now, all test under test/hotspot/jtreg/compiler passed except tests that always failed. PhaseGVN::transform_no_reclaim still crashes when reverting this patch and only adding mm->transform in array_copy_forward, so at least this is not related to this fix. Though, I don't see why it causes the crash at first glance.. ------------- PR: https://git.openjdk.org/jdk/pull/9777 From duke at openjdk.org Fri Dec 23 09:30:57 2022 From: duke at openjdk.org (SUN Guoyun) Date: Fri, 23 Dec 2022 09:30:57 GMT Subject: RFR: JDK-8298813: [C2] Converting double to float cause a loss of precision and resulting crypto.aes scores fluctuate In-Reply-To: References: Message-ID: <2kU6jzkNwME9OGXbdI0CKEVFcajbjZeRoB9MpEzRVV0=.1c3d715b-d698-43a6-858c-4a3119e99939@github.com> On Tue, 20 Dec 2022 01:33:35 GMT, SUN Guoyun wrote: >> Tests passed and performance results looks good. I think the other change that you proposed should not be part of this. > > @TobiHartmann Thank you for your review. I have one more question for you, How did you test SPECjvm2008 performance? take the maximum or average value of multiple test results? > @sunny868 We take the average value of multiple test runs. How many times do I need to run SPECjvm2008? And is this average from all jvm2008 benchmarks or a single one? Are there any documentation or websites about performance testing for my reference? 
------------- PR: https://git.openjdk.org/jdk/pull/11685 From thartmann at openjdk.org Fri Dec 23 10:00:55 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 23 Dec 2022 10:00:55 GMT Subject: RFR: JDK-8298813: [C2] Converting double to float cause a loss of precision and resulting crypto.aes scores fluctuate [v2] In-Reply-To: References: Message-ID: On Mon, 19 Dec 2022 07:44:51 GMT, SUN Guoyun wrote: >> Hi all, >> For C2, convert double to float cause a loss of precision, >> >>

>> ./chaitin.cpp:221
>> _high_frequency_lrg = MIN2(double(OPTO_LRG_HIGH_FREQ), _cfg.get_outer_loop_frequency());
>> 
>> >> Here, _high_frequency_lrg type is float, so maybe has a loss of precision. when it be used: >> >>

>> ./coalesce.cpp:379
>> if( lrg._maxfreq >= _phc.high_frequency_lrg() ) {
>>    ...
>> }
>> 
>> Here, lrg._maxfreq type is double, so _high_frequency_lrg will be convert double again. But now, due to the loss of precision of _high_frequency_lrg, the conditions here may be true or false. >> >> There are two cases that I tested for SPECjvm2008 crypto.aes. >> case 1: >>

>> //chaitin.cpp:221
>> // fcvt.s.d $f0,$f0 #double->float
>> d = 16.994714324523816
>> f = 16.9947147
>> 
>> //coalesce.cpp:379
>> // fcvt.d.s $f0,$f0 #float->double
>> // fcmp.sle.d $fcc2,$f0,$f1
>> (gdb) i r fa0
>> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.994714736938477}
>> (gdb) i r fa1
>> fa1 {f = 0x0, d = 0x10} {f = -7.68722312e-24, d = 16.994714324523816}
>> 
>> >> case2: >>

>> //chaitin.cpp:221
>> // fcvt.s.d $f0,$f0
>> d = 16.996332681816536
>> f = 16.9963322
>> 
>> //coalesce.cpp
>> // fcvt.d.s $f0,$f0
>> // fcmp.sle.d $fcc2,$f0,$f1
>> (gdb) i r fa0
>> fa0 {f = 0x0, d = 0x10} {f = -1.08420217e-19, d = 16.996332168579102}
>> (gdb) i r fa1
>> fa1 {f = 0x0, d = 0x10} {f = -1.73570044e-14, d = 16.996332681816536}
>> 
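The round trip in the two cases above can be reproduced outside the VM. The following is only a minimal Java sketch, not the HotSpot code: the class name `FloatRoundTrip` and the helpers `roundTrip` and `reachesThreshold` are hypothetical names introduced for illustration, and the two sample values are the `d` values from the gdb sessions quoted above. It assumes the default round-to-nearest narrowing that Java's `(float)` cast performs, and compares a frequency that is exactly equal to the original double threshold.

```java
public class FloatRoundTrip {
    // Same shape as the check quoted from coalesce.cpp: maxFreq >= threshold.
    static boolean reachesThreshold(double maxFreq, double threshold) {
        return maxFreq >= threshold;
    }

    // Narrow to float and widen back, as happens when a double result is
    // stored in a float field and later compared against a double.
    static double roundTrip(double d) {
        return (double) (float) d;
    }

    public static void main(String[] args) {
        // The two double values reported in case 1 and case 2 above.
        double[] samples = {16.994714324523816, 16.996332681816536};
        for (double d : samples) {
            double back = roundTrip(d);
            System.out.println(d + " -> " + back
                + " | threshold kept as float: " + reachesThreshold(d, back)
                + " | threshold kept as double: " + reachesThreshold(d, d));
        }
    }
}
```

Whether the narrowed threshold lands above or below the original value depends only on which neighbouring float is closer, so the outcome of the comparison can differ between the two samples, while a threshold kept as a double compares equal to itself in both.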
>> >> The above two cases result in different block generation (case 2 can insert new SpillCopyNodes), and the resulting score on crypto.aes fluctuates. >> >> This is a patch to fix this problem. Please help review it. >> >> Thanks, >> Sun Guoyun > > SUN Guoyun has updated the pull request incrementally with one additional commit since the last revision: > > 8298813: [C2] Converting double to float cause a loss of precision and resulting crypto.aes scores fluctuate We run the SPECjvm2008 benchmarks individually and gather different results (average, low/high, statistical significance, ...). I think the exact configuration differs depending on the platform and benchmark. Since this is part of our internal performance testing infrastructure, no public documentation is available. ------------- PR: https://git.openjdk.org/jdk/pull/11685 From eosterlund at openjdk.org Fri Dec 23 10:08:29 2022 From: eosterlund at openjdk.org (Erik Österlund) Date: Fri, 23 Dec 2022 10:08:29 GMT Subject: RFR: 8299308: Add Assembler::testw register + immediate function for x86 Message-ID: <1jRKma2sJgUGnq0DZC7aN0AyB-rT5o_xQpXoyhTmE7Y=.f23a9971-5064-449a-9482-6adc4eaaea5e@github.com> The testw register + immediate instruction is missing in the x86 assembler. It's used by generational ZGC. Let's add it. 
------------- Commit messages: - 8299308: Add Assembler::testw register + immediate function for x86 Changes: https://git.openjdk.org/jdk/pull/11772/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11772&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8299308 Stats: 15 lines in 2 files changed: 15 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11772.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11772/head:pull/11772 PR: https://git.openjdk.org/jdk/pull/11772 From thartmann at openjdk.org Fri Dec 23 10:23:49 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 23 Dec 2022 10:23:49 GMT Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v5] In-Reply-To: References: Message-ID: On Fri, 23 Dec 2022 08:31:36 GMT, Yi Yang wrote: >> Hi can I have a review for this fix? LoadBNode::Ideal crashes after performing GVN right after EA. The bad IR is as follows: >> >> ![image](https://user-images.githubusercontent.com/5010047/183106710-3a518e5e-0b59-4c3c-aba4-8b6fcade3519.png) >> >> The memory input of Load#971 is Phi#1109 and the address input of Load#971 is AddP whose object base is CheckCastPP#335: >> >> The type of Phi#1109 is `byte[int:>=0]:exact+any *` while `byte[int:8]:NotNull:exact+any *,iid=177` is the type of CheckCastPP#335 due to EA, they have different alias index, that's why we hit the assertion at L226: >> >> https://github.com/openjdk/jdk/blob/b17a745d7f55941f02b0bdde83866aa5d32cce07/src/hotspot/share/opto/memnode.cpp#L207-L226 >> (t is `byte[int:>=0]:exact+any *`, t_adr is `byte[int:8]:NotNull:exact+any *,iid=177`). >> >> There is a long story. 
In the beginning, LoadB#971 is generated at array_copy_forward, and GVN transformed it iteratively: >> >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> ... 
>> >> In this case, we get alias index 5 from address input AddP#969, and step it through MergeMem#1046, we found Phi#1109 then, that's why LoadB->in(Mem) is changed from MergeMem#1046 to Phi#1109 (Which finally leads to crash). >> >> 1046 MergeMem === _ 1 160 389 389 1109 1 1 389 1 1 1 1 1 1 1 1 1 1 1 1 1 709 709 709 709 882 888 894 190 190 912 191 [[ 1025 1021 1017 1013 1009 1005 1002 1001 998 996 991 986 981 976 971 966 962 961 960 121 122 123 124 1027 ]] >> >> >> After applying this patch, some related nodes are pushed into the GVN worklist, before stepping through MergeMem#1046, the address input is already changed to AddP#473. i.e., we get alias index 32 from address input AddP#473, and step it through MergeMem#1046, we found StoreB#191 then,LoadB->in(Mem) is changed from MergeMem#1046 to StoreB#191. >> >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1046 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) 
DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 
191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 390 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> ... >> >> The well-formed IR looks like this: >> ![image](https://user-images.githubusercontent.com/5010047/183239456-7096ea66-6fca-4c84-8f46-8c42d10b686a.png) >> >> Thanks for your patience. 
> > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > revert changes in array_copy_forward Looks good to me. Tests now pass (still running). Would still be good to know why the additional transform call is an issue. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9777 From qamai at openjdk.org Fri Dec 23 10:35:48 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 23 Dec 2022 10:35:48 GMT Subject: RFR: 8299308: Add Assembler::testw register + immediate function for x86 In-Reply-To: <1jRKma2sJgUGnq0DZC7aN0AyB-rT5o_xQpXoyhTmE7Y=.f23a9971-5064-449a-9482-6adc4eaaea5e@github.com> References: <1jRKma2sJgUGnq0DZC7aN0AyB-rT5o_xQpXoyhTmE7Y=.f23a9971-5064-449a-9482-6adc4eaaea5e@github.com> Message-ID: On Fri, 23 Dec 2022 10:00:28 GMT, Erik Österlund wrote: > The testw register + immediate instruction is missing in the x86 assembler. It's used by generational ZGC. Let's add it. Is it really needed? `testw r, i16` carries the 0x66 prefix, which acts as a length-changing prefix and will bottleneck the predecoder. It is better to zero-extend to a 32-bit value and use `testl r, i32`. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/11772 From eosterlund at openjdk.org Fri Dec 23 10:44:05 2022 From: eosterlund at openjdk.org (Erik Österlund) Date: Fri, 23 Dec 2022 10:44:05 GMT Subject: RFR: 8299308: Add Assembler::testw register + immediate function for x86 In-Reply-To: References: <1jRKma2sJgUGnq0DZC7aN0AyB-rT5o_xQpXoyhTmE7Y=.f23a9971-5064-449a-9482-6adc4eaaea5e@github.com> Message-ID: <33MlxtobwB0pYzdu3TO5IMB2y5EpY9gsf8SDW-qMR0Q=.fdeb1f52-9d12-4375-9457-3e886964f57a@github.com> On Fri, 23 Dec 2022 10:32:53 GMT, Quan Anh Mai wrote: > Is it really needed? `testw r, i16` carries the 0x66 prefix, which acts as a length-changing prefix and will bottleneck the predecoder. It is better to zero-extend to a 32-bit value and use `testl r, i32`. > > Thanks. Interesting. 
I did indeed run into strange perf issues with testw with an Address operand and ultimately changed to testl. Where this code was invoked, I thought I cared a bit more about the footprint of the generated code and less about getting optimal performance. But maybe I should just nuke it and use testl there as well. ------------- PR: https://git.openjdk.org/jdk/pull/11772 From jwilhelm at openjdk.org Fri Dec 23 10:56:14 2022 From: jwilhelm at openjdk.org (Jesper Wilhelmsson) Date: Fri, 23 Dec 2022 10:56:14 GMT Subject: RFR: Merge jdk20 Message-ID: Forwardport JDK 20 -> JDK 21 ------------- Commit messages: - Merge remote-tracking branch 'jdk20/master' into Merge_jdk20 - 8299237: add ArraysSupport.newLength test to a test group - 8299230: Use https: in links - 8299015: Ensure that HttpResponse.BodySubscribers.ofFile writes all bytes - 8299207: [Testbug] Add back test/jdk/java/awt/Graphics2D/DrawPrimitivesTest.java - 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears - 8299077: [REDO] JDK-4512626 Non-editable JTextArea provides no visual indication of keyboard focus The webrevs contain the adjustments done while merging with regards to each parent branch: - master: https://webrevs.openjdk.org/?repo=jdk&pr=11773&range=00.0 - jdk20: https://webrevs.openjdk.org/?repo=jdk&pr=11773&range=00.1 Changes: https://git.openjdk.org/jdk/pull/11773/files Stats: 572 lines in 15 files changed: 407 ins; 76 del; 89 mod Patch: https://git.openjdk.org/jdk/pull/11773.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11773/head:pull/11773 PR: https://git.openjdk.org/jdk/pull/11773 From yyang at openjdk.org Fri Dec 23 11:01:52 2022 From: yyang at openjdk.org (Yi Yang) Date: Fri, 23 Dec 2022 11:01:52 GMT Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v5] In-Reply-To: References: Message-ID: On Fri, 23 Dec 2022 10:21:29 GMT, Tobias Hartmann wrote: > Looks good to me. Tests now pass (still running). 
Would still be good to know why the additional transform call is an issue. Yes, that's somewhat strange. I'm looking more into that and will update in this PR later. ------------- PR: https://git.openjdk.org/jdk/pull/9777 From yyang at openjdk.org Fri Dec 23 11:05:56 2022 From: yyang at openjdk.org (Yi Yang) Date: Fri, 23 Dec 2022 11:05:56 GMT Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v5] In-Reply-To: References: Message-ID: On Fri, 23 Dec 2022 08:31:36 GMT, Yi Yang wrote: >> Hi can I have a review for this fix? LoadBNode::Ideal crashes after performing GVN right after EA. The bad IR is as follows: >> >> ![image](https://user-images.githubusercontent.com/5010047/183106710-3a518e5e-0b59-4c3c-aba4-8b6fcade3519.png) >> >> The memory input of Load#971 is Phi#1109 and the address input of Load#971 is AddP whose object base is CheckCastPP#335: >> >> The type of Phi#1109 is `byte[int:>=0]:exact+any *` while `byte[int:8]:NotNull:exact+any *,iid=177` is the type of CheckCastPP#335 due to EA, they have different alias index, that's why we hit the assertion at L226: >> >> https://github.com/openjdk/jdk/blob/b17a745d7f55941f02b0bdde83866aa5d32cce07/src/hotspot/share/opto/memnode.cpp#L207-L226 >> (t is `byte[int:>=0]:exact+any *`, t_adr is `byte[int:8]:NotNull:exact+any *,iid=177`). >> >> There is a long story. 
In the beginning, LoadB#971 is generated at array_copy_forward, and GVN transformed it iteratively: >> >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> ... 
>> >> In this case, we get alias index 5 from address input AddP#969, and step it through MergeMem#1046, we found Phi#1109 then, that's why LoadB->in(Mem) is changed from MergeMem#1046 to Phi#1109 (Which finally leads to crash). >> >> 1046 MergeMem === _ 1 160 389 389 1109 1 1 389 1 1 1 1 1 1 1 1 1 1 1 1 1 709 709 709 709 882 888 894 190 190 912 191 [[ 1025 1021 1017 1013 1009 1005 1002 1001 998 996 991 986 981 976 971 966 962 961 960 121 122 123 124 1027 ]] >> >> >> After applying this patch, some related nodes are pushed into the GVN worklist, before stepping through MergeMem#1046, the address input is already changed to AddP#473. i.e., we get alias index 32 from address input AddP#473, and step it through MergeMem#1046, we found StoreB#191 then,LoadB->in(Mem) is changed from MergeMem#1046 to StoreB#191. >> >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1046 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) 
DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 
191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 390 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> ... >> >> The well-formed IR looks like this: >> ![image](https://user-images.githubusercontent.com/5010047/183239456-7096ea66-6fca-4c84-8f46-8c42d10b686a.png) >> >> Thanks for your patience. 
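The slice lookup described in the quoted analysis can be replayed on the printed inputs of MergeMem#1046. The following is a minimal sketch, not HotSpot code: it only indexes the input list by alias index. The fallback from an "empty" slice (printed as node 1) to the base memory at input 2 is an assumption of this sketch about how C2 lays out MergeMem inputs, and the class and method names are hypothetical.

```java
import java.util.Arrays;

final class MergeMemSliceSketch {
    // Inputs of MergeMem#1046, copied from the dump; "_" is the unused in(0).
    static final String DUMP_INPUTS =
        "_ 1 160 389 389 1109 1 1 389 1 1 1 1 1 1 1 1 1 1 1 1 1 "
      + "709 709 709 709 882 888 894 190 190 912 191";

    // Assumptions of this sketch: node 1 marks an empty slice, in(2) is base memory.
    static final int EMPTY = 1;
    static final int BASE_MEMORY_INDEX = 2;

    // Turns the printed input list into node ids (-1 for the unused slot).
    static int[] parseInputs(String dump) {
        return Arrays.stream(dump.trim().split("\\s+"))
                     .mapToInt(tok -> tok.equals("_") ? -1 : Integer.parseInt(tok))
                     .toArray();
    }

    // Node id of the memory state that a load with this alias index would pick.
    static int memoryAt(int[] inputs, int aliasIdx) {
        int node = inputs[aliasIdx];
        return node == EMPTY ? inputs[BASE_MEMORY_INDEX] : node;
    }

    public static void main(String[] args) {
        int[] in = parseInputs(DUMP_INPUTS);
        System.out.println("alias index 5  -> node " + memoryAt(in, 5));   // Phi#1109
        System.out.println("alias index 32 -> node " + memoryAt(in, 32));  // StoreB#191
    }
}
```

With the stale address AddP#969 the load is looked up under alias index 5 and lands on Phi#1109; once the address is AddP#473 the alias index is 32 and the same MergeMem yields StoreB#191, which matches the two traces in the description.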
> > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > revert changes in array_copy_forward > > Looks good to me. Tests now pass (still running). Would still be good to know why the additional transform call is an issue. > > Yes, that's somewhat strange. I'm looking more into that and will update in this PR later (or file a new issue if it's really a bug). ------------- PR: https://git.openjdk.org/jdk/pull/9777 From eosterlund at openjdk.org Fri Dec 23 11:21:52 2022 From: eosterlund at openjdk.org (Erik Österlund) Date: Fri, 23 Dec 2022 11:21:52 GMT Subject: RFR: 8299308: Add Assembler::testw register + immediate function for x86 In-Reply-To: <1jRKma2sJgUGnq0DZC7aN0AyB-rT5o_xQpXoyhTmE7Y=.f23a9971-5064-449a-9482-6adc4eaaea5e@github.com> References: <1jRKma2sJgUGnq0DZC7aN0AyB-rT5o_xQpXoyhTmE7Y=.f23a9971-5064-449a-9482-6adc4eaaea5e@github.com> Message-ID: On Fri, 23 Dec 2022 10:00:28 GMT, Erik Österlund wrote: > The testw register + immediate instruction is missing in the x86 assembler. It's used by generational ZGC. Let's add it. I'll proceed with testl. Cheers! ------------- PR: https://git.openjdk.org/jdk/pull/11772 From eosterlund at openjdk.org Fri Dec 23 11:21:55 2022 From: eosterlund at openjdk.org (Erik Österlund) Date: Fri, 23 Dec 2022 11:21:55 GMT Subject: Withdrawn: 8299308: Add Assembler::testw register + immediate function for x86 In-Reply-To: <1jRKma2sJgUGnq0DZC7aN0AyB-rT5o_xQpXoyhTmE7Y=.f23a9971-5064-449a-9482-6adc4eaaea5e@github.com> References: <1jRKma2sJgUGnq0DZC7aN0AyB-rT5o_xQpXoyhTmE7Y=.f23a9971-5064-449a-9482-6adc4eaaea5e@github.com> Message-ID: On Fri, 23 Dec 2022 10:00:28 GMT, Erik Österlund wrote: > The testw register + immediate instruction is missing in the x86 assembler. It's used by generational ZGC. Let's add it. This pull request has been closed without being integrated. 
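For readers following the testw versus testl exchange on PR 11772, the trade-off can be made concrete by laying out the two register + immediate encodings side by side. This is a rough sketch under stated assumptions: it is not the HotSpot `Assembler` code, the class and method names are hypothetical, it uses the generic `F7 /0` form only (the shorter accumulator form is ignored), and registers are passed as plain hardware numbers 0..15.

```java
import java.io.ByteArrayOutputStream;

final class TestImmEncodingSketch {
    // Emits REX.B (0x41) when the register number needs a fourth bit (r8..r15).
    private static void rexIfExtended(ByteArrayOutputStream out, int reg) {
        if (reg >= 8) out.write(0x41);
    }

    // test r16, imm16: 66 [REX] F7 /0 iw. The 0x66 prefix shrinks the immediate to 2 bytes.
    static byte[] testw(int reg, int imm16) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x66);
        rexIfExtended(out, reg);
        out.write(0xF7);
        out.write(0xC0 | (reg & 7));          // ModRM: mod=11, reg field=/0, rm=register
        out.write(imm16 & 0xFF);              // immediate, little-endian
        out.write((imm16 >> 8) & 0xFF);
        return out.toByteArray();
    }

    // test r32, imm32: [REX] F7 /0 id. No prefix, 4-byte immediate.
    static byte[] testl(int reg, int imm32) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        rexIfExtended(out, reg);
        out.write(0xF7);
        out.write(0xC0 | (reg & 7));
        for (int shift = 0; shift < 32; shift += 8) {
            out.write((imm32 >> shift) & 0xFF);
        }
        return out.toByteArray();
    }

    static String hex(byte[] code) {
        StringBuilder sb = new StringBuilder();
        for (byte b : code) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int rcx = 1;
        System.out.println("testw: " + hex(testw(rcx, 0x0F00)));  // 66 F7 C1 00 0F
        System.out.println("testl: " + hex(testl(rcx, 0x0F00)));  // F7 C1 00 0F 00 00
    }
}
```

The 16-bit form is one byte shorter, which is the footprint argument made in the thread, but its 0x66 prefix changes the immediate length, which is the predecoder concern Quan Anh Mai raises. Zero-extending the mask and using the 32-bit form avoids that prefix at the cost of the extra byte.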
------------- PR: https://git.openjdk.org/jdk/pull/11772 From jwilhelm at openjdk.org Fri Dec 23 11:28:55 2022 From: jwilhelm at openjdk.org (Jesper Wilhelmsson) Date: Fri, 23 Dec 2022 11:28:55 GMT Subject: Integrated: Merge jdk20 In-Reply-To: References: Message-ID: On Fri, 23 Dec 2022 10:48:19 GMT, Jesper Wilhelmsson wrote: > Forwardport JDK 20 -> JDK 21 This pull request has now been integrated. Changeset: 19ce23c6 Author: Jesper Wilhelmsson URL: https://git.openjdk.org/jdk/commit/19ce23c6459a452c8d3856b9ed96bfa54a8346ae Stats: 572 lines in 15 files changed: 407 ins; 76 del; 89 mod Merge ------------- PR: https://git.openjdk.org/jdk/pull/11773 From xlinzheng at openjdk.org Fri Dec 23 11:56:57 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Fri, 23 Dec 2022 11:56:57 GMT Subject: Integrated: 8299172: RISC-V: [TESTBUG] Fix stack alignment logic in jvmci RISCV64TestAssembler.java In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 06:13:04 GMT, Xiaolin Zheng wrote: > We observed a failure in JVMCI tests after `-ea -esa` was turned on when running `./test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java`. > > The failure is at line [1]. > > > java.lang.AssertionError > at jdk.vm.ci.code.test.riscv64.RISCV64TestAssembler.emitGrowStack(RISCV64TestAssembler.java:203) > at jdk.vm.ci.code.test.riscv64.RISCV64TestAssembler.emitCallPrologue(RISCV64TestAssembler.java:239) > ... > ... > > > The failure output has been attached to the JBS issue link. > > In short, the stack should be aligned to `16`, and we can follow the logic in AArch64 [2] and x86_64 [3]. The x86_64 one is inside a recent change. > > Tested along with other patches, and the failed test passed. 
> > Thanks, > Xiaolin > > [1] https://github.com/openjdk/jdk/blob/f56285c3613bb127e22f544bd4b461a0584e9d2a/test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/riscv64/RISCV64TestAssembler.java#L193 > [2] https://github.com/openjdk/jdk/blob/f56285c3613bb127e22f544bd4b461a0584e9d2a/test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/aarch64/AArch64TestAssembler.java#L273-L285 > [3] https://github.com/openjdk/jdk/commit/277f0c24a2e186166bfe70fc93ba79aec10585aa This pull request has now been integrated. Changeset: da75de31 Author: Xiaolin Zheng Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/da75de31841e4b50477774e9efc4f554e1f3e4c0 Stats: 4 lines in 1 file changed: 0 ins; 2 del; 2 mod 8299172: RISC-V: [TESTBUG] Fix stack alignment logic in jvmci RISCV64TestAssembler.java Reviewed-by: fyang ------------- PR: https://git.openjdk.org/jdk/pull/11751 From eosterlund at openjdk.org Fri Dec 23 12:48:50 2022 From: eosterlund at openjdk.org (Erik Österlund) Date: Fri, 23 Dec 2022 12:48:50 GMT Subject: RFR: 8299323: Allow extended registers for cmpw Message-ID: The current instruction encoder for cmpw(Address, int16_t) on x64 does not allow REX-extended registers. Generational ZGC needs to use this for arbitrary registers. Let's add support for it instead of asserting that the input Address uses a subset of registers. 
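What "allowing REX-extended registers" means for a 16-bit compare against memory can be shown with a small encoder. This is only a sketch under assumptions, not the change in PR 11776 (whose diff is not shown here): the class and method names are hypothetical, the address is restricted to `[base + disp8]` with no index register, and the base is a plain hardware register number 0..15.

```java
import java.io.ByteArrayOutputStream;

final class CmpwRexSketch {
    // cmp word ptr [base + disp8], imm16: 66 [REX.B] 81 /7 [SIB] disp8 iw
    static byte[] cmpw(int base, int disp8, int imm16) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x66);                           // operand-size prefix goes before REX
        if (base >= 8) out.write(0x41);            // REX.B supplies the 4th bit of the base
        out.write(0x81);                           // group-1 opcode with a full-size immediate
        out.write(0x40 | (7 << 3) | (base & 7));   // ModRM: mod=01 (disp8), reg=/7 (cmp), rm=base
        if ((base & 7) == 4) out.write(0x24);      // rsp/r12 as base require a SIB byte
        out.write(disp8 & 0xFF);
        out.write(imm16 & 0xFF);                   // immediate, little-endian
        out.write((imm16 >> 8) & 0xFF);
        return out.toByteArray();
    }

    static String hex(byte[] code) {
        StringBuilder sb = new StringBuilder();
        for (byte b : code) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(cmpw(0, 0x10, 0x1234)));   // rax base: 66 81 78 10 34 12
        System.out.println(hex(cmpw(8, 0x10, 0x1234)));   // r8 base:  66 41 81 78 10 34 12
        System.out.println(hex(cmpw(12, 0x10, 0x1234)));  // r12 base: 66 41 81 7C 24 10 34 12
    }
}
```

The only difference between the rax and r8 cases is the extra REX byte, placed after the 0x66 prefix and before the opcode. An encoder that never emits it can only address the low eight registers, which is presumably why the old code had to assert on the registers used by the Address.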
------------- Commit messages: - 8299323: Allow extended registers for cmpw Changes: https://git.openjdk.org/jdk/pull/11776/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11776&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8299323 Stats: 3 lines in 1 file changed: 1 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11776.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11776/head:pull/11776 PR: https://git.openjdk.org/jdk/pull/11776 From eosterlund at openjdk.org Fri Dec 23 14:59:37 2022 From: eosterlund at openjdk.org (Erik Österlund) Date: Fri, 23 Dec 2022 14:59:37 GMT Subject: RFR: 8299327: Allow super late barrier expansion of store barriers in C2 Message-ID: ZGC uses super late barrier expansion for load barriers in C2. Generational ZGC needs to do the same thing but for store barriers. ------------- Commit messages: - 8299327: Allow super late barrier expansion of store barriers in C2 Changes: https://git.openjdk.org/jdk/pull/11779/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11779&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8299327 Stats: 23 lines in 8 files changed: 16 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/11779.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11779/head:pull/11779 PR: https://git.openjdk.org/jdk/pull/11779 From kvn at openjdk.org Fri Dec 23 16:43:52 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 23 Dec 2022 16:43:52 GMT Subject: RFR: 8299327: Allow super late barrier expansion of store barriers in C2 In-Reply-To: References: Message-ID: On Fri, 23 Dec 2022 14:29:53 GMT, Erik Österlund wrote: > ZGC uses super late barrier expansion for load barriers in C2. Generational ZGC needs to do the same thing but for store barriers. Seems fine. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11779 From kvn at openjdk.org Fri Dec 23 17:28:49 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 23 Dec 2022 17:28:49 GMT Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v5] In-Reply-To: References: Message-ID: On Fri, 23 Dec 2022 08:31:36 GMT, Yi Yang wrote: >> Hi can I have a review for this fix? LoadBNode::Ideal crashes after performing GVN right after EA. The bad IR is as follows: >> >> ![image](https://user-images.githubusercontent.com/5010047/183106710-3a518e5e-0b59-4c3c-aba4-8b6fcade3519.png) >> >> The memory input of Load#971 is Phi#1109 and the address input of Load#971 is AddP whose object base is CheckCastPP#335: >> >> The type of Phi#1109 is `byte[int:>=0]:exact+any *` while `byte[int:8]:NotNull:exact+any *,iid=177` is the type of CheckCastPP#335 due to EA, they have different alias index, that's why we hit the assertion at L226: >> >> https://github.com/openjdk/jdk/blob/b17a745d7f55941f02b0bdde83866aa5d32cce07/src/hotspot/share/opto/memnode.cpp#L207-L226 >> (t is `byte[int:>=0]:exact+any *`, t_adr is `byte[int:8]:NotNull:exact+any *,iid=177`). >> >> There is a long story. 
In the beginning, LoadB#971 is generated at array_copy_forward, and GVN transformed it iteratively: >> >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> ... 
>> >> In this case, we get alias index 5 from address input AddP#969, and step it through MergeMem#1046, we found Phi#1109 then, that's why LoadB->in(Mem) is changed from MergeMem#1046 to Phi#1109 (Which finally leads to crash). >> >> 1046 MergeMem === _ 1 160 389 389 1109 1 1 389 1 1 1 1 1 1 1 1 1 1 1 1 1 709 709 709 709 882 888 894 190 190 912 191 [[ 1025 1021 1017 1013 1009 1005 1002 1001 998 996 991 986 981 976 971 966 962 961 960 121 122 123 124 1027 ]] >> >> >> After applying this patch, some related nodes are pushed into the GVN worklist, before stepping through MergeMem#1046, the address input is already changed to AddP#473. i.e., we get alias index 32 from address input AddP#473, and step it through MergeMem#1046, we found StoreB#191 then,LoadB->in(Mem) is changed from MergeMem#1046 to StoreB#191. >> >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1046 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) 
DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 
191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 390 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> ... >> >> The well-formed IR looks like this: >> ![image](https://user-images.githubusercontent.com/5010047/183239456-7096ea66-6fca-4c84-8f46-8c42d10b686a.png) >> >> Thanks for your patience. 
> > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > revert changes in array_copy_forward Last update looks good. `MergeMemNode::Identity()` will return base memory if there is no real merge of memories. The failed test [m14](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/arraycopy/TestArrayCopyAsLoadsStores.java#L181) copies 0 elements. In such a case we don't generate loads/stores for the forward copy. And calling `transform` on the new `mm` will just return its base memory, which is an existing node. I think `ArrayCopyNode::Ideal()` did not take such a case into account. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9777 From kvn at openjdk.org Fri Dec 23 17:32:52 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 23 Dec 2022 17:32:52 GMT Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v5] In-Reply-To: References: Message-ID: <8NZhrZxqQ3JFbwsHArnIPShM3JeoIYKvAeSv24v0Ktc=.ed0a1f67-3382-451e-b32c-256dcf4f628b@github.com> On Fri, 23 Dec 2022 17:24:44 GMT, Vladimir Kozlov wrote: > `MergeMemNode::Identity()` will return base memory if there is no real merge of memories. The failed test [m14](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/arraycopy/TestArrayCopyAsLoadsStores.java#L181) copies 0 elements. In such a case we don't generate loads/stores for the forward copy. And calling `transform` on the new `mm` will just return its base memory, which is an existing node. > > I think `ArrayCopyNode::Ideal()` did not take such a case into account. In general, when you return a new node from some `Ideal()` method, you don't call `transform()` on it. You call `transform()` on the new nodes used to construct the returned node. 
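The convention stated in the last two sentences can be illustrated with a toy value-numbering driver. This is a deliberately simplified sketch, not C2: `Node`, `Gvn`, `ideal` and `transform` here are hypothetical stand-ins written for this example, and the only rewrite implemented is `(x + c1) + c2` to `x + (c1 + c2)`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IdealTransformSketch {
    // Toy IR node (hypothetical, not HotSpot's Node): opcode, constant payload, inputs.
    record Node(String op, int con, List<Node> in) {}

    static final class Gvn {
        private final Map<Node, Node> table = new HashMap<>();

        Node con(int value) {
            return transform(new Node("Con", value, List.of()));
        }

        Node add(Node a, Node b) {
            return transform(new Node("Add", 0, List.of(a, b)));
        }

        // Driver: idealize until nothing changes, then value-number the result.
        Node transform(Node n) {
            for (Node better = ideal(n); better != null; better = ideal(n)) {
                n = better;
            }
            Node existing = table.putIfAbsent(n, n);
            return existing != null ? existing : n;
        }

        // Rewrites (x + c1) + c2 into x + (c1 + c2); returns null when nothing applies.
        private Node ideal(Node n) {
            if (!n.op().equals("Add")) return null;
            Node left = n.in().get(0);
            Node right = n.in().get(1);
            boolean foldable = right.op().equals("Con")
                    && left.op().equals("Add")
                    && left.in().get(1).op().equals("Con");
            if (!foldable) return null;
            // Helper node used to build the result: transform it here.
            Node sum = con(left.in().get(1).con() + right.con());
            // The node being returned: leave it for the driver to transform.
            return new Node("Add", 0, List.of(left.in().get(0), sum));
        }
    }

    public static void main(String[] args) {
        Gvn gvn = new Gvn();
        Node x = gvn.transform(new Node("Parm", 0, List.of()));
        Node folded = gvn.add(gvn.add(x, gvn.con(1)), gvn.con(2));
        System.out.println(folded);
    }
}
```

In this toy, `ideal` hands back a fresh node and the driver's loop finishes the job, so the caller always knows whether it received a new node or a pre-existing one. Transforming the returned node inside `ideal` could collapse it into an existing node, which is roughly the situation described for the zero-length copy case, where transforming the new `mm` returns its base memory.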
------------- PR: https://git.openjdk.org/jdk/pull/9777 From eosterlund at openjdk.org Fri Dec 23 17:59:48 2022 From: eosterlund at openjdk.org (Erik Österlund) Date: Fri, 23 Dec 2022 17:59:48 GMT Subject: RFR: 8299327: Allow super late barrier expansion of store barriers in C2 In-Reply-To: References: Message-ID: On Fri, 23 Dec 2022 16:41:29 GMT, Vladimir Kozlov wrote: >> ZGC uses super late barrier expansion for load barriers in C2. Generational ZGC needs to do the same thing but for store barriers. > > Seems fine. Thanks for the review, @vnkozlov! ------------- PR: https://git.openjdk.org/jdk/pull/11779 From yyang at openjdk.org Mon Dec 26 02:18:58 2022 From: yyang at openjdk.org (Yi Yang) Date: Mon, 26 Dec 2022 02:18:58 GMT Subject: Integrated: 8288204: GVN Crash: assert() failed: correct memory chain In-Reply-To: References: Message-ID: On Fri, 5 Aug 2022 15:23:45 GMT, Yi Yang wrote: > Hi can I have a review for this fix? LoadBNode::Ideal crashes after performing GVN right after EA. The bad IR is as follows: > > ![image](https://user-images.githubusercontent.com/5010047/183106710-3a518e5e-0b59-4c3c-aba4-8b6fcade3519.png) > > The memory input of Load#971 is Phi#1109 and the address input of Load#971 is AddP whose object base is CheckCastPP#335: > > The type of Phi#1109 is `byte[int:>=0]:exact+any *` while `byte[int:8]:NotNull:exact+any *,iid=177` is the type of CheckCastPP#335 due to EA, they have different alias index, that's why we hit the assertion at L226: > > https://github.com/openjdk/jdk/blob/b17a745d7f55941f02b0bdde83866aa5d32cce07/src/hotspot/share/opto/memnode.cpp#L207-L226 > (t is `byte[int:>=0]:exact+any *`, t_adr is `byte[int:8]:NotNull:exact+any *,iid=177`). > > There is a long story. 
In the beginning, LoadB#971 is generated at array_copy_forward, and GVN transformed it iteratively: > > 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1109 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > ... 
>
> In this case, we get alias index 5 from the address input AddP#969. Stepping through MergeMem#1046 with that index finds Phi#1109, which is why LoadB->in(Mem) is changed from MergeMem#1046 to Phi#1109 (which finally leads to the crash).
>
> 1046 MergeMem === _ 1 160 389 389 1109 1 1 389 1 1 1 1 1 1 1 1 1 1 1 1 1 709 709 709 709 882 888 894 190 190 912 191 [[ 1025 1021 1017 1013 1009 1005 1002 1001 998 996 991 986 981 976 971 966 962 961 960 121 122 123 124 1027 ]]
>
>
> After applying this patch, some related nodes are pushed into the GVN worklist, so the address input has already been changed to AddP#473 before we step through MergeMem#1046. That is, we get alias index 32 from the address input AddP#473, stepping through MergeMem#1046 finds StoreB#191, and LoadB->in(Mem) is changed from MergeMem#1046 to StoreB#191.
>
> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22)
> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22)
> 971 LoadB === 1115 1046 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425)
DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 
[[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 390 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > ... > > The well-formed IR looks like this: > ![image](https://user-images.githubusercontent.com/5010047/183239456-7096ea66-6fca-4c84-8f46-8c42d10b686a.png) > > Thanks for your patience. This pull request has now been integrated. 
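
The alias-index lookup in the quoted analysis can be reproduced by indexing the input list of MergeMem#1046 exactly as it appears in the dump. The following is a minimal sketch, not HotSpot code: the class name, the `memoryAt` helper and the use of `-1` for the unused slot printed as `_` are assumptions made for illustration, and any special handling of empty slices in the real `MergeMemNode` is ignored.

```java
class MergeMemSliceLookup {
    // Inputs of "1046 MergeMem === ..." copied from the dump, in order.
    // Slot 0 is printed as "_" in the dump; it is recorded here as -1.
    static final int[] MERGE_MEM_1046 = {
        -1, 1, 160, 389, 389, 1109, 1, 1, 389,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        709, 709, 709, 709, 882, 888, 894, 190, 190, 912, 191
    };

    // Returns the node id found in the slot for the given alias index.
    static int memoryAt(int aliasIndex) {
        return MERGE_MEM_1046[aliasIndex];
    }

    public static void main(String[] args) {
        // Alias index taken from AddP#969 (before the patch).
        System.out.println("alias index 5  -> node " + memoryAt(5));
        // Alias index taken from AddP#473 (after the patch).
        System.out.println("alias index 32 -> node " + memoryAt(32));
    }
}
```

Slot 5 holds 1109 (the Phi) and slot 32 holds 191 (the StoreB), which matches the two outcomes described in the analysis.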
Changeset: 04591595 Author: Yi Yang URL: https://git.openjdk.org/jdk/commit/04591595374e84cfbfe38d92bff4409105b28009 Stats: 95 lines in 3 files changed: 82 ins; 8 del; 5 mod 8288204: GVN Crash: assert() failed: correct memory chain Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/9777 From duke at openjdk.org Wed Dec 28 23:17:54 2022 From: duke at openjdk.org (duke) Date: Wed, 28 Dec 2022 23:17:54 GMT Subject: Withdrawn: 8286800: Assert in PhaseIdealLoop::dump_real_LCA is too strong In-Reply-To: References: Message-ID: On Wed, 28 Sep 2022 19:04:07 GMT, Dhamoder Nalla wrote: > https://bugs.openjdk.org/browse/JDK-8286800 > > assert(real_LCA != NULL) in dump_real_LCA is not appropriate in bad graph scenario when both wrong_lca & early nodes are start nodes > > jvm!PhaseIdealLoop::dump_real_LCA(): > // Walk the idom chain up from early and wrong_lca and stop when they intersect. > while (!n1->is_Start() && !n2->is_Start()) { > ... > } > assert(real_LCA != NULL, "must always find an LCA"); > > Fix: replace assert with a console message This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/10472 From haosun at openjdk.org Thu Dec 29 09:16:48 2022 From: haosun at openjdk.org (Hao Sun) Date: Thu, 29 Dec 2022 09:16:48 GMT Subject: RFR: JDK-8299158: Improve MD5 intrinsic on AArch64 In-Reply-To: References: Message-ID: <0ec3tyPRx7uC03NsQ-4v3igLSCqZ5_9bRAWAAwWkT-c=.09be3d8b-2b41-4e1e-a44c-f27d61da9df7@github.com> On Wed, 21 Dec 2022 01:52:32 GMT, Yi-Fan Tsai wrote: > There are two optimizations to reduce the length of the data path. > 1) Replace > > __ eorw(rscratch3, rscratch3, r4); > __ addw(rscratch3, rscratch3, rscratch1); > __ addw(rscratch3, rscratch3, rscratch4); > > with > > __ eorw(rscratch3, rscratch3, r4); > __ addw(rscratch4, rscratch4, rscratch1); > __ addw(rscratch3, rscratch3, rscratch4); > > The eorw and the first addw can be computed in parallel. 
>
> 2) Replace
>
>     __ eorw(rscratch2, r2, r3);
>     __ andw(rscratch3, rscratch2, r4);
>     __ eorw(rscratch3, rscratch3, r3);
>
> with
>
>     __ andw(rscratch3, r2, r4);
>     __ bicw(rscratch4, r3, r4);
>     __ orrw(rscratch3, rscratch3, rscratch4);
>
> The transformation is based on the equation `((r2 ^ r3) & r4) ^ r3 == (r2 & r4) | (r3 & ~r4)`.
> The two subexpressions on the RHS can be computed in parallel.
>
> Correctness proof
>
>     r2  r3  r4  (r2 ^ r3)  ((r2 ^ r3) & r4)  LHS  (r2 & r4)  (r3 & ~r4)  RHS
>     0   0   0   0          0                 0    0          0           0
>     0   0   1   0          0                 0    0          0           0
>     0   1   0   1          0                 1    0          1           1
>     0   1   1   1          1                 0    0          0           0
>     1   0   0   1          0                 0    0          0           0
>     1   0   1   1          1                 1    1          0           1
>     1   1   0   0          0                 1    0          1           1
>     1   1   1   0          0                 1    1          0           1
>
>
> The change has been tested by TestMD5Intrinsics and TestMD5MultiBlockIntrinsics.
>
> Performance was measured on an EC2 m6g instance (Graviton2) and shows an 18-25% improvement.
> Baseline
>
>     Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error   Units
>     MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  2989.149 ± 54.895  ops/ms
>     MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    24.927 ±  0.002  ops/ms
>     MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  2433.184 ± 74.616  ops/ms
>     MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    24.736 ±  0.002  ops/ms
>
> Optimized
>
>     Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error   Units
>     MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  3719.214 ± 23.087  ops/ms
>     MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    31.280 ±  0.003  ops/ms
>     MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  2874.308 ± 88.455  ops/ms
>     MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    31.014 ±  0.060  ops/ms

LGTM (I'm not a Reviewer)

-------------

Marked as reviewed by haosun (Author).

PR: https://git.openjdk.org/jdk/pull/11748
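
Both rewrites quoted in this review can be cross-checked outside the assembler with plain 32-bit integer arithmetic. The sketch below is only an illustration in Java, not part of the patch: the class and method names are hypothetical, the parameters stand in for the registers of the quoted sequences, and `bicw` is modelled as AND with the bitwise complement (`~`) of its second operand.

```java
import java.util.Random;

class Md5RewriteCheck {
    // Second rewrite, original sequence: eorw, andw, eorw.
    static int selectOriginal(int r2, int r3, int r4) {
        return ((r2 ^ r3) & r4) ^ r3;
    }

    // Second rewrite, new sequence: andw, bicw (AND NOT), orrw.
    static int selectRewritten(int r2, int r3, int r4) {
        return (r2 & r4) | (r3 & ~r4);
    }

    // First rewrite, original sequence: the two adds depend on the eor result.
    static int sumOriginal(int acc, int r4, int s1, int s4) {
        int t = acc ^ r4;
        t += s1;
        t += s4;
        return t;
    }

    // First rewrite, new sequence: s4 + s1 does not depend on the eor result.
    static int sumRewritten(int acc, int r4, int s1, int s4) {
        int t = acc ^ r4;
        int u = s4 + s1;
        return t + u;
    }

    // Compares both pairs on pseudo-random 32-bit inputs.
    static boolean rewritesAgree(long seed, int trials) {
        Random rnd = new Random(seed);
        for (int i = 0; i < trials; i++) {
            int a = rnd.nextInt(), b = rnd.nextInt(), c = rnd.nextInt(), d = rnd.nextInt();
            if (selectOriginal(a, b, c) != selectRewritten(a, b, c)) return false;
            if (sumOriginal(a, b, c, d) != sumRewritten(a, b, c, d)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // These three constants put all eight (r2, r3, r4) bit combinations
        // into every byte, so one call covers the whole truth table.
        int lhs = selectOriginal(0xF0F0F0F0, 0xCCCCCCCC, 0xAAAAAAAA);
        int rhs = selectRewritten(0xF0F0F0F0, 0xCCCCCCCC, 0xAAAAAAAA);
        System.out.println(Integer.toHexString(lhs) + " " + Integer.toHexString(rhs));
        System.out.println(rewritesAgree(42L, 1_000_000));
    }
}
```

The random trials are a sanity check rather than a proof. For the boolean rewrite the single truth-table call is already exhaustive because the operations are bitwise, and the add rewrite relies only on wrapping 32-bit addition being associative and commutative.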