From eliu at openjdk.org Sat Oct 1 00:21:24 2022 From: eliu at openjdk.org (Eric Liu) Date: Sat, 1 Oct 2022 00:21:24 GMT Subject: RFR: 8294262: AArch64: compiler/vectorapi/TestReverseByteTransforms.java test failed on SVE machine [v2] In-Reply-To: References: Message-ID: > This test failed at cases test_reversebytes_short/int/long_transform2, which expected the ReverseBytesV node, but none was found in the end. On SVE systems, we have a specific optimization, `ReverseBytesV (ReverseBytesV X MASK) MASK => X`, which eliminates both ReverseBytesV nodes. This optimization rule applies only on hardware with native predicate support. See https://github.com/openjdk/jdk/pull/9623 for more details. > > As there is an SVE-specific case, TestReverseByteTransformsSVE.java, this patch simply marks TestReverseByteTransforms.java as non-SVE only. > > [TEST] > jdk/incubator/vector, hotspot/compiler/vectorapi pass on SVE machine Eric Liu has updated the pull request incrementally with one additional commit since the last revision: add comment Change-Id: I4c17256ff656528bbcfcacd2ee2380df6ae14bf1 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10442/files - new: https://git.openjdk.org/jdk/pull/10442/files/cf4d967d..d3aa14e6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10442&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10442&range=00-01 Stats: 4 lines in 1 file changed: 3 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10442.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10442/head:pull/10442 PR: https://git.openjdk.org/jdk/pull/10442 From sviswanathan at openjdk.org Sat Oct 1 02:28:41 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Sat, 1 Oct 2022 02:28:41 GMT Subject: RFR: 8288043: Optimize FP to word/sub-word integral type conversion on X86 AVX2 platforms [v7] In-Reply-To: References: Message-ID: On Tue, 20 Sep 2022 10:54:47 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch extends conversion
optimizations added with [JDK-8287835](https://bugs.openjdk.org/browse/JDK-8287835) to optimize following floating point to integral conversions for X86 AVX2 targets:- >> * D2I , D2S, D2B, F2I , F2S, F2B >> >> In addition, it also optimizes following wide vector (64 bytes) double to integer and sub-type conversions for AVX512 targets which do not support AVX512DQ feature. >> * D2I, D2S, D2B >> >> Following are the JMH micro performance results with and without patch. >> >> System configuration: 40C 2S Icelake server (Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz) >> >> BENCHMARK | SIZE | BASELINE (ops/ms) | WITHOPT (ops/ms) | PERF GAIN FACTOR >> -- | -- | -- | -- | -- >> VectorFPtoIntCastOperations.microDouble128ToByte128 | 1024 | 90.603 | 92.797 | 1.024215534 >> VectorFPtoIntCastOperations.microDouble128ToByte256 | 1024 | 81.909 | 82.3 | 1.00477359 >> VectorFPtoIntCastOperations.microDouble128ToByte512 | 1024 | 26.181 | 26.244 | 1.002406325 >> VectorFPtoIntCastOperations.microDouble128ToInteger128 | 1024 | 90.74 | 2537.958 | 27.96956138 >> VectorFPtoIntCastOperations.microDouble128ToInteger256 | 1024 | 81.586 | 2429.599 | 29.7796068 >> VectorFPtoIntCastOperations.microDouble128ToInteger512 | 1024 | 19.406 | 19.61 | 1.010512213 >> VectorFPtoIntCastOperations.microDouble128ToLong128 | 1024 | 91.723 | 90.754 | 0.989435583 >> VectorFPtoIntCastOperations.microDouble128ToShort128 | 1024 | 91.766 | 1984.577 | 21.62649565 >> VectorFPtoIntCastOperations.microDouble128ToShort256 | 1024 | 81.949 | 1940.599 | 23.68056962 >> VectorFPtoIntCastOperations.microDouble128ToShort512 | 1024 | 16.468 | 16.56 | 1.005586592 >> VectorFPtoIntCastOperations.microDouble256ToByte128 | 1024 | 163.331 | 3018.351 | 18.479964 >> VectorFPtoIntCastOperations.microDouble256ToByte256 | 1024 | 148.878 | 3082.034 | 20.70174237 >> VectorFPtoIntCastOperations.microDouble256ToByte512 | 1024 | 50.108 | 51.629 | 1.030354434 >> VectorFPtoIntCastOperations.microDouble256ToInteger128 | 1024 | 159.805 | 
4619.421 | 28.90661118 >> VectorFPtoIntCastOperations.microDouble256ToInteger256 | 1024 | 143.876 | 4649.642 | 32.31700909 >> VectorFPtoIntCastOperations.microDouble256ToInteger512 | 1024 | 38.127 | 38.188 | 1.001599916 >> VectorFPtoIntCastOperations.microDouble256ToLong128 | 1024 | 160.322 | 162.442 | 1.013223388 >> VectorFPtoIntCastOperations.microDouble256ToLong256 | 1024 | 141.252 | 143.01 | 1.012445841 >> VectorFPtoIntCastOperations.microDouble256ToShort128 | 1024 | 157.717 | 3757.471 | 23.82413437 >> VectorFPtoIntCastOperations.microDouble256ToShort256 | 1024 | 143.876 | 3830.971 | 26.62689399 >> VectorFPtoIntCastOperations.microDouble256ToShort512 | 1024 | 32.061 | 32.911 | 1.026511962 >> VectorFPtoIntCastOperations.microFloat128ToByte128 | 1024 | 146.599 | 4002.967 | 27.30555461 >> VectorFPtoIntCastOperations.microFloat128ToByte256 | 1024 | 136.99 | 3938.799 | 28.75245638 >> VectorFPtoIntCastOperations.microFloat128ToByte512 | 1024 | 51.561 | 50.284 | 0.975233219 >> VectorFPtoIntCastOperations.microFloat128ToInteger128 | 1024 | 5933.565 | 5361.472 | 0.903583596 >> VectorFPtoIntCastOperations.microFloat128ToInteger256 | 1024 | 5079.564 | 5062.046 | 0.996551279 >> VectorFPtoIntCastOperations.microFloat128ToInteger512 | 1024 | 37.101 | 38.419 | 1.035524649 >> VectorFPtoIntCastOperations.microFloat128ToLong128 | 1024 | 145.863 | 145.362 | 0.99656527 >> VectorFPtoIntCastOperations.microFloat128ToLong256 | 1024 | 131.159 | 133.154 | 1.015210546 >> VectorFPtoIntCastOperations.microFloat128ToShort128 | 1024 | 145.966 | 4150.039 | 28.4315457 >> VectorFPtoIntCastOperations.microFloat128ToShort256 | 1024 | 134.703 | 4566.589 | 33.90116775 >> VectorFPtoIntCastOperations.microFloat128ToShort512 | 1024 | 31.878 | 30.867 | 0.968285338 >> VectorFPtoIntCastOperations.microFloat256ToByte128 | 1024 | 237.841 | 6292.051 | 26.4548627 >> VectorFPtoIntCastOperations.microFloat256ToByte256 | 1024 | 222.041 | 6292.748 | 28.34047766 >> 
VectorFPtoIntCastOperations.microFloat256ToByte512 | 1024 | 92.073 | 88.981 | 0.966417951 >> VectorFPtoIntCastOperations.microFloat256ToInteger128 | 1024 | 11471.121 | 10269.636 | 0.895260019 >> VectorFPtoIntCastOperations.microFloat256ToInteger256 | 1024 | 10729.816 | 10105.92 | 0.941853989 >> VectorFPtoIntCastOperations.microFloat256ToInteger512 | 1024 | 68.328 | 70.005 | 1.024543379 >> VectorFPtoIntCastOperations.microFloat256ToLong128 | 1024 | 247.101 | 248.571 | 1.005948984 >> VectorFPtoIntCastOperations.microFloat256ToLong256 | 1024 | 225.74 | 223.987 | 0.992234429 >> VectorFPtoIntCastOperations.microFloat256ToLong512 | 1024 | 76.39 | 76.187 | 0.997342584 >> VectorFPtoIntCastOperations.microFloat256ToShort128 | 1024 | 233.196 | 8202.179 | 35.17289748 >> VectorFPtoIntCastOperations.microFloat256ToShort256 | 1024 | 220.75 | 7781.073 | 35.24834881 >> VectorFPtoIntCastOperations.microFloat256ToShort512 | 1024 | 58.143 | 55.633 | 0.956830573 >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8288043: Adding descriptive comments. I am still going through the c2_MacroAssembler_x86.cpp changes. Hopefully early next week will finish the review. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4611: > 4609: void C2_MacroAssembler::vector_castF2L_evex(XMMRegister dst, XMMRegister src, XMMRegister xtmp1, XMMRegister xtmp2, > 4610: KRegister ktmp1, KRegister ktmp2, AddressLiteral double_sign_flip, > 4611: Register rscratch, int vec_enc) { Need an assert here: assert(rscratch != noreg || always_reachable(double_sign_flip), "missing"); src/hotspot/cpu/x86/matcher_x86.hpp line 196: > 194: case Op_VectorCastF2X: // fall through > 195: case Op_VectorCastD2X: { > 196: return is_subword_type(ety) ? 35 : 30; This needs to be more selective. It is not that in all cases F2X and D2X need lot of instructions e.g. 
F2D, D2F are single instruction. ------------- PR: https://git.openjdk.org/jdk/pull/9748 From iveresov at openjdk.org Sat Oct 1 05:12:44 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Sat, 1 Oct 2022 05:12:44 GMT Subject: RFR: 8242115: C2 SATB barriers are not safepoint-safe In-Reply-To: References: Message-ID: On Fri, 30 Sep 2022 22:35:42 GMT, Vladimir Ivanov wrote: > Did you run any performance tests? Yes. No visible regressions. It very likely hits the cache. ------------- PR: https://git.openjdk.org/jdk/pull/10517 From iveresov at openjdk.org Sat Oct 1 05:12:44 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Sat, 1 Oct 2022 05:12:44 GMT Subject: RFR: 8242115: C2 SATB barriers are not safepoint-safe In-Reply-To: References: Message-ID: On Fri, 30 Sep 2022 21:22:47 GMT, Igor Veresov wrote: > Implement load pinning, use it for g1 pre-value loads, add verification that the load control dependency is kept and that there are no safepoints between the load of the pre-value and the marking check (with the exception of the CAS intrinsics where it's permitted). Thanks for the reviews, Vladimirs! ------------- PR: https://git.openjdk.org/jdk/pull/10517 From qamai at openjdk.org Sat Oct 1 06:36:04 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Sat, 1 Oct 2022 06:36:04 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v3] In-Reply-To: References: Message-ID: > Hi, > > The current peephole mechanism has several drawbacks: > - Can only match and remove adjacent instructions. > - Cannot match machine ideal nodes (e.g MachSpillCopyNode). > - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. > - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. 
> > The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grain manner. > > The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: > > mov r1, r2 -> lea r1, [r2 + r3/i] > add r1, r3/i > > and > > mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 > shl r1, i > > On the added benchmarks, the transformations show positive results: > > Benchmark Mode Cnt Score Error Units > LeaPeephole.B_D_int avgt 5 1200.490 ? 104.662 ns/op > LeaPeephole.B_D_long avgt 5 1211.439 ? 30.196 ns/op > LeaPeephole.B_I_int avgt 5 1118.831 ? 7.995 ns/op > LeaPeephole.B_I_long avgt 5 1112.389 ? 15.838 ns/op > LeaPeephole.I_S_int avgt 5 1262.528 ? 7.293 ns/op > LeaPeephole.I_S_long avgt 5 1223.820 ? 17.777 ns/op > > Benchmark Mode Cnt Score Error Units > LeaPeephole.B_D_int avgt 5 860.889 ? 6.089 ns/op > LeaPeephole.B_D_long avgt 5 945.455 ? 21.603 ns/op > LeaPeephole.B_I_int avgt 5 849.109 ? 9.809 ns/op > LeaPeephole.B_I_long avgt 5 851.283 ? 16.921 ns/op > LeaPeephole.I_S_int avgt 5 976.594 ? 23.004 ns/op > LeaPeephole.I_S_long avgt 5 936.984 ? 9.601 ns/op > > A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently. > > Thank you very much. Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 30 commits: - typo - limit coalescing - Merge branch 'master' into peephole - Merge branch 'master' into peephole - Merge branch 'master' into peephole - Merge branch 'master' into peephole - Merge branch 'master' into peephole - Merge branch 'master' into peephole - some fix - add benchmark - ... and 20 more: https://git.openjdk.org/jdk/compare/3419363e...8524aaa9 ------------- Changes: https://git.openjdk.org/jdk/pull/8025/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=8025&range=02 Stats: 1004 lines in 21 files changed: 860 ins; 24 del; 120 mod Patch: https://git.openjdk.org/jdk/pull/8025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/8025/head:pull/8025 PR: https://git.openjdk.org/jdk/pull/8025 From qamai at openjdk.org Sat Oct 1 06:36:08 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Sat, 1 Oct 2022 06:36:08 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v2] In-Reply-To: References: Message-ID: On Sat, 21 May 2022 12:16:23 GMT, Quan Anh Mai wrote: >> Hi, >> >> The current peephole mechanism has several drawbacks: >> - Can only match and remove adjacent instructions. >> - Cannot match machine ideal nodes (e.g MachSpillCopyNode). >> - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. >> - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. >> >> The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grain manner. 
>> >> The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: >> >> mov r1, r2 -> lea r1, [r2 + r3/i] >> add r1, r3/i >> >> and >> >> mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 >> shl r1, i >> >> On the added benchmarks, the transformations show positive results: >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 1200.490 ? 104.662 ns/op >> LeaPeephole.B_D_long avgt 5 1211.439 ? 30.196 ns/op >> LeaPeephole.B_I_int avgt 5 1118.831 ? 7.995 ns/op >> LeaPeephole.B_I_long avgt 5 1112.389 ? 15.838 ns/op >> LeaPeephole.I_S_int avgt 5 1262.528 ? 7.293 ns/op >> LeaPeephole.I_S_long avgt 5 1223.820 ? 17.777 ns/op >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 860.889 ? 6.089 ns/op >> LeaPeephole.B_D_long avgt 5 945.455 ? 21.603 ns/op >> LeaPeephole.B_I_int avgt 5 849.109 ? 9.809 ns/op >> LeaPeephole.B_I_long avgt 5 851.283 ? 16.921 ns/op >> LeaPeephole.I_S_int avgt 5 976.594 ? 23.004 ns/op >> LeaPeephole.I_S_long avgt 5 936.984 ? 9.601 ns/op >> >> A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: > > - Merge branch 'master' into peephole > - some fix > - add benchmark > - Merge branch 'master' into peephole > - refactor > - fix? > - refactor > - attempt > - attempt > - build fix > - ... and 13 more: https://git.openjdk.org/jdk/compare/72bd41b8...78b4a3f2 Thanks a lot for your testing, can you run the tests again, please? 
------------- PR: https://git.openjdk.org/jdk/pull/8025 From eosterlund at openjdk.org Sat Oct 1 08:23:27 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Sat, 1 Oct 2022 08:23:27 GMT Subject: RFR: 8242115: C2 SATB barriers are not safepoint-safe In-Reply-To: References: Message-ID: On Fri, 30 Sep 2022 21:22:47 GMT, Igor Veresov wrote: > Implement load pinning, use it for g1 pre-value loads, add verification that the load control dependency is kept and that there are no safepoints between the load of the pre-value and the marking check (with the exception of the CAS intrinsics where it's permitted). Since I introduced the "pinned" control dependency for ZGC loads back when the barrier was in the sea of nodes, I'd like to point out that I also had a patch with some of these fixes, but kind of gave up and did a table flip when the loads started to float across safepoints after the matching to mach nodes, but only on AArch64, during GCM and instruction scheduling IIRC. I had verification code (but implemented differently when an edge is updated in the graph) that said all is fine in my sea of nodes but then the mach nodes started floating around after that messing up the order wrt safepoints anyway. That's why ZGC stopped trying to make the pinned control dependency work, and expands the barriers in assembly code instead. I would not be surprised if we still have the same issue here, and that the sea of nodes verification similarly won't catch it. Because I got that T-shirt already and this all looks fairly familiar. I hope it doesn't have those issues. 
------------- PR: https://git.openjdk.org/jdk/pull/10517 From qamai at openjdk.org Sat Oct 1 10:22:42 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Sat, 1 Oct 2022 10:22:42 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v4] In-Reply-To: References: Message-ID: > Hi, > > The current peephole mechanism has several drawbacks: > - Can only match and remove adjacent instructions. > - Cannot match machine ideal nodes (e.g MachSpillCopyNode). > - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. > - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. > > The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grain manner. > > The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: > > mov r1, r2 -> lea r1, [r2 + r3/i] > add r1, r3/i > > and > > mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 > shl r1, i > > On the added benchmarks, the transformations show positive results: > > Benchmark Mode Cnt Score Error Units > LeaPeephole.B_D_int avgt 5 1200.490 ? 104.662 ns/op > LeaPeephole.B_D_long avgt 5 1211.439 ? 30.196 ns/op > LeaPeephole.B_I_int avgt 5 1118.831 ? 7.995 ns/op > LeaPeephole.B_I_long avgt 5 1112.389 ? 15.838 ns/op > LeaPeephole.I_S_int avgt 5 1262.528 ? 7.293 ns/op > LeaPeephole.I_S_long avgt 5 1223.820 ? 17.777 ns/op > > Benchmark Mode Cnt Score Error Units > LeaPeephole.B_D_int avgt 5 860.889 ? 6.089 ns/op > LeaPeephole.B_D_long avgt 5 945.455 ? 21.603 ns/op > LeaPeephole.B_I_int avgt 5 849.109 ? 9.809 ns/op > LeaPeephole.B_I_long avgt 5 851.283 ? 
16.921 ns/op > LeaPeephole.I_S_int avgt 5 976.594 ? 23.004 ns/op > LeaPeephole.I_S_long avgt 5 936.984 ? 9.601 ns/op > > A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently. > > Thank you very much. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: check index ------------- Changes: - all: https://git.openjdk.org/jdk/pull/8025/files - new: https://git.openjdk.org/jdk/pull/8025/files/8524aaa9..d5928f02 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=8025&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=8025&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/8025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/8025/head:pull/8025 PR: https://git.openjdk.org/jdk/pull/8025 From dnsimon at openjdk.org Sat Oct 1 11:24:27 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Sat, 1 Oct 2022 11:24:27 GMT Subject: Integrated: 8294676: [JVMCI] InstalledCode.deoptimize(false) should not touch address field In-Reply-To: References: Message-ID: <4A_qyfW3OeojwcdSFPiBJuV1hk2Yk_7CpcYYvPnYfK4=.bbdcd16d-b2c3-4da2-8c8c-d154defe079c@github.com> On Fri, 30 Sep 2022 16:32:25 GMT, Doug Simon wrote: > The ability to make an nmethod non-entrant via an `InstalledCode` object was added by [JDK-8292917](https://bugs.openjdk.org/browse/JDK-8292917). However, the make non-entrant path clears `InstalledCode.address`. > This breaks the connection between an `InstalledCode` object and the nmethod. This makes it impossible to subsequently deoptimize the code via the `InstalledCode` object as shown below: > > > InstalledCode tier1Code = ...; > // Make tier1Code non-entrant (e.g. it is being replaced by a more optimized version, tier2Code) > // but do not deoptimize it as it is still valid. 
Let current executions of tier1Code complete. > tier1Code.invalidate(false); > > ... > > // Some assumption used in compiling tier1Code is about to be invalidated > // so it must now be deoptimized. > tier1Code.invalidate(true) > > > Prior to this PR, the last statement above does nothing which leads to `tier1Code` incorrectly being executed as soon as the assumption in question is invalidated. This pull request has now been integrated. Changeset: b8b9b97a Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/b8b9b97a1a3e07777da2e39ac4779ef7b77434c7 Stats: 312 lines in 8 files changed: 151 ins; 126 del; 35 mod 8294676: [JVMCI] InstalledCode.deoptimize(false) should not touch address field Reviewed-by: never ------------- PR: https://git.openjdk.org/jdk/pull/10514 From iveresov at openjdk.org Sat Oct 1 18:21:17 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Sat, 1 Oct 2022 18:21:17 GMT Subject: RFR: 8242115: C2 SATB barriers are not safepoint-safe In-Reply-To: References: Message-ID: On Fri, 30 Sep 2022 21:22:47 GMT, Igor Veresov wrote: > Implement load pinning, use it for g1 pre-value loads, add verification that the load control dependency is kept and that there are no safepoints between the load of the pre-value and the marking check (with the exception of the CAS intrinsics where it's permitted). Thanks for the hint! I'll push these changes to the front-end and take a harder look at the GCM. ------------- PR: https://git.openjdk.org/jdk/pull/10517 From eosterlund at openjdk.org Sat Oct 1 18:36:19 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Sat, 1 Oct 2022 18:36:19 GMT Subject: RFR: 8242115: C2 SATB barriers are not safepoint-safe In-Reply-To: References: Message-ID: On Sat, 1 Oct 2022 18:17:21 GMT, Igor Veresov wrote: > Thanks for the hint! I'll push these changes to the front-end and take a harder look at the GCM. Thanks Igor! 
------------- PR: https://git.openjdk.org/jdk/pull/10517 From qamai at openjdk.org Sun Oct 2 11:50:35 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Sun, 2 Oct 2022 11:50:35 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v5] In-Reply-To: References: Message-ID: > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. > > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant may overflow and we need to add back the dividend as in the `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (conditions (1) and (2) above are mostly the same). As a result, the magic constant `c`, calculated with the method presented in Hacker's Delight by Henry S. Warren, Jr., may overflow an uintN. For int division, we can depend on the theorem devised by Arch D.
Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. > > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: fast path for negative divisors ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9947/files - new: https://git.openjdk.org/jdk/pull/9947/files/156f65c0..fccfe7ec Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=03-04 Stats: 100 lines in 7 files changed: 79 ins; 6 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From qamai at openjdk.org Sun Oct 2 12:06:06 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Sun, 2 Oct 2022 12:06:06 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v6] In-Reply-To: References: Message-ID: <7EktXp6uK61LePDSjeFzpSl-v88I0dymsBjZN3xeGXw=.e6255665-23aa-415d-8fca-c26880cd704e@github.com> > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. 
> > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant may overflow and we need to add back the dividend as in the `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (conditions (1) and (2) above are mostly the same). As a result, the magic constant `c`, calculated with the method presented in Hacker's Delight by Henry S. Warren, Jr., may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. Either way, we can perform a full multiplication. > > For longs, there is no way to do a full multiplication, so we do some basic transformations to achieve a computable formula. I have written the details as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much.
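To make identities (1)–(4) concrete, here is a scalar Java sketch for the single divisor d = 3. The magic constants 0x55555556 (shift 32, signed) and 0xAAAAAAAB (shift 33, unsigned) are the standard textbook values for this divisor, not constants copied from the patch:

```java
public class MagicDiv3Sketch {
    // Signed x / 3: multiply by c = (2**32 + 2) / 3 = 0x55555556, take the
    // high 32 bits (= floor(x * c / 2**32)), then add 1 for negative x to
    // turn the floor of identity (2) into Java's truncating division.
    static int div3(int x) {
        int q = (int) ((x * 0x55555556L) >> 32); // floor(x * c / 2**32)
        return q + (x >>> 31);                   // fix-up for x < 0
    }

    // Unsigned x / 3 over the full uint32 range: c1 = ceil(2**33 / 3)
    // = 0xAAAAAAAB fits in a uint32, and x * c1 never overflows an uint64
    // (identity (3)), so the low 64 bits of the product are exact and the
    // logical shift >>> 33 performs the unsigned division by 2**33.
    static int udiv3(int x) {
        long ux = x & 0xFFFFFFFFL;               // reinterpret as unsigned
        return (int) ((ux * 0xAAAAAAABL) >>> 33);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 2, 3, 7, 100, -1, -7,
                         Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            if (div3(x) != x / 3)
                throw new AssertionError("signed mismatch at " + x);
            if (udiv3(x) != Integer.divideUnsigned(x, 3))
                throw new AssertionError("unsigned mismatch at " + x);
        }
        System.out.println("ok");
    }
}
```

Note that `udiv3` still works when `ux * 0xAAAAAAABL` exceeds `Long.MAX_VALUE`: the true product is below 2**64, so the wrapped signed long holds the correct bit pattern and the logical shift recovers the quotient.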
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: whitespace, mistaken added ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9947/files - new: https://git.openjdk.org/jdk/pull/9947/files/fccfe7ec..53c07784 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=04-05 Stats: 45 lines in 2 files changed: 0 ins; 44 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From svkamath at openjdk.org Mon Oct 3 05:45:55 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Mon, 3 Oct 2022 05:45:55 GMT Subject: RFR: 8289552: Make intrinsic conversions between bit representations of half precision values and floats [v12] In-Reply-To: References: Message-ID: > 8289552: Make intrinsic conversions between bit representations of half precision values and floats Smita Kamath has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 13 additional commits since the last revision: - Updated instruction definition - Merge branch 'master' - Addressed review comment to update test case - Addressed review comments - Merge branch 'master' of https://git.openjdk.java.net/jdk into JDK-8289552 - Addressed review comments - Added missing parantheses - Addressed review comments, updated microbenchmark - Updated copyright comment - Updated test cases as per review comments - ... 
and 3 more: https://git.openjdk.org/jdk/compare/ac2b491b...69999ce4 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9781/files - new: https://git.openjdk.org/jdk/pull/9781/files/8ccc0657..69999ce4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9781&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9781&range=10-11 Stats: 14672 lines in 427 files changed: 7255 ins; 5491 del; 1926 mod Patch: https://git.openjdk.org/jdk/pull/9781.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9781/head:pull/9781 PR: https://git.openjdk.org/jdk/pull/9781 From tholenstein at openjdk.org Mon Oct 3 07:17:53 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 3 Oct 2022 07:17:53 GMT Subject: RFR: JDK-8294567: IGV: IllegalStateException in search In-Reply-To: References: Message-ID: On Thu, 29 Sep 2022 10:18:56 GMT, Roberto Castañeda Lozano wrote: >> When searching for a node, IGV first looks in the current open graph. If it can find the node here, everything works fine. >> >> # Problem >> If it cannot find the node, it uses the `searchForward` and `searchBackward` in the `EditorInputGraphProvider` to search in the other graphs. It crashes here because `editor.isOpened()` can only be called from the `EventDispatchThread`. >> >> # Solution >> The calls to `editor.isOpened()` are not needed because `editor != null` already means that it is open. So just remove all 4 calls to `editor.isOpened()`. > > Looks good! Thank you @robcasloz, @chhagedorn and @vnkozlov for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/10483 From tholenstein at openjdk.org Mon Oct 3 07:17:54 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 3 Oct 2022 07:17:54 GMT Subject: Integrated: JDK-8294567: IGV: IllegalStateException in search In-Reply-To: References: Message-ID: On Thu, 29 Sep 2022 09:14:23 GMT, Tobias Holenstein wrote: > When searching for a node, IGV first looks in the current open graph.
If it can find the node here, everything works fine. > > # Problem > If it cannot find the node, it uses the `searchForward` and `searchBackward` in the `EditorInputGraphProvider` to search in the other graphs. It crashes here because `editor.isOpened()` can only be called from the `EventDispatchThread`. > > # Solution > The calls to `editor.isOpened()` are not needed because `editor != null` already means that it is open. So just remove all 4 calls to `editor.isOpened()`. This pull request has now been integrated. Changeset: 6e8f0387 Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/6e8f0387d64c9620bdd4c8913b2f41eade805348 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod 8294567: IGV: IllegalStateException in search Reviewed-by: rcastanedalo, chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/10483 From chagedorn at openjdk.org Mon Oct 3 07:32:31 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 3 Oct 2022 07:32:31 GMT Subject: RFR: 8286800: Assert in PhaseIdealLoop::dump_real_LCA is too strong In-Reply-To: References: Message-ID: <-R-7utOPOqW-m_INYgu4F3Xc1Becu6rQEvoml1E2d0A=.d93f53d8-3ca3-4f14-9555-3e9a5744771a@github.com> On Wed, 28 Sep 2022 19:04:07 GMT, Dhamoder Nalla wrote: > https://bugs.openjdk.org/browse/JDK-8286800 > > assert(real_LCA != NULL) in dump_real_LCA is not appropriate in a bad graph scenario when both wrong_lca & early nodes are start nodes > > jvm!PhaseIdealLoop::dump_real_LCA(): > // Walk the idom chain up from early and wrong_lca and stop when they intersect. > while (!n1->is_Start() && !n2->is_Start()) { > ... > } > assert(real_LCA != NULL, "must always find an LCA"); > > Fix: replace assert with a console message Hi @dhanalla I don't think we should remove this assertion. We should always be able to find an LCA with the given `early` and `wrong_lca`. Hitting the assertion indicates that there is a mistake in the way `dump_real_LCA()` finds the LCA.
I therefore suggest fixing the algorithm instead of disabling the assert. When I originally wrote that code, I tried to simultaneously walk the idom chains from `early` and `wrong_lca` to be more efficient. I don't think this optimization was necessary, looking at the added complexity and given that we are in debug code and about to fail anyway. I'm currently working on [JDK-8285835](https://bugs.openjdk.org/browse/JDK-8285835) where I hit the very same assertion failure. In this process, I've fixed `dump_real_LCA()` and made it simpler. I'm also printing less information (I think the idom dumps are too verbose at the moment). If you like, I could take this bug over. Otherwise, I can also follow up with an RFE to get the improved printing in separately. Cheers, Christian ------------- PR: https://git.openjdk.org/jdk/pull/10472 From rcastanedalo at openjdk.org Mon Oct 3 07:41:18 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 3 Oct 2022 07:41:18 GMT Subject: RFR: 8294236: [IR Framework] CPU preconditions are overriden by regular preconditions [v3] In-Reply-To: References: Message-ID: <71Bvdgthz0mhh78NOGxKZhRW4q_cTXApVOWxmCm6Tbg=.055a9a10-a175-450f-af44-d876fc1d04e2@github.com> On Thu, 29 Sep 2022 10:36:24 GMT, Roberto Castañeda Lozano wrote: >> This changeset ensures that all preconditions of an IR test (`applyIf`, `applyIfCPUFeature`, etc.) are evaluated as a logical conjunction to determine whether the test's IR check should be applied. >> >> #### Testing >> >> - tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64). >> - IR framework tests in `test/hotspot/jtreg/testlibrary_tests/ir_framework` (linux-x64). > > Roberto Castañeda Lozano has updated the pull request incrementally with two additional commits since the last revision: > > - Use else-if, remove single-use Boolean variables, factor out duplicated code > - Clarify comment in test case Thanks for reviewing, Christian and Vladimir! 
------------- PR: https://git.openjdk.org/jdk/pull/10402 From rcastanedalo at openjdk.org Mon Oct 3 07:43:24 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 3 Oct 2022 07:43:24 GMT Subject: Integrated: 8294236: [IR Framework] CPU preconditions are overriden by regular preconditions In-Reply-To: References: Message-ID: <2fo1vJGbLQkeVC_UrjoEXcxDFqXGFnfyj5UNt_Wl3w8=.beeb2840-518d-4276-951b-9e903e809fdd@github.com> On Fri, 23 Sep 2022 07:55:15 GMT, Roberto Castañeda Lozano wrote: > This changeset ensures that all preconditions of an IR test (`applyIf`, `applyIfCPUFeature`, etc.) are evaluated as a logical conjunction to determine whether the test's IR check should be applied. > > #### Testing > > - tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64). > - IR framework tests in `test/hotspot/jtreg/testlibrary_tests/ir_framework` (linux-x64). This pull request has now been integrated. Changeset: 5fe837a3 Author: Roberto Castañeda Lozano URL: https://git.openjdk.org/jdk/commit/5fe837a35e03dc7a1a5f7fc8a2d0350573f4b81f Stats: 155 lines in 3 files changed: 90 ins; 64 del; 1 mod 8294236: [IR Framework] CPU preconditions are overriden by regular preconditions Reviewed-by: chagedorn, pli, kvn ------------- PR: https://git.openjdk.org/jdk/pull/10402 From qamai at openjdk.org Mon Oct 3 08:36:30 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 3 Oct 2022 08:36:30 GMT Subject: RFR: 8289552: Make intrinsic conversions between bit representations of half precision values and floats [v11] In-Reply-To: References: <_Ghl2lsnrBhiWvVD3TMiwGo6SfQLl6idczb1QVqLa_I=.7cfa48e2-2987-43e0-a689-0e3462e4d270@github.com> Message-ID: On Fri, 30 Sep 2022 10:04:34 GMT, Quan Anh Mai wrote: >> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comment to update test case > > src/hotspot/cpu/x86/x86.ad line 3674: > >> 3672: %} >> 3673: >> 3674: instruct 
convF2HF_mem_reg(memory mem, regF src, kReg ktmp, rRegI rtmp) %{ > > You can use `kmovwl` instead, which will relax the avx512bw constraint; however, you will need avx512vl for `evcvtps2ph`. Thanks. Rethinking about it, you can get 0x01 by shifting k0 to the right: `kshiftrw(ktmp, k0, 15)` ------------- PR: https://git.openjdk.org/jdk/pull/9781 From tholenstein at openjdk.org Mon Oct 3 11:35:11 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 3 Oct 2022 11:35:11 GMT Subject: RFR: JDK-8294529 : IGV: Highlight the current graphs in the Outline [v3] In-Reply-To: <9Shl-1KCi-8SlKaBeZ_QfPOSauPY8a6H-umN36EF41I=.1e0d2e68-375d-42fe-a05f-a1620b4fc578@github.com> References: <9Shl-1KCi-8SlKaBeZ_QfPOSauPY8a6H-umN36EF41I=.1e0d2e68-375d-42fe-a05f-a1620b4fc578@github.com> Message-ID: > # Problem > IGV marks the graph(s) of the currently active `EditorTopComponent` as selected. Here is an example of a difference graph between the `Incremental Boxing Inline` and `Before CountedLoop` graphs: > before1 > > The selection can be changed when the user selects another graph (without opening it). This is needed, for example, to select graphs that the user wants to delete: > before2 > > # Proposed Solution > Change the font of the open graph to `bold` and make the icon darker. Do the same for the folder where the graph is located. Now the user can select different graphs and still see which graphs are currently opened in the active `EditorTopComponent`. > new_highlighting > > # Implementation Details > Introduce a new `selected` variable in `FolderNode` and `GraphNode`. Whenever the currently viewed graphs change, the function `changed(InputGraphProvider lastProvider)` in `OutlineTopComponent` is called. Here we set the `FolderNode` and `GraphNode` to be (un)selected. This fires a `fireDisplayNameChange` and a `fireIconChange`, which update the text/icons. 
Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: code style ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10468/files - new: https://git.openjdk.org/jdk/pull/10468/files/43d4f9f8..d904a0d7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10468&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10468&range=01-02 Stats: 10 lines in 3 files changed: 4 ins; 5 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10468.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10468/head:pull/10468 PR: https://git.openjdk.org/jdk/pull/10468 From tholenstein at openjdk.org Mon Oct 3 11:35:15 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 3 Oct 2022 11:35:15 GMT Subject: RFR: JDK-8294529 : IGV: Highlight the current graphs in the Outline [v2] In-Reply-To: References: <9Shl-1KCi-8SlKaBeZ_QfPOSauPY8a6H-umN36EF41I=.1e0d2e68-375d-42fe-a05f-a1620b4fc578@github.com> Message-ID: On Fri, 30 Sep 2022 10:36:16 GMT, Christian Hagedorn wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> escapeHTML in getHtmlDisplayName() > > src/utils/IdealGraphVisualizer/Coordinator/src/main/java/com/sun/hotspot/igv/coordinator/FolderNode.java line 140: > >> 138: } >> 139: >> 140: private boolean selected = false; > > You should move this field up to the other field declarations. done > src/utils/IdealGraphVisualizer/Coordinator/src/main/java/com/sun/hotspot/igv/coordinator/GraphNode.java line 63: > >> 61: } >> 62: >> 63: private boolean selected = false; > > You should move this field up to the other field declarations. 
done > src/utils/IdealGraphVisualizer/Coordinator/src/main/java/com/sun/hotspot/igv/coordinator/OutlineTopComponent.java line 229: > >> 227: >> 228: private GraphNode[] selectedGraphs = new GraphNode[0]; >> 229: private final Set selectedFolders = new HashSet<>(); > > You should move these fields up to the other field declarations. done ------------- PR: https://git.openjdk.org/jdk/pull/10468 From tholenstein at openjdk.org Mon Oct 3 11:36:46 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 3 Oct 2022 11:36:46 GMT Subject: RFR: JDK-8294529 : IGV: Highlight the current graphs in the Outline [v2] In-Reply-To: References: <9Shl-1KCi-8SlKaBeZ_QfPOSauPY8a6H-umN36EF41I=.1e0d2e68-375d-42fe-a05f-a1620b4fc578@github.com> Message-ID: On Fri, 30 Sep 2022 08:09:15 GMT, Roberto Castañeda Lozano wrote: > I am not sure whether this is a regression introduced in this PR or a previous issue (in which case it should be reported and addressed separately). Thanks for catching that bug. I filed https://bugs.openjdk.org/browse/JDK-8294564 for that bug. Since it is not directly related to this PR, I will integrate and then do the fix in JDK-8294564. Thanks @robcasloz and @chhagedorn for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/10468 From tholenstein at openjdk.org Mon Oct 3 11:39:43 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 3 Oct 2022 11:39:43 GMT Subject: Integrated: JDK-8294529 : IGV: Highlight the current graphs in the Outline In-Reply-To: <9Shl-1KCi-8SlKaBeZ_QfPOSauPY8a6H-umN36EF41I=.1e0d2e68-375d-42fe-a05f-a1620b4fc578@github.com> References: <9Shl-1KCi-8SlKaBeZ_QfPOSauPY8a6H-umN36EF41I=.1e0d2e68-375d-42fe-a05f-a1620b4fc578@github.com> Message-ID: On Wed, 28 Sep 2022 14:44:10 GMT, Tobias Holenstein wrote: > # Problem > IGV marks the graph(s) of the currently active `EditorTopComponent` as selected. 
Here is an example of a difference graph between the `Incremental Boxing Inline` and `Before CountedLoop` graphs: > before1 > > The selection can be changed when the user selects another graph (without opening it). This is needed, for example, to select graphs that the user wants to delete: > before2 > > # Proposed Solution > Change the font of the open graph to `bold` and make the icon darker. Do the same for the folder where the graph is located. Now the user can select different graphs and still see which graphs are currently opened in the active `EditorTopComponent`. > new_highlighting > > # Implementation Details > Introduce a new `selected` variable in `FolderNode` and `GraphNode`. Whenever the currently viewed graphs change, the function `changed(InputGraphProvider lastProvider)` in `OutlineTopComponent` is called. Here we set the `FolderNode` and `GraphNode` to be (un)selected. This fires a `fireDisplayNameChange` and a `fireIconChange`, which update the text/icons. This pull request has now been integrated. Changeset: ccc1d316 Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/ccc1d3169691d066c08e294f5d989b007bfab114 Stats: 67 lines in 5 files changed: 60 ins; 0 del; 7 mod 8294529: IGV: Highlight the current graphs in the Outline Reviewed-by: chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/10468 From qamai at openjdk.org Mon Oct 3 12:18:44 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 3 Oct 2022 12:18:44 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v7] In-Reply-To: References: Message-ID: > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplications and shifts. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. 
> > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant may overflow, and we need to add back the dividend as in the `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (conditions (1) and (2) above are mostly the same). As a result, the magic constant `c` calculated with the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64, so we can perform a full multiplication. > > For longs, there is no way to do a full multiplication, so we do some basic transformations to achieve a computable formula. I have written the details as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and review. Thank you very much. 
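As an illustrative aside (a hypothetical Python sketch, not the C2 implementation), the `floor(x / d) = floor(x * c / 2**m)` identity that the conditions above rely on can be checked exhaustively for a narrow 8-bit unsigned division, using the candidate `c = ceil(2**m / d)`:

```python
def magic_unsigned(d, n):
    """Find (c, m) with c = ceil(2**m / d) such that
    x // d == (x * c) >> m for every 0 <= x < 2**n."""
    for m in range(2 * n + 1):
        c = (2**m + d - 1) // d  # ceil(2**m / d)
        if all((x * c) >> m == x // d for x in range(2**n)):
            return c, m
    return None

# Exhaustive 8-bit check: division by each of these constants becomes
# one multiply and one shift.
for d in (3, 7, 10):
    c, m = magic_unsigned(d, 8)
    assert all((x * c) >> m == x // d for x in range(256))
    print(f"x // {d} == (x * {c}) >> {m}")
```

For wider types the brute-force search above is of course infeasible; the point is only that the multiply-and-shift identity holds once a suitable `(c, m)` pair exists, which is what conditions (3) and (4) guarantee for uint32.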
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: revert backend changes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9947/files - new: https://git.openjdk.org/jdk/pull/9947/files/53c07784..0cb30b8d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=05-06 Stats: 435 lines in 12 files changed: 172 ins; 217 del; 46 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From shade at openjdk.org Mon Oct 3 13:04:26 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 3 Oct 2022 13:04:26 GMT Subject: RFR: 8288302: Shenandoah: SIGSEGV in vm maybe related to jit compiling xerces In-Reply-To: <4h2hq0pbUiyUWAA4ddq8Z0FaQ22q_uQk2E93ci7fJ1E=.d87c892f-a084-470e-a91d-acd6ab86b612@github.com> References: <4h2hq0pbUiyUWAA4ddq8Z0FaQ22q_uQk2E93ci7fJ1E=.d87c892f-a084-470e-a91d-acd6ab86b612@github.com> Message-ID: On Thu, 29 Sep 2022 14:31:16 GMT, Roland Westrelin wrote: > During igvn, at a heap stable test, a dominating heap stable test is found that can be used to optimize out the current one. But this area of the graph is actually dying and the current test has already lost one of its projections, something the logic doesn't expect, which causes the crash. The fix I propose is to simply detect that the heap stable test is dying and skip the transformation. All right, looks reasonable. ------------- Marked as reviewed by shade (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10491 From tholenstein at openjdk.org Mon Oct 3 13:40:01 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 3 Oct 2022 13:40:01 GMT Subject: RFR: JDK-8294564: IGV: IllegalArgumentException for "Difference to current graph" Message-ID: <2Y_2ggAHMExBs7S2EK0YHtzszsm5KMHfs6DQcm44MaE=.d313b70a-7035-405d-a233-9733fe59fc6c@github.com> "Difference to current graph" opens a new window with a difference graph in IGV. Unfortunately, it was throwing a `java.lang.IllegalArgumentException`. # Overview In IGV, for every opened graph that is not a difference graph, the user can right-click on any other graph in the Outline and select "Difference to current graph". This opens a difference graph showing the difference between the currently opened graph and the selected graph. difference to current If the current graph is already a difference graph, the function is disabled. Same if the selected graph is identical to the opened one. difference disabled # Implementation The problem was that the difference graph did not keep track of which two `InputGraphs` it was based on. Therefore, the functions `getFirstGraph()` and `getSecondGraph()` in `DiagramViewModel` did not work properly. Now, `InputGraph` keeps track of the first and second `InputGraph` if it is a difference graph (`isDiffGraph`), and `getFirstGraph()` and `getSecondGraph()` are updated to return the right `InputGraph`s for difference graphs. --------- ### Progress - [ ] Change must be properly reviewed (1 review required, with at least 1 [Reviewer](https://openjdk.org/bylaws#reviewer)) - [x] Change must not contain extraneous whitespace - [x] Commit message must refer to an issue ### Reviewing
Using git Checkout this PR locally: \ `$ git fetch https://git.openjdk.org/jdk pull/10533/head:pull/10533` \ `$ git checkout pull/10533` Update a local copy of the PR: \ `$ git checkout pull/10533` \ `$ git pull https://git.openjdk.org/jdk pull/10533/head`
Using Skara CLI tools Checkout this PR locally: \ `$ git pr checkout 10533` View PR using the GUI difftool: \ `$ git pr show -t 10533`
Using diff file Download this PR as a diff file: \ https://git.openjdk.org/jdk/pull/10533.diff
------------- Commit messages: - no difference graph for same graph - inline hasEditor() - bug fix isOpen() - Update OutlineTopComponent.java - safe firstGraph and secondGraph in Inputgraph for diff graphs - new GraphViewerImplementation view_difference() - findEditorForGraph - no diffgraph of a diffgraph - IGV: IllegalArgumentException for "Difference to current graph" Changes: https://git.openjdk.org/jdk/pull/10533/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10533&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8294564 Stats: 174 lines in 9 files changed: 94 ins; 21 del; 59 mod Patch: https://git.openjdk.org/jdk/pull/10533.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10533/head:pull/10533 PR: https://git.openjdk.org/jdk/pull/10533 From rcastanedalo at openjdk.org Mon Oct 3 14:03:17 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 3 Oct 2022 14:03:17 GMT Subject: RFR: JDK-8294564: IGV: IllegalArgumentException for "Difference to current graph" In-Reply-To: <2Y_2ggAHMExBs7S2EK0YHtzszsm5KMHfs6DQcm44MaE=.d313b70a-7035-405d-a233-9733fe59fc6c@github.com> References: <2Y_2ggAHMExBs7S2EK0YHtzszsm5KMHfs6DQcm44MaE=.d313b70a-7035-405d-a233-9733fe59fc6c@github.com> Message-ID: <15oIUcDeADWwyAsOT8JlBj6M36TKawXipBO_SV93jao=.6cf59ebd-88eb-48bf-a587-1dd6f1f16802@github.com> On Mon, 3 Oct 2022 13:05:52 GMT, Tobias Holenstein wrote: > "Difference to current graph" opens a new window with a difference graph in IGV. Unfortunately, it was throwing a `java.lang.IllegalArgumentException` > > # Overview > In IGV, for every opened graph, that is not a difference graph, the user can right-click on any other graph in the Outline and select "Difference to current graph". This opens a difference graph showing the difference between the currently opened graph and the selected graph. > difference to current > > If the current graph is already a difference graph, the function is disabled. 
Same if the selected graph is identical to the opened one. > difference disabled > > # Implementation > The problem was that the difference graph did not keep track of which two `InputGraphs` it was based on. Therefore the functions `getFirstGraph()` and `getSecondGraph` in `DiagramViewModel` did not work properly. Now, `InputGraph` keeps track of the first and second `InputGraph` if it is a difference graph (`isDiffGraph`). And `getFirstGraph()` and `getSecondGraph` are updated to return the right `InputGraph`s for difference graphs. > > --------- > ### Progress > - [ ] Change must be properly reviewed (1 review required, with at least 1 [Reviewer](https://openjdk.org/bylaws#reviewer)) > - [x] Change must not contain extraneous whitespace > - [x] Commit message must refer to an issue > > > > ### Reviewing >
Looks good, thanks for fixing this! A minor nit: please rename `view_difference` to `viewDifference` for style consistency. ------------- Marked as reviewed by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/10533 From smonteith at openjdk.org Mon Oct 3 14:08:51 2022 From: smonteith at openjdk.org (Stuart Monteith) Date: Mon, 3 Oct 2022 14:08:51 GMT Subject: RFR: 8294194: Create intrinsics compress and expand Message-ID: The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. There are no scalar instructions that perform the equivalent functions on aarch64; instead, the intrinsics can be implemented with vector instructions included in SVE2: expand with BDEP, compress with BEXT. Only the first lane of each vector will be used: two MOV instructions will move the inputs from GPRs into temporary vector registers, and another will do the reverse for the result. Autovectorization for this functionality is/will be implemented separately. 
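As a rough model of what these methods compute (a Python sketch of the `Integer.compress`/`expand` semantics, not the JDK or SVE2 implementation), the scalar loops below gather mask-selected bits to the low end and scatter them back — the per-bit work that a single BEXT or BDEP instruction replaces:

```python
def compress(i, mask, bits=32):
    """Gather the bits of i selected by mask into the low end of the result."""
    result, out = 0, 0
    for b in range(bits):
        if (mask >> b) & 1:
            result |= ((i >> b) & 1) << out
            out += 1
    return result

def expand(i, mask, bits=32):
    """Scatter the low bits of i to the bit positions selected by mask."""
    result, src = 0, 0
    for b in range(bits):
        if (mask >> b) & 1:
            result |= ((i >> src) & 1) << b
            src += 1
    return result

print(bin(compress(0b1110_0101, 0b1111_0000)))  # 0b1110
print(bin(expand(0b1110, 0b1111_0000)))         # 0b11100000
```

A handy property check is the round trip `expand(compress(x, m), m) == x & m`, which holds for any value and mask.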
Running on an SVE2-enabled system, I ran the following benchmarks:

org.openjdk.bench.java.lang.Integers
org.openjdk.bench.java.lang.Longs

The time for each operation was reduced to between 56% and 72% of the original run time:

Benchmark              Result  error  Unit   % against non-SVE2
Integers.expand        2.106   0.011   us/op
Integers.expand-SVE    1.431   0.009   us/op  67.95%
Longs.expand           2.606   0.006   us/op
Longs.expand-SVE       1.46    0.003   us/op  56.02%
Integers.compress      1.982   0.004   us/op
Integers.compress-SVE  1.427   0.003   us/op  72.00%
Longs.compress         2.501   0.002   us/op
Longs.compress-SVE     1.441   0.003   us/op  57.62%

------------- Commit messages: - 8294194: Create intrinsics compress and expand Changes: https://git.openjdk.org/jdk/pull/10537/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10537&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8294194 Stats: 85 lines in 2 files changed: 81 ins; 1 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/10537.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10537/head:pull/10537 PR: https://git.openjdk.org/jdk/pull/10537 From rcastanedalo at openjdk.org Mon Oct 3 14:11:44 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 3 Oct 2022 14:11:44 GMT Subject: RFR: 8290964: C2 compilation fails with assert "non-reduction loop contains reduction nodes" Message-ID: This changeset removes the [reduction information consistency assertion](https://github.com/openjdk/jdk/blob/46633e644a8ab94ceb75803bd40739214f8a60e8/src/hotspot/share/opto/superword.cpp#L2458-L2459) in `SuperWord::output()`, which has proven to report too many false positives (inconsistencies that do not lead to miscompilation) since its introduction by [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622), despite the efforts to reduce the false positive rate in [JDK-8286177](https://bugs.openjdk.org/browse/JDK-8286177). 
During the time the assertion has been enabled in our internal CI system, no true positive case (reported inconsistencies actually leading to a miscompilation or a crash) has been observed. An alternative solution would be to wait for [JDK-8287087](https://bugs.openjdk.org/browse/JDK-8287087) (work in progress), which proposes a refactoring of the reduction analysis logic that eliminates by construction the need for this assertion. This changeset proposes removing the assertion earlier, to reduce noise in test environments. #### Testing - hs-tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64). ------------- Commit messages: - Remove noisy assertion Changes: https://git.openjdk.org/jdk/pull/10535/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10535&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290964 Stats: 21 lines in 3 files changed: 0 ins; 21 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10535.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10535/head:pull/10535 PR: https://git.openjdk.org/jdk/pull/10535 From tholenstein at openjdk.org Mon Oct 3 14:16:20 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 3 Oct 2022 14:16:20 GMT Subject: RFR: JDK-8294564: IGV: IllegalArgumentException for "Difference to current graph" [v2] In-Reply-To: <2Y_2ggAHMExBs7S2EK0YHtzszsm5KMHfs6DQcm44MaE=.d313b70a-7035-405d-a233-9733fe59fc6c@github.com> References: <2Y_2ggAHMExBs7S2EK0YHtzszsm5KMHfs6DQcm44MaE=.d313b70a-7035-405d-a233-9733fe59fc6c@github.com> Message-ID: > "Difference to current graph" opens a new window with a difference graph in IGV. Unfortunately, it was throwing a `java.lang.IllegalArgumentException` > > # Overview > In IGV, for every opened graph, that is not a difference graph, the user can right-click on any other graph in the Outline and select "Difference to current graph". This opens a difference graph showing the difference between the currently opened graph and the selected graph. 
> difference to current > > If the current graph is already a difference graph, the function is disabled. Same if the selected graph is identical to the opened one. > difference disabled > > # Implementation > The problem was that the difference graph did not keep track of which two `InputGraphs` it was based on. Therefore the functions `getFirstGraph()` and `getSecondGraph` in `DiagramViewModel` did not work properly. Now, `InputGraph` keeps track of the first and second `InputGraph` if it is a difference graph (`isDiffGraph`). And `getFirstGraph()` and `getSecondGraph` are updated to return the right `InputGraph`s for difference graphs. > > --------- > ### Progress > - [ ] Change must be properly reviewed (1 review required, with at least 1 [Reviewer](https://openjdk.org/bylaws#reviewer)) > - [x] Change must not contain extraneous whitespace > - [x] Commit message must refer to an issue > > > > ### Reviewing >
Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: rename view_difference to viewDifference ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10533/files - new: https://git.openjdk.org/jdk/pull/10533/files/c983dc9a..f5aacd3b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10533&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10533&range=00-01 Stats: 3 lines in 3 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/10533.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10533/head:pull/10533 PR: https://git.openjdk.org/jdk/pull/10533 From tholenstein at openjdk.org Mon Oct 3 14:16:21 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 3 Oct 2022 14:16:21 GMT Subject: RFR: JDK-8294564: IGV: IllegalArgumentException for "Difference to current graph" [v2] In-Reply-To: <15oIUcDeADWwyAsOT8JlBj6M36TKawXipBO_SV93jao=.6cf59ebd-88eb-48bf-a587-1dd6f1f16802@github.com> References: <2Y_2ggAHMExBs7S2EK0YHtzszsm5KMHfs6DQcm44MaE=.d313b70a-7035-405d-a233-9733fe59fc6c@github.com> <15oIUcDeADWwyAsOT8JlBj6M36TKawXipBO_SV93jao=.6cf59ebd-88eb-48bf-a587-1dd6f1f16802@github.com> Message-ID: <5eXSV9XF5yGjP10XpWRawDD4cw9HSNdUkHCADrFN6xE=.8c70ec28-feb0-4997-9bb1-051dfd4cd2cd@github.com> On Mon, 3 Oct 2022 14:01:17 GMT, Roberto Castañeda Lozano wrote: > Looks good, thanks for fixing this! > > A minor nit: please rename `view_difference` to `viewDifference` for style consistency. Thanks! 
I changed it now ------------- PR: https://git.openjdk.org/jdk/pull/10533 From chagedorn at openjdk.org Mon Oct 3 14:23:18 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 3 Oct 2022 14:23:18 GMT Subject: RFR: JDK-8294564: IGV: IllegalArgumentException for "Difference to current graph" [v2] In-Reply-To: References: <2Y_2ggAHMExBs7S2EK0YHtzszsm5KMHfs6DQcm44MaE=.d313b70a-7035-405d-a233-9733fe59fc6c@github.com> Message-ID: <3KoWbdwx8dw_LwZ6ut0w00IeFixsQ_5NN3AhXRhAwbs=.d5bb32d0-e6c7-4ef5-974e-51074496b7a7@github.com> On Mon, 3 Oct 2022 14:16:20 GMT, Tobias Holenstein wrote: >> "Difference to current graph" opens a new window with a difference graph in IGV. Unfortunately, it was throwing a `java.lang.IllegalArgumentException` >> >> # Overview >> In IGV, for every opened graph, that is not a difference graph, the user can right-click on any other graph in the Outline and select "Difference to current graph". This opens a difference graph showing the difference between the currently opened graph and the selected graph. >> difference to current >> >> If the current graph is already a difference graph, the function is disabled. Same if the selected graph is identical to the opened one. >> difference disabled >> >> # Implementation >> The problem was that the difference graph did not keep track of which two `InputGraphs` it was based on. Therefore the functions `getFirstGraph()` and `getSecondGraph` in `DiagramViewModel` did not work properly. Now, `InputGraph` keeps track of the first and second `InputGraph` if it is a difference graph (`isDiffGraph`). And `getFirstGraph()` and `getSecondGraph` are updated to return the right `InputGraph`s for difference graphs. >> >> --------- >> ### Progress >> - [ ] Change must be properly reviewed (1 review required, with at least 1 [Reviewer](https://openjdk.org/bylaws#reviewer)) >> - [x] Change must not contain extraneous whitespace >> - [x] Commit message must refer to an issue >> >> >> >> ### Reviewing >>
> > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > rename view_difference to viewDifference Looks good! src/utils/IdealGraphVisualizer/Data/src/main/java/com/sun/hotspot/igv/data/InputGraph.java line 42: > 40: private Map nodeToBlock; > 41: private boolean isDiffGraph; > 42: private InputGraph firstGraph, secondGraph; Can be made `final`. I would also split this line into two lines. src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramViewModel.java line 418: > 416: } > 417: return firstGraph; > 418: Empty line can be removed. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10533 From tholenstein at openjdk.org Mon Oct 3 14:35:44 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 3 Oct 2022 14:35:44 GMT Subject: RFR: JDK-8294564: IGV: IllegalArgumentException for "Difference to current graph" [v3] In-Reply-To: <2Y_2ggAHMExBs7S2EK0YHtzszsm5KMHfs6DQcm44MaE=.d313b70a-7035-405d-a233-9733fe59fc6c@github.com> References: <2Y_2ggAHMExBs7S2EK0YHtzszsm5KMHfs6DQcm44MaE=.d313b70a-7035-405d-a233-9733fe59fc6c@github.com> Message-ID: > "Difference to current graph" opens a new window with a difference graph in IGV. Unfortunately, it was throwing a `java.lang.IllegalArgumentException` > > # Overview > In IGV, for every opened graph, that is not a difference graph, the user can right-click on any other graph in the Outline and select "Difference to current graph". This opens a difference graph showing the difference between the currently opened graph and the selected graph. > difference to current > > If the current graph is already a difference graph, the function is disabled. Same if the selected graph is identical to the opened one. > difference disabled > > # Implementation > The problem was that the difference graph did not keep track of which two `InputGraphs` it was based on. 
Therefore the functions `getFirstGraph()` and `getSecondGraph` in `DiagramViewModel` did not work properly. Now, `InputGraph` keeps track of the first and second `InputGraph` if it is a difference graph (`isDiffGraph`). And `getFirstGraph()` and `getSecondGraph` are updated to return the right `InputGraph`s for difference graphs. > > --------- > ### Progress > - [ ] Change must be properly reviewed (1 review required, with at least 1 [Reviewer](https://openjdk.org/bylaws#reviewer)) > - [x] Change must not contain extraneous whitespace > - [x] Commit message must refer to an issue > > > > ### Reviewing >
Using git > > Checkout this PR locally: \ > `$ git fetch https://git.openjdk.org/jdk pull/10533/head:pull/10533` \ > `$ git checkout pull/10533` > > Update a local copy of the PR: \ > `$ git checkout pull/10533` \ > `$ git pull https://git.openjdk.org/jdk pull/10533/head` > >
>
Using Skara CLI tools > > Checkout this PR locally: \ > `$ git pr checkout 10533` > > View PR using the GUI difftool: \ > `$ git pr show -t 10533` > >
>
Using diff file > > Download this PR as a diff file: \ > https://git.openjdk.org/jdk/pull/10533.diff > >
Tobias Holenstein has updated the pull request incrementally with two additional commits since the last revision: - split lines - remove empty line ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10533/files - new: https://git.openjdk.org/jdk/pull/10533/files/f5aacd3b..ff0477b4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10533&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10533&range=01-02 Stats: 5 lines in 2 files changed: 2 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10533.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10533/head:pull/10533 PR: https://git.openjdk.org/jdk/pull/10533 From tholenstein at openjdk.org Mon Oct 3 14:35:44 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 3 Oct 2022 14:35:44 GMT Subject: RFR: JDK-8294564: IGV: IllegalArgumentException for "Difference to current graph" [v3] In-Reply-To: <3KoWbdwx8dw_LwZ6ut0w00IeFixsQ_5NN3AhXRhAwbs=.d5bb32d0-e6c7-4ef5-974e-51074496b7a7@github.com> References: <2Y_2ggAHMExBs7S2EK0YHtzszsm5KMHfs6DQcm44MaE=.d313b70a-7035-405d-a233-9733fe59fc6c@github.com> <3KoWbdwx8dw_LwZ6ut0w00IeFixsQ_5NN3AhXRhAwbs=.d5bb32d0-e6c7-4ef5-974e-51074496b7a7@github.com> Message-ID: On Mon, 3 Oct 2022 14:19:17 GMT, Christian Hagedorn wrote: >> Tobias Holenstein has updated the pull request incrementally with two additional commits since the last revision: >> >> - split lines >> - remove empty line > > src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramViewModel.java line 418: > >> 416: } >> 417: return firstGraph; >> 418: > > Empty line can be removed. 
done ------------- PR: https://git.openjdk.org/jdk/pull/10533 From tholenstein at openjdk.org Mon Oct 3 14:38:36 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 3 Oct 2022 14:38:36 GMT Subject: RFR: JDK-8294564: IGV: IllegalArgumentException for "Difference to current graph" [v2] In-Reply-To: <3KoWbdwx8dw_LwZ6ut0w00IeFixsQ_5NN3AhXRhAwbs=.d5bb32d0-e6c7-4ef5-974e-51074496b7a7@github.com> References: <2Y_2ggAHMExBs7S2EK0YHtzszsm5KMHfs6DQcm44MaE=.d313b70a-7035-405d-a233-9733fe59fc6c@github.com> <3KoWbdwx8dw_LwZ6ut0w00IeFixsQ_5NN3AhXRhAwbs=.d5bb32d0-e6c7-4ef5-974e-51074496b7a7@github.com> Message-ID: On Mon, 3 Oct 2022 14:19:56 GMT, Christian Hagedorn wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> rename view_difference to viewDifference > > Looks good! Thank you @chhagedorn and @robcasloz for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/10533 From tholenstein at openjdk.org Mon Oct 3 14:38:39 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 3 Oct 2022 14:38:39 GMT Subject: RFR: JDK-8294564: IGV: IllegalArgumentException for "Difference to current graph" [v3] In-Reply-To: <3KoWbdwx8dw_LwZ6ut0w00IeFixsQ_5NN3AhXRhAwbs=.d5bb32d0-e6c7-4ef5-974e-51074496b7a7@github.com> References: <2Y_2ggAHMExBs7S2EK0YHtzszsm5KMHfs6DQcm44MaE=.d313b70a-7035-405d-a233-9733fe59fc6c@github.com> <3KoWbdwx8dw_LwZ6ut0w00IeFixsQ_5NN3AhXRhAwbs=.d5bb32d0-e6c7-4ef5-974e-51074496b7a7@github.com> Message-ID: <9WzLccYV3kc9kpI0q7rrnM4Up4a6snSEi9IwL0zZw2w=.4ecadbce-a703-4866-a900-c2e15f26ef1b@github.com> On Mon, 3 Oct 2022 14:13:33 GMT, Christian Hagedorn wrote: >> Tobias Holenstein has updated the pull request incrementally with two additional commits since the last revision: >> >> - split lines >> - remove empty line > > src/utils/IdealGraphVisualizer/Data/src/main/java/com/sun/hotspot/igv/data/InputGraph.java line 42: > >> 40: private Map nodeToBlock; >> 41: 
private boolean isDiffGraph; >> 42: private InputGraph firstGraph, secondGraph; > > Can be made `final`. I would also split this line into two lines. I split it into two lines. They can not be final because the constructor `InputGraph(InputGraph firstGraph, InputGraph secondGraph)` calls the second constructor `InputGraph(String name)` and both assign `firstGraph` and `secondGraph` ------------- PR: https://git.openjdk.org/jdk/pull/10533 From tholenstein at openjdk.org Mon Oct 3 14:48:59 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 3 Oct 2022 14:48:59 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v13] In-Reply-To: References: Message-ID: > Remove dead code from the IGV code base. There are many unused or redundant functions in the code Tobias Holenstein has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 47 commits: - Merge remote-tracking branch 'origin/master' into JDK-8290011 - fix imports after merge - Merge remote-tracking branch 'origin/master' into JDK-8290011 - Undo removal of toString() in Group.java - make fond constants uppercase - more code cleanup - style update 2 - delete unused Graph.java and Edge.java - Code style update - remove unused MouseOverAction - ... 
and 37 more: https://git.openjdk.org/jdk/compare/ccc1d316...c213bed5 ------------- Changes: https://git.openjdk.org/jdk/pull/10197/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=12 Stats: 4423 lines in 120 files changed: 324 ins; 3528 del; 571 mod Patch: https://git.openjdk.org/jdk/pull/10197.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10197/head:pull/10197 PR: https://git.openjdk.org/jdk/pull/10197 From tholenstein at openjdk.org Mon Oct 3 15:08:38 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 3 Oct 2022 15:08:38 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v14] In-Reply-To: References: Message-ID: > Remove dead code from the IGV code base. There are many unused or redundant functions in the code Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: merge current master ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10197/files - new: https://git.openjdk.org/jdk/pull/10197/files/c213bed5..b714dfd4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=12-13 Stats: 512 lines in 45 files changed: 88 ins; 144 del; 280 mod Patch: https://git.openjdk.org/jdk/pull/10197.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10197/head:pull/10197 PR: https://git.openjdk.org/jdk/pull/10197 From bulasevich at openjdk.org Mon Oct 3 15:09:29 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Mon, 3 Oct 2022 15:09:29 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section In-Reply-To: References: Message-ID: <2yUcnrPtVYR6qtB6WWX2Lwdisrii7JbcFOsB5LkT468=.d6314491-9a33-49f2-9217-570bb659bace@github.com> On Thu, 22 Sep 2022 20:49:55 GMT, Vladimir Kozlov wrote: > My builds failed: > > ``` > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CompressedSparceDataReadStream.java:42: warning: [lossy-conversions] implicit 
cast from int to byte in compound assignment is possibly lossy > b |= (0xFF & curr_byte_) >> (8 - byte_pos_); > ^ > ``` Yes. It was a collision with "8244681: Add a warning for possibly lossy conversion in compound assignments" change. I rebased my branch and fixed the issue. Now it should be Ok. ------------- PR: https://git.openjdk.org/jdk/pull/10025 From bulasevich at openjdk.org Mon Oct 3 15:09:36 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Mon, 3 Oct 2022 15:09:36 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v2] In-Reply-To: References: Message-ID: On Thu, 22 Sep 2022 21:11:28 GMT, Doug Simon wrote: >> Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: >> >> - warning fix and name fix >> - optimize the encoding >> - fix >> - 8293170: Improve encoding of the debuginfo nmethod section > > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CompressedSparceDataReadStream.java line 28: > >> 26: import sun.jvm.hotspot.debugger.*; >> 27: >> 28: public class CompressedSparceDataReadStream extends CompressedReadStream { > > CompressedSparceDataReadStream -> CompressedSparseDataReadStream Thank you! ------------- PR: https://git.openjdk.org/jdk/pull/10025 From bulasevich at openjdk.org Mon Oct 3 15:09:27 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Mon, 3 Oct 2022 15:09:27 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section In-Reply-To: References: Message-ID: On Thu, 22 Sep 2022 20:45:58 GMT, Dean Long wrote: > What is the performance impact of making several of the methods virtual? Good question! My experiments show that in the worst case, the performance of the debug write thread is reduced by 424->113 MB/s with virtual functions. 
Compared to compile time, this is miserable: Compilation takes 1000ms per method, while generation of 300 bytes of scopes data with virtual function (worst case) takes 3ms. And I do not see any regression with benchmarks. ------------- PR: https://git.openjdk.org/jdk/pull/10025 From qamai at openjdk.org Mon Oct 3 16:12:37 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 3 Oct 2022 16:12:37 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v8] In-Reply-To: References: Message-ID: > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. > > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D.
Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. > > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: limit tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9947/files - new: https://git.openjdk.org/jdk/pull/9947/files/0cb30b8d..1ad99969 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=06-07 Stats: 2 lines in 2 files changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From qamai at openjdk.org Mon Oct 3 16:12:42 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 3 Oct 2022 16:12:42 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v4] In-Reply-To: References: Message-ID: On Tue, 27 Sep 2022 20:56:20 GMT, Vladimir Kozlov wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 20 commits: >> >> - code styles >> - Merge branch 'master' into unsignedDiv >> - Merge branch 'master' into unsignedDiv >> - Merge branch 'master' into unsignedDiv >> - micro >> - whitespace >> - whitespace >> - large divisor >> - fix build >> - fix 32-bit >> - ... and 10 more: https://git.openjdk.org/jdk/compare/3419363e...156f65c0 > > I also asked to have separate positive and negative divisor values in JMH tests. In addition to mixed ones. @vnkozlov I have addressed your reviews in the last commit. Thanks very much. ------------- PR: https://git.openjdk.org/jdk/pull/9947 From iveresov at openjdk.org Mon Oct 3 17:43:40 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Mon, 3 Oct 2022 17:43:40 GMT Subject: Integrated: 8242115: C2 SATB barriers are not safepoint-safe In-Reply-To: References: Message-ID: On Fri, 30 Sep 2022 21:22:47 GMT, Igor Veresov wrote: > Implement load pinning, use it for g1 pre-value loads, add verification that the load control dependency is kept and that there are no safepoints between the load of the pre-value and the marking check (with the exception of the CAS intrinsics where it's permitted). This pull request has now been integrated. 
Changeset: c6e3daa5 Author: Igor Veresov URL: https://git.openjdk.org/jdk/commit/c6e3daa5fa0bdbe70e5bb63302bbce1abc5453fe Stats: 185 lines in 6 files changed: 172 ins; 2 del; 11 mod 8242115: C2 SATB barriers are not safepoint-safe Reviewed-by: kvn, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/10517 From svkamath at openjdk.org Mon Oct 3 17:49:14 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Mon, 3 Oct 2022 17:49:14 GMT Subject: RFR: 8289552: Make intrinsic conversions between bit representations of half precision values and floats [v11] In-Reply-To: <5pqC4k2fyhaYIa9d6D3Dciv2ohYR-JCPvYW7lZsbXhw=.4a3071d6-39b8-4828-86a4-9c3871401844@github.com> References: <_Ghl2lsnrBhiWvVD3TMiwGo6SfQLl6idczb1QVqLa_I=.7cfa48e2-2987-43e0-a689-0e3462e4d270@github.com> <5pqC4k2fyhaYIa9d6D3Dciv2ohYR-JCPvYW7lZsbXhw=.4a3071d6-39b8-4828-86a4-9c3871401844@github.com> Message-ID: On Fri, 30 Sep 2022 09:59:02 GMT, Bhavana Kilambi wrote: >> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comment to update test case > > Hi, would you be adding IR tests to verify the generation of the the newly introduced IR nodes? @Bhavana-Kilambi, I plan to do this in a separate PR along with the gtest. Here's the bug https://bugs.openjdk.org/browse/JDK-8293323. 
------------- PR: https://git.openjdk.org/jdk/pull/9781 From svkamath at openjdk.org Mon Oct 3 17:49:17 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Mon, 3 Oct 2022 17:49:17 GMT Subject: RFR: 8289552: Make intrinsic conversions between bit representations of half precision values and floats [v11] In-Reply-To: References: <_Ghl2lsnrBhiWvVD3TMiwGo6SfQLl6idczb1QVqLa_I=.7cfa48e2-2987-43e0-a689-0e3462e4d270@github.com> Message-ID: On Mon, 3 Oct 2022 08:34:06 GMT, Quan Anh Mai wrote: >> src/hotspot/cpu/x86/x86.ad line 3674: >> >>> 3672: %} >>> 3673: >>> 3674: instruct convF2HF_mem_reg(memory mem, regF src, kReg ktmp, rRegI rtmp) %{ >> >> You can use `kmovwl` instead which will relax the avx512bw constraint, however, you will need avx512vl for `evcvtps2ph`. Thanks. > > Rethink about it, you can get 0x01 by right shifting k0 to the right - `kshiftrw(ktmp, k0, 15)` @merykitty Thanks for the suggestion. I will update the instruct to use kmovwl. I will also experiment with kshiftrw and let you know. 
------------- PR: https://git.openjdk.org/jdk/pull/9781 From eastigeevich at openjdk.org Mon Oct 3 20:24:53 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 3 Oct 2022 20:24:53 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section In-Reply-To: <2yUcnrPtVYR6qtB6WWX2Lwdisrii7JbcFOsB5LkT468=.d6314491-9a33-49f2-9217-570bb659bace@github.com> References: <2yUcnrPtVYR6qtB6WWX2Lwdisrii7JbcFOsB5LkT468=.d6314491-9a33-49f2-9217-570bb659bace@github.com> Message-ID: On Mon, 3 Oct 2022 15:05:24 GMT, Boris Ulasevich wrote: >> My builds failed: >> >> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CompressedSparceDataReadStream.java:42: warning: [lossy-conversions] implicit cast from int to byte in compound assignment is possibly lossy >> b |= (0xFF & curr_byte_) >> (8 - byte_pos_); >> ^ > >> My builds failed: >> >> ``` >> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CompressedSparceDataReadStream.java:42: warning: [lossy-conversions] implicit cast from int to byte in compound assignment is possibly lossy >> b |= (0xFF & curr_byte_) >> (8 - byte_pos_); >> ^ >> ``` > > Yes. It was a collision with "8244681: Add a warning for possibly lossy conversion in compound assignments" change. I rebased my branch and fixed the issue. Now it should be Ok. @bulasevich, it would be useful to have some examples of cases either in the JBS issues or the PR. ------------- PR: https://git.openjdk.org/jdk/pull/10025 From eastigeevich at openjdk.org Mon Oct 3 20:38:26 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 3 Oct 2022 20:38:26 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v2] In-Reply-To: References: Message-ID: On Fri, 23 Sep 2022 14:32:12 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). 
Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: > > - warning fix and name fix > - optimize the encoding > - fix > - 8293170: Improve encoding of the debuginfo nmethod section src/hotspot/share/compiler/oopMap.hpp line 377: > 375: OopMapValue current() { return _omv; } > 376: #ifdef ASSERT > 377: int stream_position() { return _stream.position(); } This change is an example that something is wrong with the design. There is a concrete class `CompressedReadStream` with expected behaviour of `position`: no changes to `_stream`. We have to break this contract to be able to compile `OopMapStream`. ------------- PR: https://git.openjdk.org/jdk/pull/10025 From dlong at openjdk.org Mon Oct 3 20:49:55 2022 From: dlong at openjdk.org (Dean Long) Date: Mon, 3 Oct 2022 20:49:55 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section In-Reply-To: References: Message-ID: On Mon, 3 Oct 2022 15:04:10 GMT, Boris Ulasevich wrote: > > What is the performance impact of making several of the methods virtual? > > Good question! My experiments show that in the worst case, the performance of the debug write thread is reduced by 424->113 MB/s with virtual functions. Compared to compile time, this is miserable: ?ompilation takes 1000ms per method, while generation of 300 bytes of scopes data with virtual function (worst case) takes 3ms. And I do not see any regression with benchmarks. I was wondering more about read performance. I would expect that the debuginfo could be read many more times than it is written. Also, from 424 to 113 seems like a very large slowdown. 
------------- PR: https://git.openjdk.org/jdk/pull/10025 From kvn at openjdk.org Tue Oct 4 00:25:21 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 4 Oct 2022 00:25:21 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v2] In-Reply-To: References: Message-ID: On Sat, 1 Oct 2022 06:32:11 GMT, Quan Anh Mai wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: >> >> - Merge branch 'master' into peephole >> - some fix >> - add benchmark >> - Merge branch 'master' into peephole >> - refactor >> - fix? >> - refactor >> - attempt >> - attempt >> - build fix >> - ... and 13 more: https://git.openjdk.org/jdk/compare/72bd41b8...78b4a3f2 > > Thanks a lot for your testing, can you run the tests again, please? @merykitty I started testing for version 03. ------------- PR: https://git.openjdk.org/jdk/pull/8025 From kvn at openjdk.org Tue Oct 4 00:31:24 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 4 Oct 2022 00:31:24 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v8] In-Reply-To: References: Message-ID: On Mon, 3 Oct 2022 16:12:37 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. 
>> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > limit tests Good. 
I submitted testing for version 07. ------------- PR: https://git.openjdk.org/jdk/pull/9947 From jbhateja at openjdk.org Tue Oct 4 06:09:11 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 4 Oct 2022 06:09:11 GMT Subject: RFR: 8292898: [vectorapi] Unify vector mask cast operation [v5] In-Reply-To: References: <9DHNZglc1nl35eN3euIu6naGNrE0TK8BC3Pqo8nO8-k=.b6453f1f-8af3-47d5-86fd-3268da5e5347@github.com> Message-ID: <5qCRwvjc4xVc8ub7swVXg2XibftsERW4kU_ElQnsFz0=.a79cd2cd-8dd6-4471-96cf-cd74922e085d@github.com> On Thu, 29 Sep 2022 07:35:12 GMT, Jatin Bhateja wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove untaken code paths on x86 match rules > > Hi @XiaohongGong , Thanks!, changes looks good to me, an IR framework test will complement the patch. > Hi @jatin-bhateja , the IR test has been added. Could you please help to review again? Thanks a lot! Some of the IR tests like testByte64ToLong512 are currently are failing on KNL due to following check https://github.com/openjdk/jdk/pull/10192/files#diff-33d0866101d899687e04303fb2232574f2cb796ce060528a243ebdc9903b01b1L2484 since source and destination ideal types are different (TypeVect vs TypeVectMask), can you kindly change the feature check for relevant IR tests to avx512vl till we remove that limitation. 
------------- PR: https://git.openjdk.org/jdk/pull/10192 From jbhateja at openjdk.org Tue Oct 4 06:52:56 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 4 Oct 2022 06:52:56 GMT Subject: RFR: 8289552: Make intrinsic conversions between bit representations of half precision values and floats [v11] In-Reply-To: References: <_Ghl2lsnrBhiWvVD3TMiwGo6SfQLl6idczb1QVqLa_I=.7cfa48e2-2987-43e0-a689-0e3462e4d270@github.com> Message-ID: On Mon, 3 Oct 2022 17:47:00 GMT, Smita Kamath wrote: >> Rethink about it, you can get 0x01 by right shifting k0 to the right - `kshiftrw(ktmp, k0, 15)` > > @merykitty Thanks for the suggestion. I will update the instruct to use kmovwl. I will also experiment with kshiftrw and let you know. > You can use `kmovwl` instead which will relax the avx512bw constraint, however, you will need avx512vl for `evcvtps2ph`. Thanks. Yes, in general all AVX512VL targets support AVX512BW, but cloud instances give freedom to enable custom features. Regarding K0, as per section "15.6.1.1" of SDM, expectation is that K0 can appear in source and destination of regular non predication context, k0 should always contain all true mask so it should be unmodifiable for subsequent usages i.e. should not be present as destination of a mask manipulating instruction. Your suggestion is to have that in source but it may not work either. Changing existing sequence to use kmovw and replace AVX512BW with AVX512VL will again mean introducing an additional predication check for this pattern. 
------------- PR: https://git.openjdk.org/jdk/pull/9781 From thartmann at openjdk.org Tue Oct 4 07:16:06 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 4 Oct 2022 07:16:06 GMT Subject: RFR: 8288302: Shenandoah: SIGSEGV in vm maybe related to jit compiling xerces In-Reply-To: <4h2hq0pbUiyUWAA4ddq8Z0FaQ22q_uQk2E93ci7fJ1E=.d87c892f-a084-470e-a91d-acd6ab86b612@github.com> References: <4h2hq0pbUiyUWAA4ddq8Z0FaQ22q_uQk2E93ci7fJ1E=.d87c892f-a084-470e-a91d-acd6ab86b612@github.com> Message-ID: On Thu, 29 Sep 2022 14:31:16 GMT, Roland Westrelin wrote: > During igvn, at a heap stable test, a dominating heap stable test is > found that can be used to optimize out the current one. But this area > of the graph is actually dying and the current test has lost one of > its projection already, something the logic doesn't expect and which > causes the crash. The fix I propose is to simply detect that the heap > stable test is dying and skip the transformation. Looks good. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10491 From thartmann at openjdk.org Tue Oct 4 07:17:22 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 4 Oct 2022 07:17:22 GMT Subject: RFR: 8290964: C2 compilation fails with assert "non-reduction loop contains reduction nodes" In-Reply-To: References: Message-ID: On Mon, 3 Oct 2022 13:41:07 GMT, Roberto Castañeda Lozano wrote: > This changeset removes the [reduction information consistency assertion](https://github.com/openjdk/jdk/blob/46633e644a8ab94ceb75803bd40739214f8a60e8/src/hotspot/share/opto/superword.cpp#L2458-L2459) in `SuperWord::output()`, which has proven to report too many false positives (inconsistencies that do not lead to miscompilation) since its introduction by [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622), despite the efforts to reduce the false positive rate in [JDK-8286177](https://bugs.openjdk.org/browse/JDK-8286177).
During the time the assertion has been enabled in our internal CI system, no true positive case (reported inconsistencies actually leading to a miscompilation or a crash) has been observed. > > An alternative solution would be to wait for [JDK-8287087](https://bugs.openjdk.org/browse/JDK-8287087) (work in progress), which proposes a refactoring of the reduction analysis logic that eliminates by construction the need for this assertion. This changeset proposes removing the assertion earlier, to reduce noise in test environments. > > #### Testing > > - hs-tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64). That looks reasonable to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10535 From rcastanedalo at openjdk.org Tue Oct 4 07:25:08 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 4 Oct 2022 07:25:08 GMT Subject: RFR: 8290964: C2 compilation fails with assert "non-reduction loop contains reduction nodes" In-Reply-To: References: Message-ID: <0Zyd4FLb7gBZls2YwS4QixvJwD3VffNoKIpyX1Jsbk4=.cabc8253-684a-413d-ae4a-fda59f883c39@github.com> On Tue, 4 Oct 2022 07:13:26 GMT, Tobias Hartmann wrote: > That looks reasonable to me. Thanks, Tobias. ------------- PR: https://git.openjdk.org/jdk/pull/10535 From rcastanedalo at openjdk.org Tue Oct 4 07:31:04 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 4 Oct 2022 07:31:04 GMT Subject: RFR: JDK-8294564: IGV: IllegalArgumentException for "Difference to current graph" [v3] In-Reply-To: References: <2Y_2ggAHMExBs7S2EK0YHtzszsm5KMHfs6DQcm44MaE=.d313b70a-7035-405d-a233-9733fe59fc6c@github.com> Message-ID: On Mon, 3 Oct 2022 14:35:44 GMT, Tobias Holenstein wrote: >> "Difference to current graph" opens a new window with a difference graph in IGV. 
Unfortunately, it was throwing a `java.lang.IllegalArgumentException` >> >> # Overview >> In IGV, for every opened graph, that is not a difference graph, the user can right-click on any other graph in the Outline and select "Difference to current graph". This opens a difference graph showing the difference between the currently opened graph and the selected graph. >> difference to current >> >> If the current graph is already a difference graph, the function is disabled. Same if the selected graph is identical to the opened one. >> difference disabled >> >> # Implementation >> The problem was that the difference graph did not keep track of which two `InputGraphs` it was based on. Therefore the functions `getFirstGraph()` and `getSecondGraph` in `DiagramViewModel` did not work properly. Now, `InputGraph` keeps track of the first and second `InputGraph` if it is a difference graph (`isDiffGraph`). And `getFirstGraph()` and `getSecondGraph` are updated to return the right `InputGraph`s for difference graphs. >> >> --------- >> ### Progress >> - [ ] Change must be properly reviewed (1 review required, with at least 1 [Reviewer](https://openjdk.org/bylaws#reviewer)) >> - [x] Change must not contain extraneous whitespace >> - [x] Commit message must refer to an issue >> >> >> >> ### Reviewing >>
Using git >> >> Checkout this PR locally: \ >> `$ git fetch https://git.openjdk.org/jdk pull/10533/head:pull/10533` \ >> `$ git checkout pull/10533` >> >> Update a local copy of the PR: \ >> `$ git checkout pull/10533` \ >> `$ git pull https://git.openjdk.org/jdk pull/10533/head` >> >>
>>
Using Skara CLI tools >> >> Checkout this PR locally: \ >> `$ git pr checkout 10533` >> >> View PR using the GUI difftool: \ >> `$ git pr show -t 10533` >> >>
>>
Using diff file >> >> Download this PR as a diff file: \ >> https://git.openjdk.org/jdk/pull/10533.diff >> >>
> > Tobias Holenstein has updated the pull request incrementally with two additional commits since the last revision: > > - split lines > - remove empty line Marked as reviewed by rcastanedalo (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10533 From tholenstein at openjdk.org Tue Oct 4 07:33:51 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 4 Oct 2022 07:33:51 GMT Subject: Integrated: JDK-8294564: IGV: IllegalArgumentException for "Difference to current graph" In-Reply-To: <2Y_2ggAHMExBs7S2EK0YHtzszsm5KMHfs6DQcm44MaE=.d313b70a-7035-405d-a233-9733fe59fc6c@github.com> References: <2Y_2ggAHMExBs7S2EK0YHtzszsm5KMHfs6DQcm44MaE=.d313b70a-7035-405d-a233-9733fe59fc6c@github.com> Message-ID: On Mon, 3 Oct 2022 13:05:52 GMT, Tobias Holenstein wrote: > "Difference to current graph" opens a new window with a difference graph in IGV. Unfortunately, it was throwing a `java.lang.IllegalArgumentException` > > # Overview > In IGV, for every opened graph, that is not a difference graph, the user can right-click on any other graph in the Outline and select "Difference to current graph". This opens a difference graph showing the difference between the currently opened graph and the selected graph. > difference to current > > If the current graph is already a difference graph, the function is disabled. Same if the selected graph is identical to the opened one. > difference disabled > > # Implementation > The problem was that the difference graph did not keep track of which two `InputGraphs` it was based on. Therefore the functions `getFirstGraph()` and `getSecondGraph` in `DiagramViewModel` did not work properly. Now, `InputGraph` keeps track of the first and second `InputGraph` if it is a difference graph (`isDiffGraph`). And `getFirstGraph()` and `getSecondGraph` are updated to return the right `InputGraph`s for difference graphs. 
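A minimal sketch of the bookkeeping described in the Implementation section above — hypothetical code rather than the actual IGV sources, with `InputGraph`, `isDiffGraph`, `getFirstGraph()` and `getSecondGraph()` being the only names taken from the message: the difference graph simply remembers the two graphs it was built from.

```java
// Hedged sketch, not the real IGV implementation: a difference graph
// stores the two input graphs it was derived from, so that
// getFirstGraph()/getSecondGraph() can recover them later.
public class DiffGraphSketch {

    static class InputGraph {
        final String name;
        private InputGraph firstGraph;   // non-null only for difference graphs
        private InputGraph secondGraph;

        InputGraph(String name) {
            this.name = name;
        }

        static InputGraph createDifference(InputGraph first, InputGraph second) {
            InputGraph diff = new InputGraph(first.name + " vs " + second.name);
            diff.firstGraph = first;
            diff.secondGraph = second;
            return diff;
        }

        boolean isDiffGraph() {
            return firstGraph != null;
        }

        // For a plain graph the "pair" degenerates to the graph itself.
        InputGraph getFirstGraph()  { return isDiffGraph() ? firstGraph  : this; }
        InputGraph getSecondGraph() { return isDiffGraph() ? secondGraph : this; }
    }

    public static void main(String[] args) {
        InputGraph a = new InputGraph("After Parsing");
        InputGraph b = new InputGraph("Before Matching");
        InputGraph diff = InputGraph.createDifference(a, b);

        if (!diff.isDiffGraph()) throw new AssertionError();
        if (diff.getFirstGraph() != a) throw new AssertionError();
        if (diff.getSecondGraph() != b) throw new AssertionError();
        if (a.isDiffGraph()) throw new AssertionError();
    }
}
```

With bookkeeping of this shape, an action like "Difference to current graph" can be disabled whenever `isDiffGraph()` already returns true, as the message describes.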
> > --------- > ### Progress > - [ ] Change must be properly reviewed (1 review required, with at least 1 [Reviewer](https://openjdk.org/bylaws#reviewer)) > - [x] Change must not contain extraneous whitespace > - [x] Commit message must refer to an issue > > > > ### Reviewing >
This pull request has now been integrated. Changeset: f957ce99 Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/f957ce995969a39827c17023b083d3bd84a1317c Stats: 175 lines in 9 files changed: 95 ins; 21 del; 59 mod 8294564: IGV: IllegalArgumentException for "Difference to current graph" Reviewed-by: rcastanedalo, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/10533 From roland at openjdk.org Tue Oct 4 08:03:07 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 4 Oct 2022 08:03:07 GMT Subject: RFR: 8288302: Shenandoah: SIGSEGV in vm maybe related to jit compiling xerces In-Reply-To: References: <4h2hq0pbUiyUWAA4ddq8Z0FaQ22q_uQk2E93ci7fJ1E=.d87c892f-a084-470e-a91d-acd6ab86b612@github.com> Message-ID: On Mon, 3 Oct 2022 13:00:49 GMT, Aleksey Shipilev wrote: >> During igvn, at a heap stable test, a dominating heap stable test is >> found that can be used to optimize out the current one. But this area >> of the graph is actually dying and the current test has lost one of >> its projections already, something the logic doesn't expect and which >> causes the crash. The fix I propose is to simply detect that the heap >> stable test is dying and skip the transformation. > > All right, looks reasonable. thanks @shipilev @TobiHartmann for the reviews. ------------- PR: https://git.openjdk.org/jdk/pull/10491 From roland at openjdk.org Tue Oct 4 08:12:52 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 4 Oct 2022 08:12:52 GMT Subject: Integrated: 8288302: Shenandoah: SIGSEGV in vm maybe related to jit compiling xerces In-Reply-To: <4h2hq0pbUiyUWAA4ddq8Z0FaQ22q_uQk2E93ci7fJ1E=.d87c892f-a084-470e-a91d-acd6ab86b612@github.com> References: <4h2hq0pbUiyUWAA4ddq8Z0FaQ22q_uQk2E93ci7fJ1E=.d87c892f-a084-470e-a91d-acd6ab86b612@github.com> Message-ID: On Thu, 29 Sep 2022 14:31:16 GMT, Roland Westrelin wrote: > During igvn, at a heap stable test, a dominating heap stable test is > found that can be used to optimize out the current one.
But this area > of the graph is actually dying and the current test has lost one of > its projections already, something the logic doesn't expect and which > causes the crash. The fix I propose is to simply detect that the heap > stable test is dying and skip the transformation. This pull request has now been integrated. Changeset: bf39b184 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/bf39b184ca8aabcc51dc6ea4eee046c69b278710 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod 8288302: Shenandoah: SIGSEGV in vm maybe related to jit compiling xerces Reviewed-by: shade, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/10491 From roland at openjdk.org Tue Oct 4 08:25:07 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 4 Oct 2022 08:25:07 GMT Subject: RFR: 8292780: misc tests failed "assert(false) failed: graph should be schedulable" In-Reply-To: References: Message-ID: On Thu, 29 Sep 2022 19:51:52 GMT, Dean Long wrote: > I think it's fine to integrate this now because it fixes the problem, but I thought @jatin-bhateja might have an improvement to the fix that doesn't involve bailing out of the optimization. Maybe file a follow-up RFE for that? I filed: https://bugs.openjdk.org/browse/JDK-8294750 Thanks @dean-long @TobiHartmann @chhagedorn for the reviews.
------------- PR: https://git.openjdk.org/jdk/pull/10410 From roland at openjdk.org Tue Oct 4 08:38:15 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 4 Oct 2022 08:38:15 GMT Subject: Integrated: 8292780: misc tests failed "assert(false) failed: graph should be schedulable" In-Reply-To: References: Message-ID: On Fri, 23 Sep 2022 15:48:21 GMT, Roland Westrelin wrote: > PhaseMacroExpand::generate_partial_inlining_block() adds > LoadVectorMasked nodes to the IR graph and then > LoadNode::split_through_phi() tries to split one of them through phi > but because that method ignores the mask input to that LoadNode (it > only knows about control, memory and address inputs) the resulting > graph is broken. The fix I propose is to skip > LoadNode::split_through_phi() for those LoadVector nodes that have > extra inputs. This pull request has now been integrated. Changeset: 16047e83 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/16047e8308a845436f7003e09e604a88bb370632 Stats: 52 lines in 2 files changed: 51 ins; 0 del; 1 mod 8292780: misc tests failed "assert(false) failed: graph should be schedulable" Reviewed-by: dlong, chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/10410 From chagedorn at openjdk.org Tue Oct 4 09:11:14 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 4 Oct 2022 09:11:14 GMT Subject: RFR: 8290964: C2 compilation fails with assert "non-reduction loop contains reduction nodes" In-Reply-To: References: Message-ID: On Mon, 3 Oct 2022 13:41:07 GMT, Roberto Castañeda Lozano wrote: > This changeset removes the [reduction information consistency assertion](https://github.com/openjdk/jdk/blob/46633e644a8ab94ceb75803bd40739214f8a60e8/src/hotspot/share/opto/superword.cpp#L2458-L2459) in `SuperWord::output()`, which has proven to report too many false positives (inconsistencies that do not lead to miscompilation) since its introduction by [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622),
despite the efforts to reduce the false positive rate in [JDK-8286177](https://bugs.openjdk.org/browse/JDK-8286177). During the time the assertion has been enabled in our internal CI system, no true positive case (reported inconsistencies actually leading to a miscompilation or a crash) has been observed. > > An alternative solution would be to wait for [JDK-8287087](https://bugs.openjdk.org/browse/JDK-8287087) (work in progress), which proposes a refactoring of the reduction analysis logic that eliminates by construction the need for this assertion. This changeset proposes removing the assertion earlier, to reduce noise in test environments. > > #### Testing > > - hs-tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64). That makes sense, looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10535 From qamai at openjdk.org Tue Oct 4 09:11:28 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 4 Oct 2022 09:11:28 GMT Subject: RFR: 8289552: Make intrinsic conversions between bit representations of half precision values and floats [v11] In-Reply-To: References: <_Ghl2lsnrBhiWvVD3TMiwGo6SfQLl6idczb1QVqLa_I=.7cfa48e2-2987-43e0-a689-0e3462e4d270@github.com> Message-ID: On Tue, 4 Oct 2022 06:49:53 GMT, Jatin Bhateja wrote: >> @merykitty Thanks for the suggestion. I will update the instruct to use kmovwl. I will also experiment with kshiftrw and let you know. > >> You can use `kmovwl` instead which will relax the avx512bw constraint, however, you will need avx512vl for `evcvtps2ph`. Thanks. > > Yes, in general all AVX512VL targets support AVX512BW, but cloud instances give freedom to enable custom features. Regarding K0, as per section "15.6.1.1" of the SDM, the expectation is that K0 can appear as a source or destination in a regular non-predication context; k0 should always contain an all-true mask, so it should be unmodifiable by subsequent usages, i.e.
it should not appear as the destination of a mask-manipulating instruction. Your suggestion is to have it as a source, but that may not work either. Changing the existing sequence to use kmovw and replacing AVX512BW with AVX512VL would again mean introducing an additional predication check for this pattern. Ah, I get it: the encoding of k0 is treated specially in predicated instructions to refer to an all-set mask, but the register itself may not actually contain that value. So usage in `kshiftrw` may fail. In that case I think we can generate an all-set mask on the fly using `kxnorw(ktmp, ktmp)` to save a GPR on this occasion. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/9781 From rcastanedalo at openjdk.org Tue Oct 4 09:24:48 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 4 Oct 2022 09:24:48 GMT Subject: RFR: 8290964: C2 compilation fails with assert "non-reduction loop contains reduction nodes" In-Reply-To: References: Message-ID: On Tue, 4 Oct 2022 09:07:45 GMT, Christian Hagedorn wrote: > That makes sense, looks good!
------------- Marked as reviewed by lucy (Reviewer). PR: https://git.openjdk.org/jdk/pull/10484 From mdoerr at openjdk.org Tue Oct 4 10:15:21 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 4 Oct 2022 10:15:21 GMT Subject: RFR: 8294578: [PPC64] C2: Missing is_oop information when using disjoint compressed oops mode [v2] In-Reply-To: References: Message-ID: On Thu, 29 Sep 2022 13:14:58 GMT, Martin Doerr wrote: >> Fix missing is_oop information shown by assertion `assert(t->base() == Type::Int || t->base() == Type::Half || t->base() == Type::FloatCon || t->base() == Type::FloatBot) failed: Unexpected type`. See JBS issue for details. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Revert decodeN_Disjoint_isel_Ex part. Not needed. Thanks for the review! ------------- PR: https://git.openjdk.org/jdk/pull/10484 From mdoerr at openjdk.org Tue Oct 4 10:16:45 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 4 Oct 2022 10:16:45 GMT Subject: Integrated: 8294578: [PPC64] C2: Missing is_oop information when using disjoint compressed oops mode In-Reply-To: References: Message-ID: On Thu, 29 Sep 2022 11:07:34 GMT, Martin Doerr wrote: > Fix missing is_oop information shown by assertion `assert(t->base() == Type::Int || t->base() == Type::Half || t->base() == Type::FloatCon || t->base() == Type::FloatBot) failed: Unexpected type`. See JBS issue for details. This pull request has now been integrated. 
Changeset: f03934e2 Author: Martin Doerr URL: https://git.openjdk.org/jdk/commit/f03934e270aa86de3c6832f9754caba05726726b Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod 8294578: [PPC64] C2: Missing is_oop information when using disjoint compressed oops mode Reviewed-by: shade, lucy ------------- PR: https://git.openjdk.org/jdk/pull/10484 From tholenstein at openjdk.org Tue Oct 4 11:30:50 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 4 Oct 2022 11:30:50 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v15] In-Reply-To: References: Message-ID: > Remove dead code from the IGV code base. There are many unused or redundant functions in the code Tobias Holenstein has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 49 commits: - Merge remote-tracking branch 'origin/master' into JDK-8290011 - merge current master - Merge remote-tracking branch 'origin/master' into JDK-8290011 - fix imports after merge - Merge remote-tracking branch 'origin/master' into JDK-8290011 - Undo removal of toString() in Group.java - make fond constants uppercase - more code cleanup - style update 2 - delete unused Graph.java and Edge.java - ... 
and 39 more: https://git.openjdk.org/jdk/compare/3b476a17...7faec1a5 ------------- Changes: https://git.openjdk.org/jdk/pull/10197/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=14 Stats: 4759 lines in 137 files changed: 344 ins; 3602 del; 813 mod Patch: https://git.openjdk.org/jdk/pull/10197.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10197/head:pull/10197 PR: https://git.openjdk.org/jdk/pull/10197 From rrich at openjdk.org Tue Oct 4 12:05:06 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 4 Oct 2022 12:05:06 GMT Subject: RFR: 8289925: Shared code shouldn't reference the platform specific method frame::interpreter_frame_last_sp() [v4] In-Reply-To: <8WISmA_G-lrsl2anFmiC9ENTg3D2e7E9P1IpZxelphk=.7f9790e7-aecf-4481-ada1-a53ea96e147a@github.com> References: <8WISmA_G-lrsl2anFmiC9ENTg3D2e7E9P1IpZxelphk=.7f9790e7-aecf-4481-ada1-a53ea96e147a@github.com> Message-ID: On Wed, 21 Sep 2022 07:04:08 GMT, Richard Reingruber wrote: >> The method `frame::interpreter_frame_last_sp()` is a platform method in the sense that it is not declared in a shared header file. It is declared and defined on some platforms though (x86 and aarch64 I think). >> >> `frame::interpreter_frame_last_sp()` existed on these platforms before vm continuations (aka loom). Shared code that is part of the vm continuations implementation references it. This breaks the platform abstraction. >> >> Using unextended_sp is problematic too because there are no guarantees by the platform abstraction layer for it. In fact unextended_sp < sp is possible on ppc64 and aarch64. >> >> This fix changes the callers of is_sp_in_continuation() >> >> ```c++ >> static inline bool is_sp_in_continuation(const ContinuationEntry* entry, intptr_t* const sp) { >> return entry->entry_sp() > sp; >> } >> ``` >> >> to pass the actual sp.
This is correct because the following is true on all platforms: >> >> ```c++ >> a.sp() > E->entry_sp() > b.sp() > c.sp() >> ``` >> >> where `a`, `b`, `c` are stack frames in call order and `E` is a ContinuationEntry. `a` is the caller frame of the continuation entry frame that corresponds to `E`. >> >> is_sp_in_continuation() will then return true for `b.sp()` and `c.sp()` and false for `a.sp()` >> >> Testing: hotspot_loom and jdk_loom on x86_64 and aarch64. > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > Remove `Unimplemented` definitions of interpreter_frame_last_sp I think this is ready to be integrated. I'll do so tomorrow if there are no further comments until then. Thanks again for your reviews @fisk and @dean-long ------------- PR: https://git.openjdk.org/jdk/pull/9411 From bulasevich at openjdk.org Tue Oct 4 13:02:09 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Tue, 4 Oct 2022 13:02:09 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section In-Reply-To: References: Message-ID: <7xYMYWazVI1VHQK0g9GafFqeH9kA-EWZfsOHhy4cIXs=.8ad5ab4f-6e3a-4b25-92a7-c027a390e018@github.com> On Mon, 3 Oct 2022 20:46:07 GMT, Dean Long wrote: > > > What is the performance impact of making several of the methods virtual? > > > > > > Good question! My experiments show that in the worst case, the performance of the debug write thread is reduced by 424->113 MB/s with virtual functions. Compared to compile time, this is miserable: Compilation takes 1000ms per method, while generation of 300 bytes of scopes data with virtual function (worst case) takes 3ms. And I do not see any regression with benchmarks. > > I was wondering more about read performance. I would expect that the debuginfo could be read many more times than it is written. Also, from 424 to 113 seems like a very large slowdown. Right.
With counters in virtual methods, I see that reading debug information is less frequent than writing. Anyway. Let me rewrite code without virtual functions. ------------- PR: https://git.openjdk.org/jdk/pull/10025 From tholenstein at openjdk.org Tue Oct 4 13:05:23 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 4 Oct 2022 13:05:23 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v16] In-Reply-To: References: Message-ID: > Remove dead code from the IGV code base. There are many unused or redundant functions in the code Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: cleanup after merger with master ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10197/files - new: https://git.openjdk.org/jdk/pull/10197/files/7faec1a5..a7271acf Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=15 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=14-15 Stats: 340 lines in 13 files changed: 263 ins; 28 del; 49 mod Patch: https://git.openjdk.org/jdk/pull/10197.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10197/head:pull/10197 PR: https://git.openjdk.org/jdk/pull/10197 From tholenstein at openjdk.org Tue Oct 4 13:30:38 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 4 Oct 2022 13:30:38 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v17] In-Reply-To: References: Message-ID: > Cleanup of the code in IGV without changing the functionality. 
> > - removed dead code (unused classes, functions, variables) from the IGV code base > - merged (and removed) redundant functions > - added explicit position arguments to `layer.xml` - This avoids the position warning during building of IGV > - ordered the inputs alphabetically, and used wildcards if >= 5 imports of a particular package > - made class variables `final` whenever possible > - removed `this.` in `this.function()` function calls when it was not needed > - used lambdas instead of anonymous class if possible > - fixed whitespace issues (e.g. double whitespace) > - removed not needed copy of `RangeSliderModel tempModel` in `RangeSliderModel.java` > - changed `EditorTopComponent` to take `InputGraph` as argument in constructor instead of `Diagram` and moved the creation of the `Diagram` to `DiagramViewModel.java` to increase encapsulation/modularity Tobias Holenstein has updated the pull request incrementally with two additional commits since the last revision: - remove intersects - remove trailing whitespace in DiagramViewer ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10197/files - new: https://git.openjdk.org/jdk/pull/10197/files/a7271acf..8ec9675d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=16 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=15-16 Stats: 12 lines in 2 files changed: 0 ins; 9 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/10197.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10197/head:pull/10197 PR: https://git.openjdk.org/jdk/pull/10197 From tholenstein at openjdk.org Tue Oct 4 13:39:36 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 4 Oct 2022 13:39:36 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v18] In-Reply-To: References: Message-ID: > Cleanup of the code in IGV without changing the functionality.
> > - removed dead code (unused classes, functions, variables) from the IGV code base > - merged (and removed) redundant functions > - added explicit position arguments to `layer.xml` - This avoids the position warning during building of IGV > - ordered the inputs alphabetically, and used wildcards if >= 5 imports of a particular package > - made class variables `final` whenever possible > - removed `this.` in `this.function()` function calls when it was not needed > - used lambdas instead of anonymous class if possible > - fixed whitespace issues (e.g. double whitespace) > - removed not needed copy of `RangeSliderModel tempModel` in `RangeSliderModel.java` > - changed `EditorTopComponent` to take `InputGraph` as argument in constructor instead of `Diagram` and moved the creation of the `Diagram` to `DiagramViewModel.java` to increase encapsulation/modularity Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: re-add hideDuplicates.png ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10197/files - new: https://git.openjdk.org/jdk/pull/10197/files/8ec9675d..891adb1b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=17 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=16-17 Stats: 0 lines in 1 file changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10197.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10197/head:pull/10197 PR: https://git.openjdk.org/jdk/pull/10197 From tholenstein at openjdk.org Tue Oct 4 14:05:28 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 4 Oct 2022 14:05:28 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v18] In-Reply-To: <0NA_dHwRej6PYoU-ejEEJrX7DxdG6AMMh6KC90cU0Yk=.2fb1e525-5561-4cbd-8221-f729378b1755@github.com> References: <0NA_dHwRej6PYoU-ejEEJrX7DxdG6AMMh6KC90cU0Yk=.2fb1e525-5561-4cbd-8221-f729378b1755@github.com> Message-ID: On Thu, 8 Sep 2022 12:43:47 GMT, Roberto
Castañeda Lozano wrote: > This changeset goes beyond trivial cleanups (removing dead code, trailing whitespace, legacy functionality, etc.), and it would help if you could summarize (and motivate if necessary) the main changes in it. > > I found that switching among opened graphs from different groups does not update anymore the highlighted graphs in the Outline window, nor the content of the Bytecode and Control Flow windows. Maybe an effect of splitting #10164? > > A few more comments: > > * I would also prefer to leave the `toString()` methods in, for ease of debugging. > * Why are some tests in `InputGraphTest.java` removed? Were they not run before? > * I agree with enforcing alphabetic order of imports, but I would personally prefer to import explicitly all individual classes rather than using wildcards (matter of taste though, I do not think we have any style guidelines for tools like IGV). > * Please update the copyright headers, at least for files with non-trivial changes. Hi @robcasloz , I updated the PR to the current master to resolve merge conflicts. Further, I re-added `toString()` and printing functions as well as all the removed tests (`InputGraphTest.java`). Regarding wildcards in imports I decided to leave them in since they were already used in the code before. The book `Code Complete` suggests using wildcards for imports, whereas the Google style guide for Java argues against wildcards. I could not find anything in our style-guide, but personally I have no strong opinion on this matter. I think the copyright headers are updated for all non-trivial changes, or where is it missing? The PR should now be ready to be reviewed again. Thanks!
------------- PR: https://git.openjdk.org/jdk/pull/10197 From bulasevich at openjdk.org Tue Oct 4 16:45:38 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Tue, 4 Oct 2022 16:45:38 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v3] In-Reply-To: References: Message-ID: > The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. > > Testing: jtreg hotspot&jdk, Renaissance benchmarks Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: rewrite code without virtual functions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10025/files - new: https://git.openjdk.org/jdk/pull/10025/files/c2e05c89..fa82262c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=01-02 Stats: 101 lines in 2 files changed: 68 ins; 9 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/10025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10025/head:pull/10025 PR: https://git.openjdk.org/jdk/pull/10025 From rcastanedalo at openjdk.org Tue Oct 4 17:01:30 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 4 Oct 2022 17:01:30 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v18] In-Reply-To: References: Message-ID: <27ETKLx8aWB7ixXbmfDXnGFFcFQNNnOLx9WHvPDQPNQ=.17cf231d-c617-4a15-ac96-077995b0e9ce@github.com> On Tue, 4 Oct 2022 13:39:36 GMT, Tobias Holenstein wrote: >> Cleanup of the code in IGV without changing the functionality. 
>> >> - removed dead code (unused classes, functions, variables) from the IGV code base >> - merged (and removed) redundant functions >> - added explicit position arguments to `layer.xml` - This avoids the position warning during building of IGV >> - ordered the inputs alphabetically, and used wildcards if >= 5 imports of a particular package >> - made class variables `final` whenever possible >> - removed `this.` in `this.function()` function calls when it was not needed >> - used lambdas instead of anonymous class if possible >> - fixed whitespace issues (e.g. double whitespace) >> - removed not needed copy of `RangeSliderModel tempModel` in `RangeSliderModel.java` >> - changed `EditorTopComponent` to take `InputGraph` as argument in constructor instead of `Diagram` and moved the creation of the `Diagram` to `DiagramViewModel.java` to increase encapsulation/modularity > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > re-add hideDuplicates.png Thanks for addressing my comments, Tobias!
I tested the changeset manually and hit the following exception when placing the mouse pointer on graph edges to show their tooltips: [INFO] java.lang.ClassCastException: class com.sun.hotspot.igv.view.widgets.InputSlotWidget cannot be cast to class com.sun.hotspot.igv.graph.Figure (com.sun.hotspot.igv.view.widgets.InputSlotWidget is in unnamed module of loader org.netbeans.StandardModule$OneModuleClassLoader @17b716f7; com.sun.hotspot.igv.graph.Figure is in unnamed module of loader org.netbeans.StandardModule$OneModuleClassLoader @6fcf432a) [INFO] at com.sun.hotspot.igv.view.widgets.LineWidget$1.select(LineWidget.java:142) [INFO] at org.netbeans.modules.visual.action.SelectAction.mouseReleased(SelectAction.java:86) [INFO] at org.netbeans.api.visual.widget.SceneComponent$Operator$3.operate(SceneComponent.java:535) [INFO] at org.netbeans.api.visual.widget.SceneComponent.processLocationOperator(SceneComponent.java:250) [INFO] at org.netbeans.api.visual.widget.SceneComponent.mouseReleased(SceneComponent.java:137) [INFO] at java.desktop/java.awt.AWTEventMulticaster.mouseReleased(AWTEventMulticaster.java:297) [INFO] at java.desktop/java.awt.Component.processMouseEvent(Component.java:6635) [INFO] at java.desktop/javax.swing.JComponent.processMouseEvent(JComponent.java:3342) [INFO] at java.desktop/java.awt.Component.processEvent(Component.java:6400) [INFO] at java.desktop/java.awt.Container.processEvent(Container.java:2263) [INFO] at java.desktop/java.awt.Component.dispatchEventImpl(Component.java:5011) [INFO] at java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2321) [INFO] at java.desktop/java.awt.Component.dispatchEvent(Component.java:4843) [INFO] at java.desktop/java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4918) [INFO] at java.desktop/java.awt.LightweightDispatcher.processMouseEvent(Container.java:4547) [INFO] at java.desktop/java.awt.LightweightDispatcher.dispatchEvent(Container.java:4488) [INFO] at 
java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2307) [INFO] at java.desktop/java.awt.Window.dispatchEventImpl(Window.java:2772) [INFO] at java.desktop/java.awt.Component.dispatchEvent(Component.java:4843) [INFO] at java.desktop/java.awt.EventQueue.dispatchEventImpl(EventQueue.java:772) [INFO] at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:721) [INFO] at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:715) [INFO] at java.base/java.security.AccessController.doPrivileged(Native Method) [INFO] at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85) [INFO] at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:95) [INFO] at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:745) [INFO] at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:743) [INFO] at java.base/java.security.AccessController.doPrivileged(Native Method) [INFO] at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85) [INFO] at java.desktop/java.awt.EventQueue.dispatchEvent(EventQueue.java:742) [INFO] at org.netbeans.core.TimableEventQueue.dispatchEvent(TimableEventQueue.java:136) [INFO] [catch] at java.desktop/java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:203) [INFO] at java.desktop/java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:124) [INFO] at java.desktop/java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:113) [INFO] at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:109) [INFO] at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101) [INFO] at java.desktop/java.awt.EventDispatchThread.run(EventDispatchThread.java:90) ------------- Changes requested by rcastanedalo (Reviewer). 
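The root cause visible in the trace is an unconditional cast in `LineWidget`'s select handler: the widget under the mouse can be an `InputSlotWidget` rather than a figure. A generic, self-contained reproduction of that failure mode, together with the usual `instanceof` guard — all class names below are stand-ins, not the real IGV/NetBeans types:

```java
// Generic reproduction of the ClassCastException pattern in the stack
// trace above: a selection handler unconditionally casts whatever
// widget was clicked. The types here are illustrative stand-ins.
public class CastGuardSketch {

    static class Widget {}
    static class FigureWidget extends Widget {
        String figureName() { return "figure"; }
    }
    static class InputSlotWidget extends Widget {}  // e.g. an edge endpoint

    // Unsafe: throws ClassCastException for an InputSlotWidget.
    static String selectUnsafe(Widget w) {
        return ((FigureWidget) w).figureName();
    }

    // Safe: check the concrete type before casting.
    static String selectSafe(Widget w) {
        if (w instanceof FigureWidget) {
            return ((FigureWidget) w).figureName();
        }
        return null;  // ignore widgets that do not represent figures
    }

    public static void main(String[] args) {
        Widget slot = new InputSlotWidget();
        boolean threw = false;
        try {
            selectUnsafe(slot);
        } catch (ClassCastException e) {
            threw = true;
        }
        if (!threw) throw new AssertionError("expected ClassCastException");
        if (selectSafe(slot) != null) throw new AssertionError();
        if (!"figure".equals(selectSafe(new FigureWidget()))) throw new AssertionError();
    }
}
```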
PR: https://git.openjdk.org/jdk/pull/10197 From kvn at openjdk.org Tue Oct 4 17:28:51 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 4 Oct 2022 17:28:51 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v4] In-Reply-To: References: Message-ID: On Sat, 1 Oct 2022 10:22:42 GMT, Quan Anh Mai wrote: >> Hi, >> >> The current peephole mechanism has several drawbacks: >> - Can only match and remove adjacent instructions. >> - Cannot match machine ideal nodes (e.g MachSpillCopyNode). >> - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. >> - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. >> >> The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grain manner. >> >> The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: >> >> mov r1, r2 -> lea r1, [r2 + r3/i] >> add r1, r3/i >> >> and >> >> mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 >> shl r1, i >> >> On the added benchmarks, the transformations show positive results: >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 1200.490 ? 104.662 ns/op >> LeaPeephole.B_D_long avgt 5 1211.439 ? 30.196 ns/op >> LeaPeephole.B_I_int avgt 5 1118.831 ? 7.995 ns/op >> LeaPeephole.B_I_long avgt 5 1112.389 ? 15.838 ns/op >> LeaPeephole.I_S_int avgt 5 1262.528 ? 7.293 ns/op >> LeaPeephole.I_S_long avgt 5 1223.820 ? 17.777 ns/op >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 860.889 ? 6.089 ns/op >> LeaPeephole.B_D_long avgt 5 945.455 ? 21.603 ns/op >> LeaPeephole.B_I_int avgt 5 849.109 ? 
9.809 ns/op >> LeaPeephole.B_I_long avgt 5 851.283 ? 16.921 ns/op >> LeaPeephole.I_S_int avgt 5 976.594 ? 23.004 ns/op >> LeaPeephole.I_S_long avgt 5 936.984 ? 9.601 ns/op >> >> A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > check index Testing passed. You need second review. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/8025 From kvn at openjdk.org Tue Oct 4 18:45:45 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 4 Oct 2022 18:45:45 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v8] In-Reply-To: References: Message-ID: <-6HCuDDu6zPmaBcrCYEHUlve0O2htHfxviLrjy5tKcY=.1d517cd7-a5a5-4d0c-a388-f36c7e9fc14c@github.com> On Mon, 3 Oct 2022 16:12:37 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. >> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. 
As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > limit tests Testing results are good. You need second review. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9947 From bulasevich at openjdk.org Tue Oct 4 20:31:44 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Tue, 4 Oct 2022 20:31:44 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v4] In-Reply-To: References: Message-ID: <3kOvEAlksouNjqXDcn3XNuJj97kx3uhj8UzlmZIYq_o=.517b466d-4577-4c11-b5d9-7709176136cf@github.com> > The nmethod "scopes data" section is 10% of the size of nmethod. 
Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. > > Testing: jtreg hotspot&jdk, Renaissance benchmarks Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: cleanup ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10025/files - new: https://git.openjdk.org/jdk/pull/10025/files/fa82262c..c2054359 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=02-03 Stats: 20 lines in 3 files changed: 0 ins; 0 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/10025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10025/head:pull/10025 PR: https://git.openjdk.org/jdk/pull/10025 From qamai at openjdk.org Wed Oct 5 00:29:28 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 5 Oct 2022 00:29:28 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v4] In-Reply-To: References: Message-ID: On Sat, 1 Oct 2022 10:22:42 GMT, Quan Anh Mai wrote: >> Hi, >> >> The current peephole mechanism has several drawbacks: >> - Can only match and remove adjacent instructions. >> - Cannot match machine ideal nodes (e.g MachSpillCopyNode). >> - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. >> - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. >> >> The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grain manner. 
>> >> The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: >> >> mov r1, r2 -> lea r1, [r2 + r3/i] >> add r1, r3/i >> >> and >> >> mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 >> shl r1, i >> >> On the added benchmarks, the transformations show positive results: >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 1200.490 ? 104.662 ns/op >> LeaPeephole.B_D_long avgt 5 1211.439 ? 30.196 ns/op >> LeaPeephole.B_I_int avgt 5 1118.831 ? 7.995 ns/op >> LeaPeephole.B_I_long avgt 5 1112.389 ? 15.838 ns/op >> LeaPeephole.I_S_int avgt 5 1262.528 ? 7.293 ns/op >> LeaPeephole.I_S_long avgt 5 1223.820 ? 17.777 ns/op >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 860.889 ? 6.089 ns/op >> LeaPeephole.B_D_long avgt 5 945.455 ? 21.603 ns/op >> LeaPeephole.B_I_int avgt 5 849.109 ? 9.809 ns/op >> LeaPeephole.B_I_long avgt 5 851.283 ? 16.921 ns/op >> LeaPeephole.I_S_int avgt 5 976.594 ? 23.004 ns/op >> LeaPeephole.I_S_long avgt 5 936.984 ? 9.601 ns/op >> >> A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > check index That's great, thanks very much for your review. ------------- PR: https://git.openjdk.org/jdk/pull/8025 From qamai at openjdk.org Wed Oct 5 00:30:37 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 5 Oct 2022 00:30:37 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v8] In-Reply-To: References: Message-ID: On Mon, 3 Oct 2022 16:12:37 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. 
I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. >> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant may overflow and we need to add back the dividend as in the `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned division, the situation is somewhat trickier because the condition needs to be twice as strong (conditions (1) and (2) above are mostly the same). As a result, the magic constant `c`, calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr., may overflow a uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows a uint64 or `(x + 1) * c2` never overflows a uint64. Either way, we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication, so we do some basic transformations to achieve a computable formula. I have written the details as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and have some reviews. Thank you very much.
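The `floor(x / d) = floor(x * c / 2**m)` identity quoted above can be sketched for a concrete divisor. The snippet below uses the simple round-up construction `c = ceil(2**m / d)` with `d = 7` and `m = 35`; these constants are illustrative, not necessarily the ones C2 emits, and `BigInteger` stands in for the wide product so the sketch sidesteps the overflow question discussed in the patch:

```java
import java.math.BigInteger;

// Sketch of the divide-by-constant-to-multiply transformation: for d = 7 and
// m = 35 (32 input bits + ceil(log2(7))), c = ceil(2^m / d) satisfies
// floor(x / 7) == (x * c) >> m for all unsigned 32-bit x.
public class MagicDivideSketch {
    static final long D = 7;
    static final int  M = 35;
    static final BigInteger C =                               // ceil(2^M / D)
            BigInteger.ONE.shiftLeft(M)
                          .add(BigInteger.valueOf(D - 1))
                          .divide(BigInteger.valueOf(D));

    // floor(x / 7) for unsigned 32-bit x, computed as (x * C) >> M.
    static long divideBy7(long x) {
        return BigInteger.valueOf(x).multiply(C).shiftRight(M).longValueExact();
    }

    public static void main(String[] args) {
        long[] samples = {0, 1, 6, 7, 8, 1_000_000, 0xFFFF_FFFFL};
        for (long x : samples) {
            long want = x / D;                                // reference: ordinary division
            long got  = divideBy7(x);
            if (want != got) throw new AssertionError(x + ": " + want + " != " + got);
        }
        System.out.println("ok");
    }
}
```

For 32-bit inputs the product `x * c` fits in 64 bits, as the description notes, so a real implementation would use a 64-bit multiply plus shift rather than `BigInteger`.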
> > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > limit tests Thanks for your review. ------------- PR: https://git.openjdk.org/jdk/pull/9947 From epeter at openjdk.org Wed Oct 5 08:50:02 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Oct 2022 08:50:02 GMT Subject: RFR: 8294839: Disable StressLongCountedLoop in compiler/loopopts/TestRemoveEmptyLoop.java Message-ID: As I have explained in [JDK-8294838](https://bugs.openjdk.org/browse/JDK-8294838), `-XX:StressLongCountedLoop=200000000` leads to a timeout in `compiler/loopopts/TestRemoveEmptyLoop.java` because an int-loop is transformed into a long-loop, but the loop is not collapsed/removed because we only remove empty int-loops. This is a sub-task of [JDK-8294838](https://bugs.openjdk.org/browse/JDK-8294838), where we hope to implement empty loop removal for LongCountedLoops. Manually tested the test with and without the flag. Running more tests now. ------------- Commit messages: - 8294839: Disable StressLongCountedLoop in compiler/loopopts/TestRemoveEmptyLoop.java Changes: https://git.openjdk.org/jdk/pull/10569/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10569&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8294839 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10569.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10569/head:pull/10569 PR: https://git.openjdk.org/jdk/pull/10569 From thartmann at openjdk.org Wed Oct 5 09:16:34 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 5 Oct 2022 09:16:34 GMT Subject: RFR: 8294839: Disable StressLongCountedLoop in compiler/loopopts/TestRemoveEmptyLoop.java In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 08:31:56 GMT, Emanuel Peter wrote: > As I have explained in [JDK-8294838](https://bugs.openjdk.org/browse/JDK-8294838), `-XX:StressLongCountedLoop=200000000` leads to a timeout in 
`compiler/loopopts/TestRemoveEmptyLoop.java` because an int-loop is transformed into a long-loop, but the loop is not collapsed/removed because we only remove empty int-loops. > > This is a sub-task of [JDK-8294838](https://bugs.openjdk.org/browse/JDK-8294838), where we hope to implement empty loop removal for LongCountedLoops. > > Manually tested the test with and without the flag. > Running more tests now. Looks good. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10569 From rrich at openjdk.org Wed Oct 5 09:36:23 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 5 Oct 2022 09:36:23 GMT Subject: RFR: 8294514: Wrong initialization of nmethod::_consts_offset for native nmethods In-Reply-To: References: Message-ID: <1k-9c1wufnuwxBC1YdmUxs_WuiEFRuKJSZ5kBoGSrq8=.9c055e89-ea7d-4145-9ea3-07434ae9c85d@github.com> On Thu, 29 Sep 2022 18:03:36 GMT, Vladimir Kozlov wrote: >> Hi, >> >> this small fix copies the initialization of `nmethod::_consts_offset` from the nmethod constructor for c1/c2 compiled nmethods to the constructor for native nmethods. >> >> Manual testing: >> >> * I built commit 86097e20df0652c2f0c6865f0ec62e7989db45ca and reproduced the issue on x86_64 as described in the JBS-bug >> * Then I build commit 5ce7741ffd9323909b7424255d696525db3d01d2 and found that I could not reproduce the issue. I also verified that on PPC64 the constants section of the continuation enter intrinsic is printed now (-XX:+PrintAssembly). >> >> The fix passed our CI testing: most JCK and JTREG test, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. >> >> Thanks, Richard. > > Good. 
Thanks for the reviews @vnkozlov and @dean-long ------------- PR: https://git.openjdk.org/jdk/pull/10482 From rrich at openjdk.org Wed Oct 5 09:36:24 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 5 Oct 2022 09:36:24 GMT Subject: Integrated: 8294514: Wrong initialization of nmethod::_consts_offset for native nmethods In-Reply-To: References: Message-ID: On Thu, 29 Sep 2022 08:16:27 GMT, Richard Reingruber wrote: > Hi, > > this small fix copies the initialization of `nmethod::_consts_offset` from the nmethod constructor for c1/c2 compiled nmethods to the constructor for native nmethods. > > Manual testing: > > * I built commit 86097e20df0652c2f0c6865f0ec62e7989db45ca and reproduced the issue on x86_64 as described in the JBS-bug > * Then I build commit 5ce7741ffd9323909b7424255d696525db3d01d2 and found that I could not reproduce the issue. I also verified that on PPC64 the constants section of the continuation enter intrinsic is printed now (-XX:+PrintAssembly). > > The fix passed our CI testing: most JCK and JTREG test, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. > > Thanks, Richard. This pull request has now been integrated. 
Changeset: b4e74aea Author: Richard Reingruber URL: https://git.openjdk.org/jdk/commit/b4e74aeabfd41ee76b6bf8b779c1741b30b6f438 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8294514: Wrong initialization of nmethod::_consts_offset for native nmethods Reviewed-by: kvn, dlong ------------- PR: https://git.openjdk.org/jdk/pull/10482 From rcastanedalo at openjdk.org Wed Oct 5 09:47:35 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 5 Oct 2022 09:47:35 GMT Subject: Integrated: 8290964: C2 compilation fails with assert "non-reduction loop contains reduction nodes" In-Reply-To: References: Message-ID: On Mon, 3 Oct 2022 13:41:07 GMT, Roberto Casta?eda Lozano wrote: > This changeset removes the [reduction information consistency assertion](https://github.com/openjdk/jdk/blob/46633e644a8ab94ceb75803bd40739214f8a60e8/src/hotspot/share/opto/superword.cpp#L2458-L2459) in `SuperWord::output()`, which has proven to report too many false positives (inconsistencies that do not lead to miscompilation) since its introduction by [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622), despite the efforts to reduce the false positive rate in [JDK-8286177](https://bugs.openjdk.org/browse/JDK-8286177). During the time the assertion has been enabled in our internal CI system, no true positive case (reported inconsistencies actually leading to a miscompilation or a crash) has been observed. > > An alternative solution would be to wait for [JDK-8287087](https://bugs.openjdk.org/browse/JDK-8287087) (work in progress), which proposes a refactoring of the reduction analysis logic that eliminates by construction the need for this assertion. This changeset proposes removing the assertion earlier, to reduce noise in test environments. > > #### Testing > > - hs-tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64). This pull request has now been integrated. 
Changeset: 4bdd1c91 Author: Roberto Casta?eda Lozano URL: https://git.openjdk.org/jdk/commit/4bdd1c914859e221c64208d47ef309d463609953 Stats: 21 lines in 3 files changed: 0 ins; 21 del; 0 mod 8290964: C2 compilation fails with assert "non-reduction loop contains reduction nodes" Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/10535 From chagedorn at openjdk.org Wed Oct 5 09:55:41 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 5 Oct 2022 09:55:41 GMT Subject: RFR: 8294839: Disable StressLongCountedLoop in compiler/loopopts/TestRemoveEmptyLoop.java In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 08:31:56 GMT, Emanuel Peter wrote: > As I have explained in [JDK-8294838](https://bugs.openjdk.org/browse/JDK-8294838), `-XX:StressLongCountedLoop=200000000` leads to a timeout in `compiler/loopopts/TestRemoveEmptyLoop.java` because an int-loop is transformed into a long-loop, but the loop is not collapsed/removed because we only remove empty int-loops. > > This is a sub-task of [JDK-8294838](https://bugs.openjdk.org/browse/JDK-8294838), where we hope to implement empty loop removal for LongCountedLoops. > > Manually tested the test with and without the flag. > Running more tests now. Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10569 From richard.reingruber at sap.com Wed Oct 5 09:59:16 2022 From: richard.reingruber at sap.com (Reingruber, Richard) Date: Wed, 5 Oct 2022 09:59:16 +0000 Subject: Maybe set -XX:+VerifyContinuations in tests? Message-ID: Hi, would it make sense to set -XX:+VerifyContinuations for all tests in test/jdk/jdk/internal/vm/Continuation/? I'd think so. At least the tests Basic.java and Fuzz.java together with VerifyContinuations brought up issue with the ppc loom port. The test HumongousStack.java already does it. Also in the loom repository VerifyContinuations is switched on by default. 
Is exporting _JAVA_OPTIONS=-XX:+VerifyContinuations the only way to do it now? Thanks, Richard. From rrich at openjdk.org Wed Oct 5 10:12:21 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 5 Oct 2022 10:12:21 GMT Subject: RFR: 8289925: Shared code shouldn't reference the platform specific method frame::interpreter_frame_last_sp() [v5] In-Reply-To: References: Message-ID: > The method `frame::interpreter_frame_last_sp()` is a platform method in the sense that it is not declared in a shared header file. It is declared and defined on some platforms though (x86 and aarch64 I think). > > `frame::interpreter_frame_last_sp()` existed on these platforms before vm continuations (aka loom). Shared code that is part of the vm continuations implementation references it. This breaks the platform abstraction. > > Using unextended_sp is problematic too because there are no guarantees by the platform abstraction layer for it. In fact unextended_sp < sp is possible on ppc64 and aarch64. > > This fix changes the callers of is_sp_in_continuation() > > ```c++ > static inline bool is_sp_in_continuation(const ContinuationEntry* entry, intptr_t* const sp) { > return entry->entry_sp() > sp; > } > > > to pass the actual sp. This is correct because the following is true on all platforms: > > ```c++ > a.sp() > E->entry_sp() > b.sp() > c.sp() > > > where `a`, `b`, `c` are stack frames in call order and `E` is a ContinuationEntry. `a` is the caller frame of the continuation entry frame that corresponds to `E`. > > is_sp_in_continuation() will then return true for `b.sp()` and `c.sp()` and false for `a.sp()` > > Testing: hotspot_loom and jdk_loom on x86_64 and aarch64. Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains six additional commits since the last revision: - Merge branch 'master' - Remove `Unimplemented` definitions of interpreter_frame_last_sp - Only pass the actual sp when calling is_sp_in_continuation() - Merge branch 'master' - Merge branch 'master' - Remove platform dependent method interpreter_frame_last_sp() from shared code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9411/files - new: https://git.openjdk.org/jdk/pull/9411/files/14c97290..f49ecf54 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9411&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9411&range=03-04 Stats: 41961 lines in 1346 files changed: 21873 ins; 13281 del; 6807 mod Patch: https://git.openjdk.org/jdk/pull/9411.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9411/head:pull/9411 PR: https://git.openjdk.org/jdk/pull/9411 From shade at redhat.com Wed Oct 5 10:20:14 2022 From: shade at redhat.com (Aleksey Shipilev) Date: Wed, 5 Oct 2022 12:20:14 +0200 Subject: Maybe set -XX:+VerifyContinuations in tests? In-Reply-To: References: Message-ID: On 10/5/22 11:59, Reingruber, Richard wrote: > would it make sense to set -XX:+VerifyContinuations for all tests in > test/jdk/jdk/internal/vm/Continuation/? In my experience, this would make the test rather long, so I'd rather avoid this. > Is exporting _JAVA_OPTIONS=-XX:+VerifyContinuations the only way to do it now? 
The usual way is to pass options to "make test": $ make test TEST="jdk_loom hotspot_loom" TEST_VM_OPTS="-XX:+VerifyContinuations" -- Thanks, -Aleksey From duke at openjdk.org Wed Oct 5 11:35:24 2022 From: duke at openjdk.org (Sacha Coppey) Date: Wed, 5 Oct 2022 11:35:24 GMT Subject: RFR: 8290154: [JVMCI] partially implement JVMCI for RISC-V [v10] In-Reply-To: <01N2Slfoz83bKVvbH3Ja0O0cOI-rcagrV6jeIdi3dws=.4cce1f7e-2223-4013-bb11-8319aef46444@github.com> References: <01N2Slfoz83bKVvbH3Ja0O0cOI-rcagrV6jeIdi3dws=.4cce1f7e-2223-4013-bb11-8319aef46444@github.com> Message-ID: On Wed, 14 Sep 2022 09:39:07 GMT, Sacha Coppey wrote: >> This patch adds a partial JVMCI implementation for RISC-V, to allow using the GraalVM Native Image RISC-V LLVM backend, which does not use JVMCI for code emission. >> It creates the jdk.vm.ci.riscv64 and jdk.vm.ci.hotspot.riscv64 packages, as well as implements a part of jvmciCodeInstaller_riscv64.cpp. To check for correctness, it enables JVMCI code installation tests on RISC-V. More testing is performed in Native Image. > > Sacha Coppey has updated the pull request incrementally with one additional commit since the last revision: > > Remove noinline attribute by fixing sign extended value Hello, this PR has been stuck for some time now. What should I do to proceed? ------------- PR: https://git.openjdk.org/jdk/pull/9587 From rkennke at openjdk.org Wed Oct 5 11:45:11 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Wed, 5 Oct 2022 11:45:11 GMT Subject: RFR: 8292082: Deprecate UseRTM* for removal In-Reply-To: References: Message-ID: On Tue, 9 Aug 2022 15:59:21 GMT, Roman Kennke wrote: > HotSpot supports RTM (restricted transactional memory) to be used for locking and deoptimization. RTM has since been disabled in Intel processors due to security vulnerabilities [0] and IBM removed support for it since its Power 10 line. RTM adds unnecessarily to complexity and maintenance burden. 
> > I would like to propose to deprecate the relevant flags for removal, and actually remove the flags and all related code in a later release, unless somebody comes up with a good reason and performance comparison to show that it's worth keeping. > > [0] https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions#History_and_bugs > > Testing: > - [x] runtime/CommandLine/VMDeprecatedOptions.java > - [x] tier1 Is there interest in getting rid of RTM locking, or should I close this PR as won't-fix at this point? ------------- PR: https://git.openjdk.org/jdk/pull/9810 From dnsimon at openjdk.org Wed Oct 5 12:04:10 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 5 Oct 2022 12:04:10 GMT Subject: RFR: 8290154: [JVMCI] partially implement JVMCI for RISC-V [v7] In-Reply-To: References: <5AOnVFRDCk63egnuw1HsPUgB1N9E-YuygeIJZV9REZQ=.0118fb0e-184a-4580-ac3d-88238f9ca5a8@github.com> Message-ID: <69doshv-kwhZYUPEx6u98FZOse3rkizVg2WZkeML8Kc=.16884845-c9b3-4927-971d-0a3e6ba15aef@github.com> On Wed, 24 Aug 2022 11:33:01 GMT, Fei Yang wrote: >>> Do you have details about testing performed in Native Image as mentioned in the PR description? >> >> Yes, the RISC-V LLVM backend for Native Image passes 99% of the tests performed, which is similar to the other LLVM backends. >> >>> I see you added more changes in hotspot file sharedRuntime_riscv.cpp guarded by macro INCLUDE_JVMCI. Searching for INCLUDE_JVMCI or COMPILER2_OR_JVMCI in src/hotspot/cpu/aarch64, I see several more places checking for these macros. Have you checked if we need similar changes for your use case?
>> >> When the method is inlined, the `if (trap_request < 0)` check behaves incorrectly when the `nmethod` is compiled by JVMCI. Even though the boolean is true, the function returns -11 instead of -1, and the `if (unloaded_class_index >= 0)` checks have the same issue, causing an access to an illegal index of an array. I am not sure why this happens, as it works correctly for method not compiled by JVMCI. > >> > I see you added more changes in hotspot file sharedRuntime_riscv.cpp guarded by macro INCLUDE_JVMCI. Searching for INCLUDE_JVMCI or COMPILER2_OR_JVMCI in src/hotspot/cpu/aarch64, I see several more places checking for these macros. Have you checked if we need similar changes for your use case? >> >> I first added the changes for all places where those macros are used, but since only modifying sharedRuntime_riscv.cpp was enough to make the tests pass, I did not wanted to add code that I was not sure was useful at the moment. > > Well, that sounds fragile to me since you are depending on a relatively small set of JTreg tests here. I think an analysis is needed here to be sure about whether those are really needed or not. @RealFYang are you ok with deferring further changes to a future RFE? ------------- PR: https://git.openjdk.org/jdk/pull/9587 From dholmes at openjdk.org Wed Oct 5 12:17:08 2022 From: dholmes at openjdk.org (David Holmes) Date: Wed, 5 Oct 2022 12:17:08 GMT Subject: RFR: 8292082: Deprecate UseRTM* for removal In-Reply-To: References: Message-ID: On Tue, 9 Aug 2022 15:59:21 GMT, Roman Kennke wrote: > HotSpot supports RTM (restricted transactional memory) to be used for locking and deoptimization. RTM has since been disabled in Intel processors due to security vulnerabilities [0] and IBM removed support for it since its Power 10 line. RTM adds unnecessarily to complexity and maintenance burden. 
> > I would like to propose to deprecate the relevant flags for removal, and actually remove the flags and all related code in a later release, unless somebody comes up with a good reason and performance comparison to show that it's worth keeping. > > [0] https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions#History_and_bugs > > Testing: > - [x] runtime/CommandLine/VMDeprecatedOptions.java > - [x] tier1 As this still appears to be an active and useful feature there needs to be a strong motivation for removing it. ------------- PR: https://git.openjdk.org/jdk/pull/9810 From richard.reingruber at sap.com Wed Oct 5 13:59:01 2022 From: richard.reingruber at sap.com (Reingruber, Richard) Date: Wed, 5 Oct 2022 13:59:01 +0000 Subject: Maybe set -XX:+VerifyContinuations in tests? In-Reply-To: References: Message-ID: > On 10/5/22 11:59, Reingruber, Richard wrote: > > would it make sense to set -XX:+VerifyContinuations for all tests in > > test/jdk/jdk/internal/vm/Continuation/? > In my experience, this would make the test rather long, so I'd rather avoid this. I've tested Fuzz.java#default with fastdebug. Durations varied quite a bit: -XX:-VerifyContinuations (default) Minimum duration: 1m35s Maximum duration: 2m13s -XX:+VerifyContinuations Minimum duration: 2m14s Maximum duration: 3m48s on an older "Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz" Maybe it would be ok to just set it for Fuzz.java#default. On the other hand it is probably sufficient to have dedicated runs with -XX:+VerifyContinuations. > > Is exporting _JAVA_OPTIONS=-XX:+VerifyContinuations the only way to do it now? > The usual way is to pass options to "make test": > $ make test TEST="jdk_loom hotspot_loom" TEST_VM_OPTS="-XX:+VerifyContinuations" Right, thanks. I was confused and assumed that this wouldn't work with the `/othervm` option (even though I used it myself a while ago to change gc). Thanks, Richard. 
________________________________ From: Aleksey Shipilev Sent: Wednesday, October 5, 2022 12:20 To: Reingruber, Richard ; hotspot-compiler-dev at openjdk.java.net Subject: Re: Maybe set -XX:+VerifyContinuations in tests? On 10/5/22 11:59, Reingruber, Richard wrote: > would it make sense to set -XX:+VerifyContinuations for all tests in > test/jdk/jdk/internal/vm/Continuation/? In my experience, this would make the test rather long, so I'd rather avoid this. > Is exporting _JAVA_OPTIONS=-XX:+VerifyContinuations the only way to do it now? The usual way is to pass options to "make test": $ make test TEST="jdk_loom hotspot_loom" TEST_VM_OPTS="-XX:+VerifyContinuations" -- Thanks, -Aleksey -------------- next part -------------- An HTML attachment was scrubbed... URL: From rrich at openjdk.org Wed Oct 5 14:15:30 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 5 Oct 2022 14:15:30 GMT Subject: Integrated: 8289925: Shared code shouldn't reference the platform specific method frame::interpreter_frame_last_sp() In-Reply-To: References: Message-ID: On Thu, 7 Jul 2022 16:02:19 GMT, Richard Reingruber wrote: > The method `frame::interpreter_frame_last_sp()` is a platform method in the sense that it is not declared in a shared header file. It is declared and defined on some platforms though (x86 and aarch64 I think). > > `frame::interpreter_frame_last_sp()` existed on these platforms before vm continuations (aka loom). Shared code that is part of the vm continuations implementation references it. This breaks the platform abstraction. > > Using unextended_sp is problematic too because there are no guarantees by the platform abstraction layer for it. In fact unextended_sp < sp is possible on ppc64 and aarch64. > > This fix changes the callers of is_sp_in_continuation() > > ```c++ > static inline bool is_sp_in_continuation(const ContinuationEntry* entry, intptr_t* const sp) { > return entry->entry_sp() > sp; > } > > > to pass the actual sp. 
This is correct because the following is true on all platforms: > > ```c++ > a.sp() > E->entry_sp() > b.sp() > c.sp() > > > where `a`, `b`, `c` are stack frames in call order and `E` is a ContinuationEntry. `a` is the caller frame of the continuation entry frame that corresponds to `E`. > > is_sp_in_continuation() will then return true for `b.sp()` and `c.sp()` and false for `a.sp()` > > Testing: hotspot_loom and jdk_loom on x86_64 and aarch64. This pull request has now been integrated. Changeset: ee6c3917 Author: Richard Reingruber URL: https://git.openjdk.org/jdk/commit/ee6c39175bc47608282c52c575ce908399349e7c Stats: 31 lines in 7 files changed: 5 ins; 22 del; 4 mod 8289925: Shared code shouldn't reference the platform specific method frame::interpreter_frame_last_sp() Reviewed-by: eosterlund, dlong ------------- PR: https://git.openjdk.org/jdk/pull/9411 From kvn at openjdk.org Wed Oct 5 16:29:17 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 5 Oct 2022 16:29:17 GMT Subject: RFR: 8292082: Deprecate UseRTM* for removal In-Reply-To: References: Message-ID: <8iHwQuxxDk3yIiutiD8L7OD_lCdDkxiQpJNRPTwbzp8=.8549ee2b-39a3-409b-8dcc-5e1f5df6fdaa@github.com> On Wed, 5 Oct 2022 11:41:46 GMT, Roman Kennke wrote: > Is there interest in getting rid of RTM locking, or should I close this PR as won't-fix at this point? Close PR. ------------- PR: https://git.openjdk.org/jdk/pull/9810 From rkennke at openjdk.org Wed Oct 5 16:43:24 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Wed, 5 Oct 2022 16:43:24 GMT Subject: RFR: 8292082: Deprecate UseRTM* for removal In-Reply-To: References: Message-ID: <90DoIIUnU1W3qMXrCM2nmiHZKyQyamTOvFhp9A61D8g=.6b700db7-cff3-4a46-9137-6ba25eff27a4@github.com> On Tue, 9 Aug 2022 15:59:21 GMT, Roman Kennke wrote: > HotSpot supports RTM (restricted transactional memory) to be used for locking and deoptimization. 
RTM has since been disabled in Intel processors due to security vulnerabilities [0] and IBM removed support for it since its Power 10 line. RTM adds unnecessarily to complexity and maintenance burden. > > I would like to propose to deprecate the relevant flags for removal, and actually remove the flags and all related code in a later release, unless somebody comes up with a good reason and performance comparison to show that it's worth keeping. > > [0] https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions#History_and_bugs > > Testing: > - [x] runtime/CommandLine/VMDeprecatedOptions.java > - [x] tier1 Closing, will keep RTM locking for the time being. ------------- PR: https://git.openjdk.org/jdk/pull/9810 From rkennke at openjdk.org Wed Oct 5 16:43:25 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Wed, 5 Oct 2022 16:43:25 GMT Subject: Withdrawn: 8292082: Deprecate UseRTM* for removal In-Reply-To: References: Message-ID: On Tue, 9 Aug 2022 15:59:21 GMT, Roman Kennke wrote: > HotSpot supports RTM (restricted transactional memory) to be used for locking and deoptimization. RTM has since been disabled in Intel processors due to security vulnerabilities [0] and IBM removed support for it since its Power 10 line. RTM adds unnecessarily to complexity and maintenance burden. > > I would like to propose to deprecate the relevant flags for removal, and actually remove the flags and all related code in a later release, unless somebody comes up with a good reason and performance comparison to show that it's worth keeping. > > [0] https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions#History_and_bugs > > Testing: > - [x] runtime/CommandLine/VMDeprecatedOptions.java > - [x] tier1 This pull request has been closed without being integrated. 
------------- PR: https://git.openjdk.org/jdk/pull/9810 From qamai at openjdk.org Wed Oct 5 16:57:51 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 5 Oct 2022 16:57:51 GMT Subject: RFR: 8294865: x86: Improve the code generation of MulVB and MulVL Message-ID: Hi, This patch simplifies and improves the code generation of `MulVB` and `MulVL` nodes, - MulVB can be implemented by alternating `vmullw` on odd and even-index elements and combining the results. - MulVL can be implemented on non-avx512dq by computing the product of each 32-bit half and adding the results together. Vector API benchmark shows the results of `MUL` operations: Before After Benchmark (size) Mode Cnt Score Error Score Error Units Change Byte64Vector.MUL 1024 thrpt 15 8948.607 ? 194.646 8860.404 ? 203.109 ops/ms -0.99% Byte128Vector.MUL 1024 thrpt 15 12915.839 ? 291.262 13554.662 ? 488.695 ops/ms +4.95% Byte256Vector.MUL 1024 thrpt 15 12129.959 ? 245.710 23279.276 ? 669.725 ops/ms +91.92% Long128Vector.MUL 1024 thrpt 15 1183.663 ? 36.440 1489.892 ? 35.356 ops/ms +25.87% Long256Vector.MUL 1024 thrpt 15 1911.802 ? 95.304 2834.088 ? 77.647 ops/ms +48.24% Please have a look and have some reviews, thank you very much. 
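As background for the MulVL bullet above: splitting each 64-bit lane into 32-bit halves works because the high*high partial product vanishes modulo 2^64. A scalar sketch of the identity (illustration only — the patch of course emits vector instructions, not this code):

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the per-lane work for a 64-bit multiply built from 32-bit
// half-products. The a_hi * b_hi term is shifted entirely out of range, so
// modulo 2^64 only three partial products remain, and the two cross terms
// contribute only their low 32 bits.
uint64_t mul64_via_32bit_halves(uint64_t a, uint64_t b) {
  uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
  uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
  uint64_t lo_lo = a_lo * b_lo;                // full 32x32 -> 64 product
  uint64_t cross = a_lo * b_hi + a_hi * b_lo;  // only low 32 bits survive
  return lo_lo + (cross << 32);                // == a * b (mod 2^64)
}
```

Any wrap-around in `cross` is harmless, since the shift keeps only its low 32 bits.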
------------- Commit messages: - add vmulB for 8 bytes - Merge branch 'master' into improveMulVB - Merge branch 'master' into improveMulVB - Merge branch 'master' into improveMulVB - fix - mulV Changes: https://git.openjdk.org/jdk/pull/10571/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10571&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8294865 Stats: 161 lines in 5 files changed: 11 ins; 62 del; 88 mod Patch: https://git.openjdk.org/jdk/pull/10571.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10571/head:pull/10571 PR: https://git.openjdk.org/jdk/pull/10571 From qamai at openjdk.org Wed Oct 5 19:57:04 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 5 Oct 2022 19:57:04 GMT Subject: RFR: 8281453: New optimization: convert `~x` into `-1-x` when `~x` is used in an arithmetic expression [v15] In-Reply-To: References: <4mTZu0_hVWb-ztMxMabFilyXAnAqOStCvU9wPmfqCKM=.fa8b7797-6e20-4c9e-80f1-b55ba3d5fe39@github.com> Message-ID: On Thu, 29 Sep 2022 20:52:41 GMT, Zhiqiang Zang wrote: >> Similar to `(~x)+c` -> `(c-1)-x` and `~(x+c)` -> `(-c-1)-x` in #6858, we can also introduce similar optimizations for subtraction, `c-(~x)` -> `x+(c+1)` and `~(c-x)` -> `x+(-c-1)`. >> >> To generalize, I convert `~x` into `-1-x` when `~x` is used only in arithmetic expression. For example, `c-(~x)` will be converted into `c-(-1-x)` which will match other pattern and will be transformed again in next iteration and finally become `x+(c+1)`. >> >> Also the conversion from `~x` into `-1-x` happens when `x` is an arithmetic expression itself. For example, `~(x+c)` will be transformed into `-1-(x+c)` and eventually `(-c-1)-x`. >> >> The results of the microbenchmark are as follows: >> >> Baseline: >> Benchmark Mode Cnt Score Error Units >> NotOpTransformation.baselineInt avgt 60 0.448 ? 0.002 ns/op >> NotOpTransformation.baselineLong avgt 60 0.448 ? 0.002 ns/op >> NotOpTransformation.testInt1 avgt 60 0.615 ? 0.003 ns/op >> NotOpTransformation.testInt2 avgt 60 0.838 ? 
0.004 ns/op >> NotOpTransformation.testLong1 avgt 60 0.671 ? 0.003 ns/op >> NotOpTransformation.testLong2 avgt 60 0.670 ? 0.003 ns/op >> >> Patch: >> Benchmark Mode Cnt Score Error Units >> NotOpTransformation.baselineInt avgt 60 0.451 ? 0.003 ns/op >> NotOpTransformation.baselineLong avgt 60 0.447 ? 0.002 ns/op >> NotOpTransformation.testInt1 avgt 60 0.334 ? 0.002 ns/op >> NotOpTransformation.testInt2 avgt 60 0.334 ? 0.002 ns/op >> NotOpTransformation.testLong1 avgt 60 0.334 ? 0.002 ns/op >> NotOpTransformation.testLong2 avgt 60 0.335 ? 0.002 ns/op > > Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: > > remove one use check for long as well. I think it looks good, thanks. ------------- PR: https://git.openjdk.org/jdk/pull/7376 From kvn at openjdk.org Wed Oct 5 20:15:45 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 5 Oct 2022 20:15:45 GMT Subject: RFR: 8281453: New optimization: convert `~x` into `-1-x` when `~x` is used in an arithmetic expression [v15] In-Reply-To: References: <4mTZu0_hVWb-ztMxMabFilyXAnAqOStCvU9wPmfqCKM=.fa8b7797-6e20-4c9e-80f1-b55ba3d5fe39@github.com> Message-ID: On Thu, 29 Sep 2022 20:52:41 GMT, Zhiqiang Zang wrote: >> Similar to `(~x)+c` -> `(c-1)-x` and `~(x+c)` -> `(-c-1)-x` in #6858, we can also introduce similar optimizations for subtraction, `c-(~x)` -> `x+(c+1)` and `~(c-x)` -> `x+(-c-1)`. >> >> To generalize, I convert `~x` into `-1-x` when `~x` is used only in arithmetic expression. For example, `c-(~x)` will be converted into `c-(-1-x)` which will match other pattern and will be transformed again in next iteration and finally become `x+(c+1)`. >> >> Also the conversion from `~x` into `-1-x` happens when `x` is an arithmetic expression itself. For example, `~(x+c)` will be transformed into `-1-(x+c)` and eventually `(-c-1)-x`. 
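The quoted rewrites all fall out of the two's-complement identity `~x == -1 - x`; a quick scalar check of the identities (illustrative only, unrelated to the C2 implementation itself):

```cpp
#include <cassert>
#include <cstdint>

// In two's complement ~x == -1 - x, so a bitwise-not can be folded into the
// surrounding additions/subtractions. int64_t keeps the sample values used
// below far away from overflow.
void check_not_identities(int64_t x, int64_t c) {
  assert(~x == -1 - x);              // the core rewrite
  assert(~x + c == (c - 1) - x);     // (~x) + c  ->  (c - 1) - x
  assert(c - ~x == x + (c + 1));     // c - (~x)  ->  x + (c + 1)
  assert(~(c - x) == x + (-c - 1));  // ~(c - x)  ->  x + (-c - 1)
  assert(~(x + c) == (-c - 1) - x);  // ~(x + c)  ->  (-c - 1) - x
}
```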
>> >> The results of the microbenchmark are as follows: >> >> Baseline: >> Benchmark Mode Cnt Score Error Units >> NotOpTransformation.baselineInt avgt 60 0.448 ? 0.002 ns/op >> NotOpTransformation.baselineLong avgt 60 0.448 ? 0.002 ns/op >> NotOpTransformation.testInt1 avgt 60 0.615 ? 0.003 ns/op >> NotOpTransformation.testInt2 avgt 60 0.838 ? 0.004 ns/op >> NotOpTransformation.testLong1 avgt 60 0.671 ? 0.003 ns/op >> NotOpTransformation.testLong2 avgt 60 0.670 ? 0.003 ns/op >> >> Patch: >> Benchmark Mode Cnt Score Error Units >> NotOpTransformation.baselineInt avgt 60 0.451 ? 0.003 ns/op >> NotOpTransformation.baselineLong avgt 60 0.447 ? 0.002 ns/op >> NotOpTransformation.testInt1 avgt 60 0.334 ? 0.002 ns/op >> NotOpTransformation.testInt2 avgt 60 0.334 ? 0.002 ns/op >> NotOpTransformation.testLong1 avgt 60 0.334 ? 0.002 ns/op >> NotOpTransformation.testLong2 avgt 60 0.335 ? 0.002 ns/op > > Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: > > remove one use check for long as well. I will test latest version. ------------- PR: https://git.openjdk.org/jdk/pull/7376 From kvn at openjdk.org Wed Oct 5 20:37:58 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 5 Oct 2022 20:37:58 GMT Subject: RFR: 8294865: x86: Improve the code generation of MulVB and MulVL In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 10:33:27 GMT, Quan Anh Mai wrote: > Hi, > > This patch simplifies and improves the code generation of `MulVB` and `MulVL` nodes, > > - MulVB can be implemented by alternating `vmullw` on odd and even-index elements and combining the results. > - MulVL can be implemented on non-avx512dq by computing the product of each 32-bit half and adding the results together. > > Vector API benchmark shows the results of `MUL` operations: > > Before After > Benchmark (size) Mode Cnt Score Error Score Error Units Change > Byte64Vector.MUL 1024 thrpt 15 8948.607 ? 194.646 8860.404 ? 
203.109 ops/ms -0.99% > Byte128Vector.MUL 1024 thrpt 15 12915.839 ? 291.262 13554.662 ? 488.695 ops/ms +4.95% > Byte256Vector.MUL 1024 thrpt 15 12129.959 ? 245.710 23279.276 ? 669.725 ops/ms +91.92% > Long128Vector.MUL 1024 thrpt 15 1183.663 ? 36.440 1489.892 ? 35.356 ops/ms +25.87% > Long256Vector.MUL 1024 thrpt 15 1911.802 ? 95.304 2834.088 ? 77.647 ops/ms +48.24% > > Please have a look and have some reviews, thank you very much. I submitted testing. ------------- PR: https://git.openjdk.org/jdk/pull/10571 From jrose at openjdk.org Wed Oct 5 20:55:21 2022 From: jrose at openjdk.org (John R Rose) Date: Wed, 5 Oct 2022 20:55:21 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v8] In-Reply-To: References: Message-ID: On Mon, 3 Oct 2022 16:12:37 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. >> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). 
This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > limit tests src/hotspot/share/opto/divnode.cpp line 89: > 87: //---------------magic_int_unsigned_divide_constants_down---------------------- > 88: // Compute magic multiplier and shift constant for converting a 32 bit divide > 89: // by constant into a multiply/add/shift series. Return false if calculations delete "return false" comment everywhere ------------- PR: https://git.openjdk.org/jdk/pull/9947 From qamai at openjdk.org Wed Oct 5 21:05:41 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 5 Oct 2022 21:05:41 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v9] In-Reply-To: References: Message-ID: <1CQGsyF2K_t_sFZRHs2S4xo0LcTJHzFDHd3x8OAYSQc=.222cad3f-b36f-478a-ad88-2aa98ce2c64b@github.com> > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. 
I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. > > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. > > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. 
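To make the multiply/shift series concrete (the divisor 3 and the constants below are purely an illustration, not the exact constants C2 computes): with `c = 0xAAAAAAAB` we have `c * 3 = 2**33 + 1`, the error term is small enough that a single multiply and shift yields exact unsigned division for all 32-bit inputs, and the 64-bit product never overflows, so no add-back step is needed:

```cpp
#include <cassert>
#include <cstdint>

// Unsigned 32-bit division by 3 as one multiply and one shift.
// Since c * 3 = 2^33 + 1, x * c / 2^33 = x/3 + x/(3 * 2^33); the second
// term is < 1/6 for x < 2^32, too small to push floor(x/3) up by one.
uint32_t udiv3(uint32_t x) {
  return (uint32_t)(((uint64_t)x * 0xAAAAAAABULL) >> 33);
}
```

The maximum product here is (2^32 - 1) * 0xAAAAAAAB, which still fits in a uint64 — exactly the non-overflowing case described above.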
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: style, comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9947/files - new: https://git.openjdk.org/jdk/pull/9947/files/1ad99969..058c2ec5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=07-08 Stats: 29 lines in 1 file changed: 0 ins; 3 del; 26 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From qamai at openjdk.org Wed Oct 5 21:05:42 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 5 Oct 2022 21:05:42 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v8] In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 20:51:54 GMT, John R Rose wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> limit tests > > src/hotspot/share/opto/divnode.cpp line 89: > >> 87: //---------------magic_int_unsigned_divide_constants_down---------------------- >> 88: // Compute magic multiplier and shift constant for converting a 32 bit divide >> 89: // by constant into a multiply/add/shift series. Return false if calculations > > delete "return false" comment everywhere Thanks, I have removed them. ------------- PR: https://git.openjdk.org/jdk/pull/9947 From jrose at openjdk.org Wed Oct 5 21:15:20 2022 From: jrose at openjdk.org (John R Rose) Date: Wed, 5 Oct 2022 21:15:20 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v8] In-Reply-To: References: Message-ID: On Mon, 3 Oct 2022 16:12:37 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. 
I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. >> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and have some reviews. Thank you very much. 
> > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > limit tests src/hotspot/share/opto/divnode.cpp line 52: > 50: // minor type name and parameter changes. > 51: > 52: static void magic_int_divide_constants(jint d, jlong &M, jint &s) { I would prefer to see a strategy for applying gtest to these magic* functions (all of them, I think that's six). One way to go about it would simply make them extern and write the gtests accordingly. That would make me much happier than seeing tests for 7, 13, and a couple other constants! An example of what I mean by using gtest is in the fix for JDK-8291649 which also applied to C2 constant folding logic. I think we need this kind of test for any non-trivial constant folding or strength reduction logic, at least moving forward. If the static magic functions were moved into a header file (not required but surely the high road here) I'd call it javaArithmetic.hpp. And I'd consider moving java_add etc into there as well. Alternatively, the static magic functions could be moved into globalDefinitions.hpp which is surprising but perhaps less disruptive than making a new header file for Java arithmetic but not factoring in the existing functions. I don't recommend making a header file just for this particular algorithm all by itself; I'd rather lump more "stuff" into a single header file. (I mildly disagree with count_leading_zeros and count_trailing_zeroes being all by their lonesomes, and would expect a lumpier design putting more such stuff into powerOfTwo.hpp or a similar place. Naming and grouping is hard.) 
------------- PR: https://git.openjdk.org/jdk/pull/9947 From cslucas at openjdk.org Wed Oct 5 21:43:15 2022 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 5 Oct 2022 21:43:15 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v11] In-Reply-To: References: Message-ID: > Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? > > The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: > 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). > 2) Scalar Replace the incoming allocations to the RAM node. > 3) Scalar Replace the RAM node itself. > > There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: > > - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. > > These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: > > - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. > - The way I check if there is an incoming Allocate node to the original Phi node. > - The way I check if there is no store to the merged objects after they are merged. > > Testing: > - Windows/Linux/MAC fastdebug/release > - hotspot_all > - tier1 > - Renaissance > - dacapo > - new IR-based tests Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: Address PR feedback. Fix test & one bug. Set RAM parameter to true by default. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/9073/files - new: https://git.openjdk.org/jdk/pull/9073/files/abd474e6..203b40b1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9073&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9073&range=09-10 Stats: 205 lines in 8 files changed: 89 ins; 89 del; 27 mod Patch: https://git.openjdk.org/jdk/pull/9073.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9073/head:pull/9073 PR: https://git.openjdk.org/jdk/pull/9073 From cslucas at openjdk.org Wed Oct 5 22:14:09 2022 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 5 Oct 2022 22:14:09 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v10] In-Reply-To: References: <4EHHCOhhR5NWDpbWsHdcVl4z0dtCWSghcxk86b940PU=.62aaad87-e812-4361-8a93-1241627ecb5c@github.com> Message-ID: On Fri, 30 Sep 2022 23:16:59 GMT, Vladimir Kozlov wrote: >> Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressing PR feedback. Added new constraint for case of merging SR and NSR allocations. > > DaCapo 9.12 I think. > `DaCapo-xalan-large` showed -36% and -27% in 2 runs. Flags: `--size large --iterations 20 xalan` > `DaCapo-lusearch-large` showed -19% and -114% in 2 runs. Flags: `--size large --iterations 20 lusearch-fix` > > @ericcaspole said that Dacapo results are not stable. > > On MacOS M1 there is no difference in results (insignificant <0.5%). Hi @vnkozlov, @merykitty - I pushed some changes to address your suggestions. @vnkozlov - Thank you for the additional info about DaCapo. Looks like you're doing a comparison based on CPU or Wall Time, is that so? I did some further investigation on my end and I didn't see much change in Wall Time or Allocation Rate in Xalan or LUSearch. The DaCapo benchmarks don't allocate a lot of memory and in my experiments so far I don't see the current patch having a big impact on the performance of those benchmarks. 
I'm wondering if the effect that you're experiencing is due to some unrelated effect? _Still, I admit that I only did performance tests on Linux since I expected my changes to be independent of the target architecture and/or OS. I'm going to run additional tests on different targets and investigate any regression that I may find._ In a previous batch of perf. experiments I saw noticeable gains in some Renaissance benchmarks (experiments run on Linux). Something around a 10% reduction in allocation rate in a few benchmarks and in others I didn't see any statistical difference. I'm going to run some more perf. experiments and in more platforms/OS and get back to you. If there is any particular benchmark or argument that you think it's important for me to use, please let me know. I'm considering making the `ReduceAllocationMerges` an experimental option. I think that makes it more clear that we have a stable patch but we are still actively working to make it more widely applicable and thus increasing its positive impact on performance. What do you think? ------------- PR: https://git.openjdk.org/jdk/pull/9073 From kvn at openjdk.org Wed Oct 5 22:31:20 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 5 Oct 2022 22:31:20 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v11] In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 21:43:15 GMT, Cesar Soares Lucas wrote: >> Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? >> >> The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: >> 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). >> 2) Scalar Replace the incoming allocations to the RAM node. 
>> 3) Scalar Replace the RAM node itself. >> >> There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: >> >> - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. >> >> These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: >> >> - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. >> - The way I check if there is an incoming Allocate node to the original Phi node. >> - The way I check if there is no store to the merged objects after they are merged. >> >> Testing: >> - Windows/Linux/MAC fastdebug/release >> - hotspot_all >> - tier1 >> - Renaissance >> - dacapo >> - new IR-based tests > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Address PR feedback. Fix test & one bug. Set RAM parameter to true by default. The only issue is that experimental flags usually have a `false` default value. Alternatively, you can make it diagnostic, which is what we use for C2 optimization flags. ------------- PR: https://git.openjdk.org/jdk/pull/9073 From cslucas at openjdk.org Wed Oct 5 22:44:30 2022 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 5 Oct 2022 22:44:30 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v11] In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 21:43:15 GMT, Cesar Soares Lucas wrote: >> Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge?
>> >> The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: >> 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). >> 2) Scalar Replace the incoming allocations to the RAM node. >> 3) Scalar Replace the RAM node itself. >> >> There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: >> >> - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. >> >> These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: >> >> - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. >> - The way I check if there is an incoming Allocate node to the original Phi node. >> - The way I check if there is no store to the merged objects after they are merged. >> >> Testing: >> - Windows/Linux/MAC fastdebug/release >> - hotspot_all >> - tier1 >> - Renaissance >> - dacapo >> - new IR-based tests > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Address PR feedback. Fix test & one bug. Set RAM parameter to true by default. My idea was to have it as `false` by default. TBH, I thought that was kind of the "policy" for adding new optimizations. 
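For reference, an experimental C2 flag declaration would look roughly like this (a sketch: the flag name comes from this thread, but the exact description text and placement in `c2_globals.hpp` are assumptions):

```cpp
// Hypothetical entry in src/hotspot/share/opto/c2_globals.hpp.
// EXPERIMENTAL flags default to off and require
// -XX:+UnlockExperimentalVMOptions; the DIAGNOSTIC attribute suggested as
// an alternative is gated by -XX:+UnlockDiagnosticVMOptions instead.
product(bool, ReduceAllocationMerges, false, EXPERIMENTAL,
        "Try to simplify allocation merges before Scalar Replacement")
```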
------------- PR: https://git.openjdk.org/jdk/pull/9073 From kvn at openjdk.org Wed Oct 5 22:47:26 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 5 Oct 2022 22:47:26 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v11] In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 21:43:15 GMT, Cesar Soares Lucas wrote: >> Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? >> >> The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: >> 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). >> 2) Scalar Replace the incoming allocations to the RAM node. >> 3) Scalar Replace the RAM node itself. >> >> There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: >> >> - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. >> >> These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: >> >> - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. >> - The way I check if there is an incoming Allocate node to the original Phi node. >> - The way I check if there is no store to the merged objects after they are merged. >> >> Testing: >> - Windows/Linux/MAC fastdebug/release >> - hotspot_all >> - tier1 >> - Renaissance >> - dacapo >> - new IR-based tests > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Address PR feedback. Fix test & one bug. Set RAM parameter to true by default. 
DaCapo and Renaissance are good for testing it I think. That is where I see variations. May be we can try to use `-XX:-TieredCompilation` to see if using only C2 have effect. It seems we don't have a lot of cases where this optimization helps. May for future work based on these benchmarks (and others) we can collect cases when this optimization does not work (or even bailout compilation). BTW, were you able to remove all allocations in your test `run_IfElseInLoop()`? What about test case in https://bugs.openjdk.org/browse/JDK-6853701 ------------- PR: https://git.openjdk.org/jdk/pull/9073 From kvn at openjdk.org Wed Oct 5 22:56:19 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 5 Oct 2022 22:56:19 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v11] In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 22:40:13 GMT, Cesar Soares Lucas wrote: > My idea was to have it as `false` by default. TBH, I thought that was kind of the "policy" for adding new optimizations. Yes, experimental flag is good for now if you DO plan to switch it ON later (and convert flag into other type). The trouble with `false` experimental flags is that code become dead and "rot" if nobody care about it. ------------- PR: https://git.openjdk.org/jdk/pull/9073 From kvn at openjdk.org Wed Oct 5 23:00:26 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 5 Oct 2022 23:00:26 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v11] In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 22:40:13 GMT, Cesar Soares Lucas wrote: >> Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: >> >> Address PR feedback. Fix test & one bug. Set RAM parameter to true by default. > > My idea was to have it as `false` by default. TBH, I thought that was kind of the "policy" for adding new optimizations. @JohnTortugo I just noticed that both new tests missing `{}` in a lot of conditional cases. 
Please, fix it. Code style. ------------- PR: https://git.openjdk.org/jdk/pull/9073 From cslucas at openjdk.org Wed Oct 5 23:37:11 2022 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 5 Oct 2022 23:37:11 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v12] In-Reply-To: References: Message-ID: > Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? > > The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: > 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). > 2) Scalar Replace the incoming allocations to the RAM node. > 3) Scalar Replace the RAM node itself. > > There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: > > - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. > > These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: > > - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. > - The way I check if there is an incoming Allocate node to the original Phi node. > - The way I check if there is no store to the merged objects after they are merged. > > Testing: > - Windows/Linux/MAC fastdebug/release > - hotspot_all > - tier1 > - Renaissance > - dacapo > - new IR-based tests Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: Fix code style. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/9073/files - new: https://git.openjdk.org/jdk/pull/9073/files/203b40b1..a03b91a7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9073&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9073&range=10-11 Stats: 154 lines in 2 files changed: 48 ins; 20 del; 86 mod Patch: https://git.openjdk.org/jdk/pull/9073.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9073/head:pull/9073 PR: https://git.openjdk.org/jdk/pull/9073 From cslucas at openjdk.org Wed Oct 5 23:47:25 2022 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 5 Oct 2022 23:47:25 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v11] In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 22:52:21 GMT, Vladimir Kozlov wrote: > Yes, experimental flag is good for now if you DO plan to switch it ON later (and convert flag into other type). Yes, our plan is to keep working on this until we can remove all the constraints in `can_reduce_this_phi`. We have already discussed ideas of how to remove those constraints and we'll be tackling that (and discussing it with you & the community) after we get this initial implementation upstreamed. > The trouble with false experimental flags is that code become dead and "rot" if nobody care about it. Totally understand your concern. I'm fine with setting the flag to true and making it "Diagnostic" - I just don't see it as a "Diagnostic" patch, but if it's fine for you then it's fine for me. 
------------- PR: https://git.openjdk.org/jdk/pull/9073 From cslucas at openjdk.org Wed Oct 5 23:55:25 2022 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 5 Oct 2022 23:55:25 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v12] In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 23:37:11 GMT, Cesar Soares Lucas wrote: >> Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? >> >> The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: >> 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). >> 2) Scalar Replace the incoming allocations to the RAM node. >> 3) Scalar Replace the RAM node itself. >> >> There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: >> >> - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. >> >> These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: >> >> - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. >> - The way I check if there is an incoming Allocate node to the original Phi node. >> - The way I check if there is no store to the merged objects after they are merged. >> >> Testing: >> - Windows/Linux/MAC fastdebug/release >> - hotspot_all >> - tier1 >> - Renaissance >> - dacapo >> - new IR-based tests > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Fix code style. 
test/hotspot/jtreg/compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java line 666: > 664: > 665: @Test > 666: @IR(failOn = { IRNode.ALLOC }) @vnkozlov - Both allocs will be removed. ------------- PR: https://git.openjdk.org/jdk/pull/9073 From cslucas at openjdk.org Thu Oct 6 00:07:56 2022 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Thu, 6 Oct 2022 00:07:56 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v11] In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 22:45:22 GMT, Vladimir Kozlov wrote: > May for future work based on these benchmarks (and others) we can collect cases when this optimization does not work (or even bailout compilation). The `TraceReducedAllocationMerges` option prints information about this. I actually have a spreadsheet where I list the cause and frequency of each case where the optimization can not be applied. > BTW, were you able to remove all allocations in your test run_IfElseInLoop()? Yes, in that case both allocations are removed. I just confirmed it with a test locally. Also, there is an IR-based test for that case. > What about test case in https://bugs.openjdk.org/browse/JDK-6853701 The current patch bails out in that test because there is a Phi (or CmpP) consuming the merge Phi. Actually, that code example is one of the tests that I run "internally". There is already work going on to improve the current patch to make it able to handle CmpP with NULL. ------------- PR: https://git.openjdk.org/jdk/pull/9073 From kvn at openjdk.org Thu Oct 6 01:22:07 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 6 Oct 2022 01:22:07 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v12] In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 23:37:11 GMT, Cesar Soares Lucas wrote: >> Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? 
>> >> The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: >> 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). >> 2) Scalar Replace the incoming allocations to the RAM node. >> 3) Scalar Replace the RAM node itself. >> >> There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: >> >> - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. >> >> These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: >> >> - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. >> - The way I check if there is an incoming Allocate node to the original Phi node. >> - The way I check if there is no store to the merged objects after they are merged. >> >> Testing: >> - Windows/Linux/MAC fastdebug/release >> - hotspot_all >> - tier1 >> - Renaissance >> - dacapo >> - new IR-based tests > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Fix code style. Thank you for answering my questions and doing all research. Let do this: make flag "experimental" and `true` by default. This will allow to test it for some time after changes are integrated. If everything looks good we keep it that way otherwise we can switch it off before JDK 20 is shipped (or switch it off regardless before shipping). I will start new round of testing (more tiers) and I will ask others to review changes. I think we should go with current state of changes if it satisfied other reviewers. 
------------- PR: https://git.openjdk.org/jdk/pull/9073 From kvn at openjdk.org Thu Oct 6 01:29:24 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 6 Oct 2022 01:29:24 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v11] In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 00:04:12 GMT, Cesar Soares Lucas wrote: >> DaCapo and Renaissance are good for testing it I think. That is where I see variations. May be we can try to use `-XX:-TieredCompilation` to see if using only C2 have effect. >> >> It seems we don't have a lot of cases where this optimization helps. May for future work based on these benchmarks (and others) we can collect cases when this optimization does not work (or even bailout compilation). >> >> BTW, were you able to remove all allocations in your test `run_IfElseInLoop()`? >> What about test case in https://bugs.openjdk.org/browse/JDK-6853701 > >> May for future work based on these benchmarks (and others) we can collect cases when this optimization does not work (or even bailout compilation). > > The `TraceReducedAllocationMerges` option prints information about this. I actually have a spreadsheet where I list the cause and frequency of each case where the optimization can not be applied. > >> BTW, were you able to remove all allocations in your test run_IfElseInLoop()? > > Yes, in that case both allocations are removed. I just confirmed it with a test locally. Also, there is an IR-based test for that case. > >> What about test case in https://bugs.openjdk.org/browse/JDK-6853701 > > The current patch bails out in that test because there is a Phi (or CmpP) consuming the merge Phi. Actually, that code example is one of the tests that I run "internally". There is already work going on to improve the current patch to make it able to handle CmpP with NULL. @JohnTortugo Looks like your latest change/fix cause new issue. 
I see GitHub Action testing failed with: # Internal Error (/home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:1825), pid=3815, tid=3830 # assert(_base == Int) failed: Not an Int ------------- PR: https://git.openjdk.org/jdk/pull/9073 From svkamath at openjdk.org Thu Oct 6 06:28:04 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Thu, 6 Oct 2022 06:28:04 GMT Subject: RFR: 8289552: Make intrinsic conversions between bit representations of half precision values and floats [v13] In-Reply-To: References: Message-ID: > 8289552: Make intrinsic conversions between bit representations of half precision values and floats Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: Updated instruct to use kmovw ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9781/files - new: https://git.openjdk.org/jdk/pull/9781/files/69999ce4..a00c3ecd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9781&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9781&range=11-12 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9781.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9781/head:pull/9781 PR: https://git.openjdk.org/jdk/pull/9781 From svkamath at openjdk.org Thu Oct 6 06:28:06 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Thu, 6 Oct 2022 06:28:06 GMT Subject: RFR: 8289552: Make intrinsic conversions between bit representations of half precision values and floats [v11] In-Reply-To: References: <_Ghl2lsnrBhiWvVD3TMiwGo6SfQLl6idczb1QVqLa_I=.7cfa48e2-2987-43e0-a689-0e3462e4d270@github.com> Message-ID: On Tue, 4 Oct 2022 09:07:42 GMT, Quan Anh Mai wrote: >>> You can use `kmovwl` instead which will relax the avx512bw constraint, however, you will need avx512vl for `evcvtps2ph`. Thanks. >> >> Yes, in general all AVX512VL targets support AVX512BW, but cloud instances give freedom to enable custom features. 
Regarding K0, as per section "15.6.1.1" of SDM, expectation is that K0 can appear in source and destination of regular non predication context, k0 should always contain all true mask so it should be unmodifiable for subsequent usages i.e. should not be present as destination of a mask manipulating instruction. Your suggestion is to have that in source but it may not work either. Changing existing sequence to use kmovw and replace AVX512BW with AVX512VL will again mean introducing an additional predication check for this pattern. > > Ah I get it, the encoding of k0 is treated specially in predicated instructions to refer to an all-set mask, but the register itself may not actually contain that value. So usage in `kshiftrw` may fail. In that case I think we can generate an all-set mask on the fly using `kxnorw(ktmp, ktmp, ktmp)` to save a GPR in this occasion. Thanks. Hi @merykitty, I am seeing performance regression with kxnorw instruction. So I have updated the PR with kmovwl. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/9781 From dlong at openjdk.org Thu Oct 6 08:45:12 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 6 Oct 2022 08:45:12 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v4] In-Reply-To: References: Message-ID: On Sat, 1 Oct 2022 10:22:42 GMT, Quan Anh Mai wrote: >> Hi, >> >> The current peephole mechanism has several drawbacks: >> - Can only match and remove adjacent instructions. >> - Cannot match machine ideal nodes (e.g MachSpillCopyNode). >> - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. >> - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. >> >> The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. 
This allows the peephole mechanism to perform several transformations effectively in a more fine-grain manner. >> >> The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: >> >> mov r1, r2 -> lea r1, [r2 + r3/i] >> add r1, r3/i >> >> and >> >> mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 >> shl r1, i >> >> On the added benchmarks, the transformations show positive results: >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 1200.490 ? 104.662 ns/op >> LeaPeephole.B_D_long avgt 5 1211.439 ? 30.196 ns/op >> LeaPeephole.B_I_int avgt 5 1118.831 ? 7.995 ns/op >> LeaPeephole.B_I_long avgt 5 1112.389 ? 15.838 ns/op >> LeaPeephole.I_S_int avgt 5 1262.528 ? 7.293 ns/op >> LeaPeephole.I_S_long avgt 5 1223.820 ? 17.777 ns/op >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 860.889 ? 6.089 ns/op >> LeaPeephole.B_D_long avgt 5 945.455 ? 21.603 ns/op >> LeaPeephole.B_I_int avgt 5 849.109 ? 9.809 ns/op >> LeaPeephole.B_I_long avgt 5 851.283 ? 16.921 ns/op >> LeaPeephole.I_S_int avgt 5 976.594 ? 23.004 ns/op >> LeaPeephole.I_S_long avgt 5 936.984 ? 9.601 ns/op >> >> A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > check index src/hotspot/cpu/x86/peephole_x86.cpp line 28: > 26: #ifdef COMPILER2 > 27: > 28: #include "opto/peephole.hpp" I don't see why opto/peephole.hpp is useful. Why not just include peephole_x86.hpp? Then the empty peephole_.hpp for the other platforms are no longer needed. src/hotspot/cpu/x86/peephole_x86.cpp line 50: > 48: inst1 = inst0->in(1)->as_Mach(); > 49: src1 = in; > 50: } I don't understand why this optimization requires MachSpillCopy. 
Is that the only time we should see mov+add or mov+shift? src/hotspot/cpu/x86/peephole_x86.cpp line 132: > 130: cfg_->map_node_to_block(proj, nullptr); > 131: cfg_->map_node_to_block(root, block); > 132: A lot of this seems like boiler-plate that could be refactored to make writing new peephole helpers simpler and less error-prone. ------------- PR: https://git.openjdk.org/jdk/pull/8025 From epeter at openjdk.org Thu Oct 6 10:41:26 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 6 Oct 2022 10:41:26 GMT Subject: RFR: 8294839: Disable StressLongCountedLoop in compiler/loopopts/TestRemoveEmptyLoop.java In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 08:31:56 GMT, Emanuel Peter wrote: > As I have explained in [JDK-8294838](https://bugs.openjdk.org/browse/JDK-8294838), `-XX:StressLongCountedLoop=200000000` leads to a timeout in `compiler/loopopts/TestRemoveEmptyLoop.java` because an int-loop is transformed into a long-loop, but the loop is not collapsed/removed because we only remove empty int-loops. > > This is a sub-task of [JDK-8294838](https://bugs.openjdk.org/browse/JDK-8294838), where we hope to implement empty loop removal for LongCountedLoops. > > Manually tested the test with and without the flag. > Test suite passes too, with and without the flag.
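[Editorial aside: the loop shape behind the timeout is easy to picture. The sketch below is illustrative only — the method name is invented and the actual jtreg test differs — but it shows the pattern: C2 removes empty int counted loops, while under -XX:StressLongCountedLoop the loop is rewritten to a long counted loop, which empty-loop removal does not yet handle.]

```java
// An empty counted loop: no side effects in the body, so C2 can collapse
// it when it stays an int loop. Once converted to a long counted loop
// (stress flag), the same shape is left in place and must run to completion.
public class EmptyLoopSketch {
    static int spin(int n) {
        for (int i = 0; i < n; i++) {
            // empty body: dead work, candidate for empty-loop removal
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(spin(1_000_000));
    }
}
```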
Thanks for the help with understanding this issue @rwestrel , thanks for the reviews @TobiHartmann @chhagedorn ------------- PR: https://git.openjdk.org/jdk/pull/10569 From epeter at openjdk.org Thu Oct 6 10:42:54 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 6 Oct 2022 10:42:54 GMT Subject: Integrated: 8294839: Disable StressLongCountedLoop in compiler/loopopts/TestRemoveEmptyLoop.java In-Reply-To: References: Message-ID: <2bbVxfLtLcAdM8bPCZQIClF3i-51QF6B1AZvG0-e6aQ=.66260ea1-5966-4a04-a030-d8d57a2d64cd@github.com> On Wed, 5 Oct 2022 08:31:56 GMT, Emanuel Peter wrote: > As I have explained in [JDK-8294838](https://bugs.openjdk.org/browse/JDK-8294838), `-XX:StressLongCountedLoop=200000000` leads to a timeout in `compiler/loopopts/TestRemoveEmptyLoop.java` because an int-loop is transformed into a long-loop, but the loop is not collapsed/removed because we only remove empty int-loops. > > This is a sub-task of [JDK-8294838](https://bugs.openjdk.org/browse/JDK-8294838), where we hope to implement empty loop removal for LongCountedLoops. > > Manually tested the test with and without the flag. > Test suite passes too, with and without the flag. This pull request has now been integrated. 
Changeset: 73f06468 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/73f06468ae7f9eebb8e37f2a534d2c19a8dac60d Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8294839: Disable StressLongCountedLoop in compiler/loopopts/TestRemoveEmptyLoop.java Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/10569 From qamai at openjdk.org Thu Oct 6 12:28:28 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 6 Oct 2022 12:28:28 GMT Subject: RFR: 8294865: x86: Improve the code generation of MulVB and MulVL [v2] In-Reply-To: References: Message-ID: > Hi, > > This patch simplifies and improves the code generation of `MulVB` and `MulVL` nodes, > > - MulVB can be implemented by alternating `vmullw` on odd and even-index elements and combining the results. > - MulVL can be implemented on non-avx512dq by computing the product of each 32-bit half and adding the results together. > > Vector API benchmark shows the results of `MUL` operations: > > Before After > Benchmark (size) Mode Cnt Score Error Score Error Units Change > Byte64Vector.MUL 1024 thrpt 15 8948.607 ? 194.646 8860.404 ? 203.109 ops/ms -0.99% > Byte128Vector.MUL 1024 thrpt 15 12915.839 ? 291.262 13554.662 ? 488.695 ops/ms +4.95% > Byte256Vector.MUL 1024 thrpt 15 12129.959 ? 245.710 23279.276 ? 669.725 ops/ms +91.92% > Long128Vector.MUL 1024 thrpt 15 1183.663 ? 36.440 1489.892 ? 35.356 ops/ms +25.87% > Long256Vector.MUL 1024 thrpt 15 1911.802 ? 95.304 2834.088 ? 77.647 ops/ms +48.24% > > Please have a look and have some reviews, thank you very much. 
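[Editorial aside: the non-AVX512DQ `MulVL` lowering mentioned in the description rests on a standard scalar identity — a 64x64->64 multiply can be assembled from 32-bit half products, which is what the vector sequence computes lane-wise. A plain-Java sketch of that identity (not the vector code itself):]

```java
// Each 64-bit product a*b (mod 2^64) equals lo(a)*lo(b) plus the two cross
// products shifted into the high half; the hi(a)*hi(b) term overflows out.
public class MulViaHalves {
    static long mulViaHalves(long a, long b) {
        long alo = a & 0xFFFFFFFFL, ahi = a >>> 32;
        long blo = b & 0xFFFFFFFFL, bhi = b >>> 32;
        long cross = (alo * bhi + ahi * blo) << 32; // only low 32 bits survive the shift
        return alo * blo + cross;
    }

    public static void main(String[] args) {
        long a = 0x123456789ABCDEF0L, b = 0x0FEDCBA987654321L;
        if (mulViaHalves(a, b) != a * b) throw new AssertionError();
        System.out.println("identity holds");
    }
}
```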
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: refactor conditions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10571/files - new: https://git.openjdk.org/jdk/pull/10571/files/6bbbb077..51d39f78 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10571&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10571&range=00-01 Stats: 12 lines in 1 file changed: 4 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/10571.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10571/head:pull/10571 PR: https://git.openjdk.org/jdk/pull/10571 From fyang at openjdk.org Thu Oct 6 12:29:32 2022 From: fyang at openjdk.org (Fei Yang) Date: Thu, 6 Oct 2022 12:29:32 GMT Subject: RFR: 8290154: [JVMCI] partially implement JVMCI for RISC-V [v10] In-Reply-To: References: <01N2Slfoz83bKVvbH3Ja0O0cOI-rcagrV6jeIdi3dws=.4cce1f7e-2223-4013-bb11-8319aef46444@github.com> Message-ID: <8WpmQnd37FlbLW2mt5xvMqTg4vVH_AZs8ng-3310Sg0=.6499c9a4-3b0c-4714-91ec-aa5d9c6a3d80@github.com> On Wed, 5 Oct 2022 11:33:14 GMT, Sacha Coppey wrote: > Hello, this PR has been stuck for some time now. What should I do to proceed? The current version does not build. I will take another look after this is rebased on the latest jdk master. ------------- PR: https://git.openjdk.org/jdk/pull/9587 From duke at openjdk.org Thu Oct 6 12:41:46 2022 From: duke at openjdk.org (Sacha Coppey) Date: Thu, 6 Oct 2022 12:41:46 GMT Subject: RFR: 8290154: [JVMCI] partially implement JVMCI for RISC-V [v11] In-Reply-To: References: Message-ID: <4ozXv5Ok4bRtzCO28vQ3BSb_NLatUYdcT6u_rHCqk08=.7f12ef55-5276-4919-be3c-c8ffc45c0c2b@github.com> > This patch adds a partial JVMCI implementation for RISC-V, to allow using the GraalVM Native Image RISC-V LLVM backend, which does not use JVMCI for code emission. > It creates the jdk.vm.ci.riscv64 and jdk.vm.ci.hotspot.riscv64 packages, as well as implements a part of jvmciCodeInstaller_riscv64.cpp. 
To check for correctness, it enables JVMCI code installation tests on RISC-V. More testing is performed in Native Image. Sacha Coppey has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits: - Remove noinline attribute by fixing sign extended value - Remove vector registers from get_hotspot_reg - Add a comments for the change in deoptimization.hpp - Fix error when emitting LUI and removed vector registers - Ensure all JVMCI tests pass on RISC-V - Add space in switch - Avoid using set_destination when call is not jal - Use nativeInstruction_at instead of nativeCall_at to avoid wrongly initializating a call - 8290154: [JVMCI] Implement JVMCI for RISC-V ------------- Changes: https://git.openjdk.org/jdk/pull/9587/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9587&range=10 Stats: 1723 lines in 20 files changed: 1701 ins; 0 del; 22 mod Patch: https://git.openjdk.org/jdk/pull/9587.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9587/head:pull/9587 PR: https://git.openjdk.org/jdk/pull/9587 From qamai at openjdk.org Thu Oct 6 13:12:19 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 6 Oct 2022 13:12:19 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v4] In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 08:39:45 GMT, Dean Long wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> check index > > src/hotspot/cpu/x86/peephole_x86.cpp line 28: > >> 26: #ifdef COMPILER2 >> 27: >> 28: #include "opto/peephole.hpp" > > I don't see why opto/peephole.hpp is useful. Why not just include peephole_x86.hpp? Then the empty peephole_.hpp for the other platforms are no longer needed. `opto/peephole.hpp` is needed from the generated `ad_x86_peephole.cpp` so that `addI_rRegNode::peephole` can call the helper functions. 
> src/hotspot/cpu/x86/peephole_x86.cpp line 50: > >> 48: inst1 = inst0->in(1)->as_Mach(); >> 49: src1 = in; >> 50: } > > I don't understand why this optimization requires MachSpillCopy. Is that the only time we sould see mov+add or mov+shift? Yes, `MachSpillCopy` is the node inserted by the register allocator to move values around. In this occasion, a move is inserted because the live range of the input overlaps with the live range of the output of an add or shift instruction. > src/hotspot/cpu/x86/peephole_x86.cpp line 132: > >> 130: cfg_->map_node_to_block(proj, nullptr); >> 131: cfg_->map_node_to_block(root, block); >> 132: > > A lot of this seems like boiler-plate that could be refactored to make writing new peephole helpers simpler and less error-prone. While it seems to be a little boiler-plate, I think a general helper that removes, inserts an arbitrary number of nodes, and connects the graph correctly is hard to write. What do you think? Thanks a lot. ------------- PR: https://git.openjdk.org/jdk/pull/8025 From qamai at openjdk.org Thu Oct 6 14:17:21 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 6 Oct 2022 14:17:21 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v11] In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 00:04:12 GMT, Cesar Soares Lucas wrote: >> DaCapo and Renaissance are good for testing it I think. That is where I see variations. May be we can try to use `-XX:-TieredCompilation` to see if using only C2 have effect. >> >> It seems we don't have a lot of cases where this optimization helps. May for future work based on these benchmarks (and others) we can collect cases when this optimization does not work (or even bailout compilation). >> >> BTW, were you able to remove all allocations in your test `run_IfElseInLoop()`? 
>> What about test case in https://bugs.openjdk.org/browse/JDK-6853701 > >> May for future work based on these benchmarks (and others) we can collect cases when this optimization does not work (or even bailout compilation). > > The `TraceReducedAllocationMerges` option prints information about this. I actually have a spreadsheet where I list the cause and frequency of each case where the optimization can not be applied. > >> BTW, were you able to remove all allocations in your test run_IfElseInLoop()? > > Yes, in that case both allocations are removed. I just confirmed it with a test locally. Also, there is an IR-based test for that case. > >> What about test case in https://bugs.openjdk.org/browse/JDK-6853701 > > The current patch bails out in that test because there is a Phi (or CmpP) consuming the merge Phi. Actually, that code example is one of the tests that I run "internally". There is already work going on to improve the current patch to make it able to handle CmpP with NULL. @JohnTortugo I meant you can specify the warmup iterations for the whole test, not just some methods inside with `TestFramework::setDefaultWarmup` > The `TraceReducedAllocationMerges` option prints information about this. I actually have a spreadsheet where I list the cause and frequency of each case where the optimization can not be applied. It would be great if you include all known failure cases of scalar replacement in the IR test. Thanks. 
------------- PR: https://git.openjdk.org/jdk/pull/9073 From kvn at openjdk.org Thu Oct 6 16:24:42 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 6 Oct 2022 16:24:42 GMT Subject: RFR: 8294865: x86: Improve the code generation of MulVB and MulVL [v2] In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 12:28:28 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch simplifies and improves the code generation of `MulVB` and `MulVL` nodes, >> >> - MulVB can be implemented by alternating `vmullw` on odd and even-index elements and combining the results. >> - MulVL can be implemented on non-avx512dq by computing the product of each 32-bit half and adding the results together. >> >> Vector API benchmark shows the results of `MUL` operations: >> >> Before After >> Benchmark (size) Mode Cnt Score Error Score Error Units Change >> Byte64Vector.MUL 1024 thrpt 15 8948.607 ? 194.646 8860.404 ? 203.109 ops/ms -0.99% >> Byte128Vector.MUL 1024 thrpt 15 12915.839 ? 291.262 13554.662 ? 488.695 ops/ms +4.95% >> Byte256Vector.MUL 1024 thrpt 15 12129.959 ? 245.710 23279.276 ? 669.725 ops/ms +91.92% >> Long128Vector.MUL 1024 thrpt 15 1183.663 ? 36.440 1489.892 ? 35.356 ops/ms +25.87% >> Long256Vector.MUL 1024 thrpt 15 1911.802 ? 95.304 2834.088 ? 77.647 ops/ms +48.24% >> >> Please have a look and have some reviews, thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > refactor conditions My testing passed. You need second review. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10571 From kvn at openjdk.org Thu Oct 6 16:29:08 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 6 Oct 2022 16:29:08 GMT Subject: RFR: 8281453: New optimization: convert `~x` into `-1-x` when `~x` is used in an arithmetic expression [v15] In-Reply-To: References: <4mTZu0_hVWb-ztMxMabFilyXAnAqOStCvU9wPmfqCKM=.fa8b7797-6e20-4c9e-80f1-b55ba3d5fe39@github.com> Message-ID: On Thu, 29 Sep 2022 20:52:41 GMT, Zhiqiang Zang wrote: >> Similar to `(~x)+c` -> `(c-1)-x` and `~(x+c)` -> `(-c-1)-x` in #6858, we can also introduce similar optimizations for subtraction, `c-(~x)` -> `x+(c+1)` and `~(c-x)` -> `x+(-c-1)`. >> >> To generalize, I convert `~x` into `-1-x` when `~x` is used only in arithmetic expression. For example, `c-(~x)` will be converted into `c-(-1-x)` which will match other pattern and will be transformed again in next iteration and finally become `x+(c+1)`. >> >> Also the conversion from `~x` into `-1-x` happens when `x` is an arithmetic expression itself. For example, `~(x+c)` will be transformed into `-1-(x+c)` and eventually `(-c-1)-x`. >> >> The results of the microbenchmark are as follows: >> >> Baseline: >> Benchmark Mode Cnt Score Error Units >> NotOpTransformation.baselineInt avgt 60 0.448 ? 0.002 ns/op >> NotOpTransformation.baselineLong avgt 60 0.448 ? 0.002 ns/op >> NotOpTransformation.testInt1 avgt 60 0.615 ? 0.003 ns/op >> NotOpTransformation.testInt2 avgt 60 0.838 ? 0.004 ns/op >> NotOpTransformation.testLong1 avgt 60 0.671 ? 0.003 ns/op >> NotOpTransformation.testLong2 avgt 60 0.670 ? 0.003 ns/op >> >> Patch: >> Benchmark Mode Cnt Score Error Units >> NotOpTransformation.baselineInt avgt 60 0.451 ? 0.003 ns/op >> NotOpTransformation.baselineLong avgt 60 0.447 ? 0.002 ns/op >> NotOpTransformation.testInt1 avgt 60 0.334 ? 0.002 ns/op >> NotOpTransformation.testInt2 avgt 60 0.334 ? 0.002 ns/op >> NotOpTransformation.testLong1 avgt 60 0.334 ? 0.002 ns/op >> NotOpTransformation.testLong2 avgt 60 0.335 ? 
0.002 ns/op > > Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: > > remove one use check for long as well. Latest testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/7376 From kvn at openjdk.org Thu Oct 6 16:29:10 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 6 Oct 2022 16:29:10 GMT Subject: RFR: 8281453: New optimization: convert `~x` into `-1-x` when `~x` is used in an arithmetic expression [v9] In-Reply-To: <9GCtDsjTHPI87UwIzQs_vlu9quyaVIOL788oEkuHHks=.abeaebdb-34f9-49e1-a68b-8e5b974fb109@github.com> References: <4mTZu0_hVWb-ztMxMabFilyXAnAqOStCvU9wPmfqCKM=.fa8b7797-6e20-4c9e-80f1-b55ba3d5fe39@github.com> <0mZU-PW2DcYPutXuiyKN-fvtsSFk4QNSZuFBjec3ky4=.f5e045d2-c730-48c5-b069-b89450c2e672@github.com> <3yAr7RliyugB8jQqKxeHc99my3zO9RCCy31taABj4bo=.5b2391ec-7241-43cc-9f2f-e40d6f96f7b7@github.com> <9GCtDsjTHPI87UwIzQs_vlu9quyaVIOL788oEkuHHks=.abeaebdb-34f9-49e1-a68b-8e5b974fb109@github.com> Message-ID: <6GnVEgxxkiX9FclvthtwZUttdTYJ2gGbVL69i05sKwQ=.dbe27302-f012-43c5-a572-204aac94e269@github.com> On Thu, 29 Sep 2022 20:49:31 GMT, Zhiqiang Zang wrote: >> @CptGit It is due to the fact that GVN runs as soon as the frontend parses the code, during which the graph is incomplete, and you get `outcnt() == 0` because the uses of the node have not been parsed yet. You can defer the transformation to IGVN, which happens later. Thanks. > > @merykitty I removed the use check. Does it look good to you? I did not include test `(x + y) & ~(x + y) => 0` because I found we do not have such idealization in `AndINode` because even `x & ~x => 0` is not supported. @CptGit Did you update the performance numbers for the latest changes? Did they change? 
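For readers following the thread, the identities behind this transform are easy to verify in two's-complement arithmetic: `~x == -1 - x` for every int, and therefore `c - (~x) == x + (c + 1)` even with wraparound. A standalone sanity check, not part of the patch:

```java
// Checks the algebraic identities used by the 8281453 idealization:
//   ~x == -1 - x            (two's complement definition of bitwise NOT)
//   c - (~x) == x + (c + 1) (substituting the first identity and simplifying)
// Java int arithmetic wraps mod 2^32, so both identities hold for all values.
public class NotIdentityDemo {
    static boolean holds(int x, int c) {
        return (~x) == (-1 - x) && (c - ~x) == (x + c + 1);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 42, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples)
            for (int c : samples)
                if (!holds(x, c))
                    throw new AssertionError(x + ", " + c);
        System.out.println("identities hold");
    }
}
```

Note that the edge cases work out too: for `x == Integer.MIN_VALUE`, both `~x` and `-1 - x` wrap to `Integer.MAX_VALUE`.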
------------- PR: https://git.openjdk.org/jdk/pull/7376 From cslucas at openjdk.org Thu Oct 6 16:50:28 2022 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Thu, 6 Oct 2022 16:50:28 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v13] In-Reply-To: References: Message-ID: > Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? > > The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: > 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). > 2) Scalar Replace the incoming allocations to the RAM node. > 3) Scalar Replace the RAM node itself. > > There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: > > - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. > > These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: > > - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. > - The way I check if there is an incoming Allocate node to the original Phi node. > - The way I check if there is no store to the merged objects after they are merged. > > Testing: > - Windows/Linux/MAC fastdebug/release > - hotspot_all > - tier1 > - Renaissance > - dacapo > - new IR-based tests Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: Fix x86 tests. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/9073/files - new: https://git.openjdk.org/jdk/pull/9073/files/a03b91a7..9e7163a3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9073&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9073&range=11-12 Stats: 4 lines in 1 file changed: 0 ins; 2 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9073.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9073/head:pull/9073 PR: https://git.openjdk.org/jdk/pull/9073 From duke at openjdk.org Thu Oct 6 17:18:36 2022 From: duke at openjdk.org (Sacha Coppey) Date: Thu, 6 Oct 2022 17:18:36 GMT Subject: RFR: 8290154: [JVMCI] partially implement JVMCI for RISC-V [v12] In-Reply-To: References: Message-ID: > This patch adds a partial JVMCI implementation for RISC-V, to allow using the GraalVM Native Image RISC-V LLVM backend, which does not use JVMCI for code emission. > It creates the jdk.vm.ci.riscv64 and jdk.vm.ci.hotspot.riscv64 packages, as well as implements a part of jvmciCodeInstaller_riscv64.cpp. To check for correctness, it enables JVMCI code installation tests on RISC-V. More testing is performed in Native Image. 
Sacha Coppey has updated the pull request incrementally with one additional commit since the last revision: Replace RegisterImpl and FloatRegisterImpl uses by Register and FloatRegister ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9587/files - new: https://git.openjdk.org/jdk/pull/9587/files/46159813..225fee42 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9587&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9587&range=10-11 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/9587.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9587/head:pull/9587 PR: https://git.openjdk.org/jdk/pull/9587 From duke at openjdk.org Thu Oct 6 17:18:37 2022 From: duke at openjdk.org (Sacha Coppey) Date: Thu, 6 Oct 2022 17:18:37 GMT Subject: RFR: 8290154: [JVMCI] partially implement JVMCI for RISC-V [v10] In-Reply-To: <8WpmQnd37FlbLW2mt5xvMqTg4vVH_AZs8ng-3310Sg0=.6499c9a4-3b0c-4714-91ec-aa5d9c6a3d80@github.com> References: <01N2Slfoz83bKVvbH3Ja0O0cOI-rcagrV6jeIdi3dws=.4cce1f7e-2223-4013-bb11-8319aef46444@github.com> <8WpmQnd37FlbLW2mt5xvMqTg4vVH_AZs8ng-3310Sg0=.6499c9a4-3b0c-4714-91ec-aa5d9c6a3d80@github.com> Message-ID: On Thu, 6 Oct 2022 12:27:11 GMT, Fei Yang wrote: > The current version does not build. I will take another look after this is rebased on the latest jdk master. Sorry for the delay, I rebased the PR and fixed the building issue. ------------- PR: https://git.openjdk.org/jdk/pull/9587 From xxinliu at amazon.com Thu Oct 6 17:42:19 2022 From: xxinliu at amazon.com (Liu, Xin) Date: Thu, 6 Oct 2022 10:42:19 -0700 Subject: RFC: Partial Escape Analysis in HotSpot C2 Message-ID: Hi, We would like to pursuit PEA in HotSpot. I spent time thinking how to adapt Stadler's Partial Escape Analysis[1] to C2. I think there are 3 elements in it. 1) flow-sensitive escape analysis 2) lazy code motion for the allocation and initialization 3) on-the-fly scalar replacement. 
The most complex part is 3) and it has done by C2. I'd like to leverage that, so I come up an idea to focus only on escaped objects in the algorithm and delegate others to the existing C2 phases. Here is my RFC. May I get your precious time on this? https://gist.github.com/navyxliu/62a510a5c6b0245164569745d758935b#rfc-partial-escape-analysis-in-hotspot-c2 The idea is based on the following two observations. 1. Stadler's PEA can cooperate with C2 EA/SR. If an object moves to the place it is about to escape, it won't impact C2 EA/SR later. It's because it will be marked as 'GlobalEscaped'. C2 EA won't do anything for it anyway. If PEA don't touch a non-escaped object, it won't change its escapability. It can punt it to C2 EA/SR and the result is still same. 2. The original AllocationNode is either dead or scalar replaceable after Stadler's PEA. Stadler's algorithm virtualizes an allocation Node and materializes it on demand. There are 2 places to materialize it. 1) the virtual object is about to escape 2) MergeProcessor needs to merge an object and at least one of its predecessor has materialized. MergeProcessor has to materialize all virtual objects in other predecessors([1] 5.3, Merge nodes). We can prove the observation 2 using 'proof of contradiction' here. Assume the original Allocation node is neither dead nor Scalar Replaced after Stadler's PEA, and program is still correct. Program must need the original allocation node somewhere. The algorithm has deleted the original allocation node in virtualization step and never bring it back. It contradicts that the program is still correct. QED. If you're convinced, then we can leverage it. In my design, I don't virtualize the original node but just leave it there. C2 MacroExpand phase will take care of the original allocation node as long as it's either dead or scalar-replaceable. It never get a chance to expand. If we restrain on-the-fly scalar replacement in Stadler's PEA, we can delegate it to C2 EA/SR! 
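To make the discussion concrete, here is a hypothetical Java method of the shape partial escape analysis targets: the allocation escapes on only one path, so the transform could sink the materialization into the escaping branch and leave the other path to scalar replacement. The class and field names here are invented purely for illustration and do not come from the RFC:

```java
// Sketch of a partial-escape candidate: the allocation of `p` escapes only on
// the `escape` branch (via the store to a static field). A flow-sensitive
// analysis can keep `p` virtual at the allocation site, materialize it inside
// the branch where it escapes, and scalar-replace it on the fall-through path.
public class PartialEscapeShape {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static Object sink; // storing into a static field makes the object escape

    static int demo(boolean escape) {
        Point p = new Point(1, 2);  // allocation in the common entry block
        if (escape) {
            sink = p;               // p escapes only here; materialize here
            return 0;
        }
        return p.x + p.y;           // non-escaping path: fields usable directly
    }

    public static void main(String[] args) {
        System.out.println(demo(false) + " " + demo(true));
    }
}
```

Without a flow-sensitive analysis, C2's existing escape analysis would mark `p` as globally escaping because of the store on one branch, which is exactly the case the RFC wants to improve.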
There are 3 gains: 1) I don't think I can write bug-free Scalar Replacement... 2) This approach can automatically pick up C2 EA/SR improvements in the future, such as JDK-8289943. 3) If we focus only on 'escaped objects', we even don't need to deal with deoptimization. Only 'scalar replaceable' objects need to save Object states for deoptimization. Escaped objects disqualify that. [1]: Stadler, Lukas, Thomas Würthinger, and Hanspeter Mössenböck. "Partial escape analysis and scalar replacement for Java." Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization. 2014. thanks, --lx From duke at openjdk.org Thu Oct 6 19:03:47 2022 From: duke at openjdk.org (Zhiqiang Zang) Date: Thu, 6 Oct 2022 19:03:47 GMT Subject: RFR: 8281453: New optimization: convert `~x` into `-1-x` when `~x` is used in an arithmetic expression [v9] In-Reply-To: <9GCtDsjTHPI87UwIzQs_vlu9quyaVIOL788oEkuHHks=.abeaebdb-34f9-49e1-a68b-8e5b974fb109@github.com> References: <4mTZu0_hVWb-ztMxMabFilyXAnAqOStCvU9wPmfqCKM=.fa8b7797-6e20-4c9e-80f1-b55ba3d5fe39@github.com> <0mZU-PW2DcYPutXuiyKN-fvtsSFk4QNSZuFBjec3ky4=.f5e045d2-c730-48c5-b069-b89450c2e672@github.com> <3yAr7RliyugB8jQqKxeHc99my3zO9RCCy31taABj4bo=.5b2391ec-7241-43cc-9f2f-e40d6f96f7b7@github.com> <9GCtDsjTHPI87UwIzQs_vlu9quyaVIOL788oEkuHHks=.abeaebdb-34f9-49e1-a68b-8e5b974fb109@github.com> Message-ID: On Thu, 29 Sep 2022 20:49:31 GMT, Zhiqiang Zang wrote: >> @CptGit It is due to the fact that GVN runs as soon as the frontend parses the code, during which the graph is incomplete, and you get `outcnt() == 0` because the uses of the node have not
been parsed yet. You can defer the transformation to IGVN, which happens later. Thanks. > > @merykitty I removed the use check. Does it look good to you? I did not include test `(x + y) & ~(x + y) => 0` because I found we do not have such idealization in `AndINode` because even `x & ~x => 0` is not supported. > @CptGit Did you update the performance numbers for the latest changes? Did they change? @vnkozlov Updated now. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/7376 From kvn at openjdk.org Thu Oct 6 20:16:35 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 6 Oct 2022 20:16:35 GMT Subject: RFR: 8281453: New optimization: convert `~x` into `-1-x` when `~x` is used in an arithmetic expression [v15] In-Reply-To: References: <4mTZu0_hVWb-ztMxMabFilyXAnAqOStCvU9wPmfqCKM=.fa8b7797-6e20-4c9e-80f1-b55ba3d5fe39@github.com> Message-ID: On Thu, 29 Sep 2022 20:52:41 GMT, Zhiqiang Zang wrote: >> Similar to `(~x)+c` -> `(c-1)-x` and `~(x+c)` -> `(-c-1)-x` in #6858, we can also introduce similar optimizations for subtraction, `c-(~x)` -> `x+(c+1)` and `~(c-x)` -> `x+(-c-1)`. >> >> To generalize, I convert `~x` into `-1-x` when `~x` is used only in arithmetic expression. For example, `c-(~x)` will be converted into `c-(-1-x)` which will match other pattern and will be transformed again in next iteration and finally become `x+(c+1)`. >> >> Also the conversion from `~x` into `-1-x` happens when `x` is an arithmetic expression itself. For example, `~(x+c)` will be transformed into `-1-(x+c)` and eventually `(-c-1)-x`. >> >> The results of the microbenchmark are as follows: >> >> Baseline: >> Benchmark Mode Cnt Score Error Units >> NotOpTransformation.baselineInt avgt 60 0.440 ± 0.002 ns/op >> NotOpTransformation.baselineLong avgt 60 0.440 ± 0.001 ns/op >> NotOpTransformation.testInt1 avgt 60 0.613 ± 0.006 ns/op >> NotOpTransformation.testInt2 avgt 60 0.868 ± 0.036 ns/op >> NotOpTransformation.testLong1 avgt 60 0.674 ±
0.008 ns/op >> NotOpTransformation.testLong2 avgt 60 0.698 ± 0.006 ns/op >> >> Patch: >> Benchmark Mode Cnt Score Error Units >> NotOpTransformation.baselineInt avgt 60 0.440 ± 0.001 ns/op >> NotOpTransformation.baselineLong avgt 60 0.440 ± 0.001 ns/op >> NotOpTransformation.testInt1 avgt 60 0.329 ± 0.001 ns/op >> NotOpTransformation.testInt2 avgt 60 0.329 ± 0.001 ns/op >> NotOpTransformation.testLong1 avgt 60 0.329 ± 0.001 ns/op >> NotOpTransformation.testLong2 avgt 60 0.329 ± 0.001 ns/op > > Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: > > remove one use check for long as well. Good. It is ready for integration. ------------- PR: https://git.openjdk.org/jdk/pull/7376 From dlong at openjdk.org Thu Oct 6 21:36:37 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 6 Oct 2022 21:36:37 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v4] In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 13:10:08 GMT, Quan Anh Mai wrote: >> src/hotspot/cpu/x86/peephole_x86.cpp line 28: >> >>> 26: #ifdef COMPILER2 >>> 27: >>> 28: #include "opto/peephole.hpp" >> >> I don't see why opto/peephole.hpp is useful. Why not just include peephole_x86.hpp? Then the empty peephole_<arch>.hpp for the other platforms are no longer needed. > > `opto/peephole.hpp` is needed from the generated `ad_x86_peephole.cpp` so that `addI_rRegNode::peephole` can call the helper functions. How about including `peephole_<arch>.hpp` only when "peepprocedure" is seen, and delete opto/peephole.hpp and empty `peephole_<arch>.hpp` files? >> src/hotspot/cpu/x86/peephole_x86.cpp line 50: >> >>> 48: inst1 = inst0->in(1)->as_Mach(); >>> 49: src1 = in; >>> 50: } >> >> I don't understand why this optimization requires MachSpillCopy. Is that the only time we should see mov+add or mov+shift? > > Yes, `MachSpillCopy` is the node inserted by the register allocator to move values around. 
In this occasion, a move is inserted because the live range of the input overlaps with the live range of the output of an add or shift instruction. OK. >> src/hotspot/cpu/x86/peephole_x86.cpp line 132: >> >>> 130: cfg_->map_node_to_block(proj, nullptr); >>> 131: cfg_->map_node_to_block(root, block); >>> 132: >> >> A lot of this seems like boiler-plate that could be refactored to make writing new peephole helpers simpler and less error-prone. > > While it seems to be a little boiler-plate, I think a general helper that removes, inserts an arbitrary number of nodes, and connects the graph correctly is hard to write. What do you think? Thanks a lot. How does it work when there is no peepprocedure? I would expect that all the boiler-plate details are taken care of by peepreplace. ------------- PR: https://git.openjdk.org/jdk/pull/8025 From kvn at openjdk.org Thu Oct 6 21:39:34 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 6 Oct 2022 21:39:34 GMT Subject: RFR: 8281453: New optimization: convert `~x` into `-1-x` when `~x` is used in an arithmetic expression [v9] In-Reply-To: References: <4mTZu0_hVWb-ztMxMabFilyXAnAqOStCvU9wPmfqCKM=.fa8b7797-6e20-4c9e-80f1-b55ba3d5fe39@github.com> <0mZU-PW2DcYPutXuiyKN-fvtsSFk4QNSZuFBjec3ky4=.f5e045d2-c730-48c5-b069-b89450c2e672@github.com> <3yAr7RliyugB8jQqKxeHc99my3zO9RCCy31taABj4bo=.5b2391ec-7241-43cc-9f2f-e40d6f96f7b7@github.com> <9GCtDsjTHPI87UwIzQs_vlu9quyaVIOL788oEkuHHks=.abeaebdb-34f9-49e1-a68b-8e5b974fb109@github.com> Message-ID: On Thu, 6 Oct 2022 19:01:15 GMT, Zhiqiang Zang wrote: >> @merykitty I removed the use check. Does it look good to you? I did not include test `(x + y) & ~(x + y) => 0` because I found we do not have such idealization in `AndINode` because even `x & ~x => 0` is not supported. > >> @CptGit Did you updated performance numbers for latest changes? Did they change? > > @vnkozlov Updated now. Thanks. > @CptGit This pull request has not yet been marked as ready for integration. 
I think, it is because PR title does not match JBS entry title. ------------- PR: https://git.openjdk.org/jdk/pull/7376 From igor.veresov at oracle.com Thu Oct 6 22:00:37 2022 From: igor.veresov at oracle.com (Igor Veresov) Date: Thu, 6 Oct 2022 22:00:37 +0000 Subject: RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: References: Message-ID: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> Hi, You say that when you materialize the clone you plan to have the same jvm state as the original allocation. How is that possible in a general case? There can be arbitrary changes of state between the original allocation point and where the clone materializes. Igor > On Oct 6, 2022, at 10:42 AM, Liu, Xin wrote: > > Hi, > > We would like to pursuit PEA in HotSpot. I spent time thinking how to > adapt Stadler's Partial Escape Analysis[1] to C2. I think there are 3 > elements in it. 1) flow-sensitive escape analysis 2) lazy code motion > for the allocation and initialization 3) on-the-fly scalar replacement. > The most complex part is 3) and it has done by C2. I'd like to leverage > that, so I come up an idea to focus only on escaped objects in the > algorithm and delegate others to the existing C2 phases. Here is my RFC. > May I get your precious time on this? > > https://gist.github.com/navyxliu/62a510a5c6b0245164569745d758935b#rfc-partial-escape-analysis-in-hotspot-c2 > > The idea is based on the following two observations. > > 1. Stadler's PEA can cooperate with C2 EA/SR. > > If an object moves to the place it is about to escape, it won't impact > C2 EA/SR later. It's because it will be marked as 'GlobalEscaped'. C2 EA > won't do anything for it anyway. > > If PEA don't touch a non-escaped object, it won't change its > escapability. It can punt it to C2 EA/SR and the result is still same. > > > 2. The original AllocationNode is either dead or scalar replaceable > after Stadler's PEA. 
> > Stadler's algorithm virtualizes an allocation Node and materializes it > on demand. There are 2 places to materialize it. 1) the virtual object > is about to escape 2) MergeProcessor needs to merge an object and at > least one of its predecessor has materialized. MergeProcessor has to > materialize all virtual objects in other predecessors([1] 5.3, Merge nodes). > > We can prove the observation 2 using 'proof of contradiction' here. > Assume the original Allocation node is neither dead nor Scalar Replaced > after Stadler's PEA, and program is still correct. > > Program must need the original allocation node somewhere. The algorithm > has deleted the original allocation node in virtualization step and > never bring it back. It contradicts that the program is still correct. QED. > > > If you're convinced, then we can leverage it. In my design, I don't > virtualize the original node but just leave it there. C2 MacroExpand > phase will take care of the original allocation node as long as it's > either dead or scalar-replaceable. It never get a chance to expand. > > If we restrain on-the-fly scalar replacement in Stadler's PEA, we can > delegate it to C2 EA/SR! There are 3 gains: > > 1) I don't think I can write bug-free Scalar Replacement... > 2) This approach can automatically pick up C2 EA/SR improvements in the > future, such as JDK-8289943. > 3) If we focus only on 'escaped objects', we even don't need to deal > with deoptimization. Only 'scalar replaceable' objects need to save > Object states for deoptimization. Escaped objects disqualify that. > > [1]: Stadler, Lukas, Thomas Würthinger, and Hanspeter Mössenböck. > "Partial escape analysis and scalar replacement for Java." Proceedings > of Annual IEEE/ACM International Symposium on Code Generation and > Optimization. 2014.
> > thanks, > --lx > From duke at openjdk.org Thu Oct 6 22:19:33 2022 From: duke at openjdk.org (Zhiqiang Zang) Date: Thu, 6 Oct 2022 22:19:33 GMT Subject: Integrated: 8281453: New optimization: convert ~x into -1-x when ~x is used in an arithmetic expression In-Reply-To: <4mTZu0_hVWb-ztMxMabFilyXAnAqOStCvU9wPmfqCKM=.fa8b7797-6e20-4c9e-80f1-b55ba3d5fe39@github.com> References: <4mTZu0_hVWb-ztMxMabFilyXAnAqOStCvU9wPmfqCKM=.fa8b7797-6e20-4c9e-80f1-b55ba3d5fe39@github.com> Message-ID: On Tue, 8 Feb 2022 05:51:37 GMT, Zhiqiang Zang wrote: > Similar to `(~x)+c` -> `(c-1)-x` and `~(x+c)` -> `(-c-1)-x` in #6858, we can also introduce similar optimizations for subtraction, `c-(~x)` -> `x+(c+1)` and `~(c-x)` -> `x+(-c-1)`. > > To generalize, I convert `~x` into `-1-x` when `~x` is used only in arithmetic expression. For example, `c-(~x)` will be converted into `c-(-1-x)` which will match other pattern and will be transformed again in next iteration and finally become `x+(c+1)`. > > Also the conversion from `~x` into `-1-x` happens when `x` is an arithmetic expression itself. For example, `~(x+c)` will be transformed into `-1-(x+c)` and eventually `(-c-1)-x`. > > The results of the microbenchmark are as follows: > > Baseline: > Benchmark Mode Cnt Score Error Units > NotOpTransformation.baselineInt avgt 60 0.440 ± 0.002 ns/op > NotOpTransformation.baselineLong avgt 60 0.440 ± 0.001 ns/op > NotOpTransformation.testInt1 avgt 60 0.613 ± 0.006 ns/op > NotOpTransformation.testInt2 avgt 60 0.868 ± 0.036 ns/op > NotOpTransformation.testLong1 avgt 60 0.674 ± 0.008 ns/op > NotOpTransformation.testLong2 avgt 60 0.698 ± 0.006 ns/op > > Patch: > Benchmark Mode Cnt Score Error Units > NotOpTransformation.baselineInt avgt 60 0.440 ± 0.001 ns/op > NotOpTransformation.baselineLong avgt 60 0.440 ± 0.001 ns/op > NotOpTransformation.testInt1 avgt 60 0.329 ± 0.001 ns/op > NotOpTransformation.testInt2 avgt 60 0.329 ± 0.001 ns/op > NotOpTransformation.testLong1 avgt 60 0.329 ±
0.001 ns/op > NotOpTransformation.testLong2 avgt 60 0.329 ± 0.001 ns/op This pull request has now been integrated. Changeset: 5dd851d8 Author: Zhiqiang Zang Committer: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/5dd851d872c50ef33034c56007c58e6fa69ebd32 Stats: 957 lines in 7 files changed: 562 ins; 376 del; 19 mod 8281453: New optimization: convert ~x into -1-x when ~x is used in an arithmetic expression Reviewed-by: kvn ------------- PR: https://git.openjdk.org/jdk/pull/7376 From sviswanathan at openjdk.org Fri Oct 7 00:08:26 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 7 Oct 2022 00:08:26 GMT Subject: RFR: 8288043: Optimize FP to word/sub-word integral type conversion on X86 AVX2 platforms [v4] In-Reply-To: References: Message-ID: On Mon, 19 Sep 2022 14:15:19 GMT, Jatin Bhateja wrote: >> Could you please enable the compiler/vectorapi/VectorFPtoIntCastTest.java for AVX2 platforms? >> Currently they are only run for AVX512DQ platforms. > >> Could you please enable the compiler/vectorapi/VectorFPtoIntCastTest.java for AVX2 platforms? Currently they are only run for AVX512DQ platforms. > > I have added missing casting cases for AVX/AVX2 and AVX512 targets in existing comprehensive test for [casting](test/hotspot/jtreg/compiler/vectorapi/reshape/utils/TestCastMethods.java.) @jatin-bhateja Rest of the changes look good to me. Mainly the vector_op_pre_select_sz_estimate() needs to be corrected. ------------- PR: https://git.openjdk.org/jdk/pull/9748 From xxinliu at amazon.com Fri Oct 7 00:09:22 2022 From: xxinliu at amazon.com (Liu, Xin) Date: Thu, 6 Oct 2022 17:09:22 -0700 Subject: RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> References: Message-ID: hi, Igor, You are right. Cloning the JVMState of original Allocation Node isn't the correct behavior. I need the JVMState right at materialization.
I think it is available because we are in parser. For 2 places of materialization: 1) we are handling the bytecode which causes the object to escape. It's probably putfield/return/invoke. Current JVMState it is. 2) we are in MergeProcessor. We need to materialize a virtual object in its predecessors. We can extract the exiting JVMState from the predecessor Block. I just realize maybe that's the one of the reasons Graal saves 'FrameState' at store nodes. Graal needs to revisit the 'FrameState' when its PEA phase does materialization in high-tier. Apart from safepoint, there's one corner case bothering me. JLS says that creation of a class instance may throw an OOME.(https://docs.oracle.com/javase/specs/jls/se19/html/jls-15.html#jls-15.9.4) " space is allocated for the new class instance. If there is insufficient space to allocate the object, evaluation of the class instance creation expression completes abruptly by throwing an OutOfMemoryError. " and it's cross-referenced by bytecode new in JVMS https://docs.oracle.com/javase/specs/jvms/se19/html/jvms-6.html#jvms-6.5.new If we have moved the Allocation Node and JVM happens to run out of memory, the first frame of stacktrace will drift a little bit, right? The bci and source linenum will be wrong. Does it matter? I can't imagine that user's programs rely on this information. I think it's possible to amend this bci/line number in JVMState level. I will leave it as an open question and revisit it later. Do I understand your concern? if it makes sense to you, I will update the RFC doc. thanks, --lx On 10/6/22 3:00 PM, Igor Veresov wrote: > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > Hi, > > You say that when you materialize the clone you plan to have the same jvm state as the original allocation. How is that possible in a general case? 
There can be arbitrary changes of state between the original allocation point and where the clone materializes. > > Igor > >> On Oct 6, 2022, at 10:42 AM, Liu, Xin wrote: >> >> Hi, >> >> We would like to pursuit PEA in HotSpot. I spent time thinking how to >> adapt Stadler's Partial Escape Analysis[1] to C2. I think there are 3 >> elements in it. 1) flow-sensitive escape analysis 2) lazy code motion >> for the allocation and initialization 3) on-the-fly scalar replacement. >> The most complex part is 3) and it has done by C2. I'd like to leverage >> that, so I come up an idea to focus only on escaped objects in the >> algorithm and delegate others to the existing C2 phases. Here is my RFC. >> May I get your precious time on this? >> >> https://gist.github.com/navyxliu/62a510a5c6b0245164569745d758935b#rfc-partial-escape-analysis-in-hotspot-c2 >> >> The idea is based on the following two observations. >> >> 1. Stadler's PEA can cooperate with C2 EA/SR. >> >> If an object moves to the place it is about to escape, it won't impact >> C2 EA/SR later. It's because it will be marked as 'GlobalEscaped'. C2 EA >> won't do anything for it anyway. >> >> If PEA don't touch a non-escaped object, it won't change its >> escapability. It can punt it to C2 EA/SR and the result is still same. >> >> >> 2. The original AllocationNode is either dead or scalar replaceable >> after Stadler's PEA. >> >> Stadler's algorithm virtualizes an allocation Node and materializes it >> on demand. There are 2 places to materialize it. 1) the virtual object >> is about to escape 2) MergeProcessor needs to merge an object and at >> least one of its predecessor has materialized. MergeProcessor has to >> materialize all virtual objects in other predecessors([1] 5.3, Merge nodes). >> >> We can prove the observation 2 using 'proof of contradiction' here. >> Assume the original Allocation node is neither dead nor Scalar Replaced >> after Stadler's PEA, and program is still correct. 
>> >> Program must need the original allocation node somewhere. The algorithm >> has deleted the original allocation node in virtualization step and >> never bring it back. It contradicts that the program is still correct. QED. >> >> >> If you're convinced, then we can leverage it. In my design, I don't >> virtualize the original node but just leave it there. C2 MacroExpand >> phase will take care of the original allocation node as long as it's >> either dead or scalar-replaceable. It never get a chance to expand. >> >> If we restrain on-the-fly scalar replacement in Stadler's PEA, we can >> delegate it to C2 EA/SR! There are 3 gains: >> >> 1) I don't think I can write bug-free Scalar Replacement... >> 2) This approach can automatically pick up C2 EA/SR improvements in the >> future, such as JDK-8289943. >> 3) If we focus only on 'escaped objects', we even don't need to deal >> with deoptimization. Only 'scalar replaceable' objects need to save >> Object states for deoptimization. Escaped objects disqualify that. >> >> [1]: Stadler, Lukas, Thomas Würthinger, and Hanspeter Mössenböck. >> "Partial escape analysis and scalar replacement for Java." Proceedings >> of Annual IEEE/ACM International Symposium on Code Generation and >> Optimization. 2014. >> >> thanks, >> --lx >> > From vladimir.kozlov at oracle.com Fri Oct 7 00:46:38 2022 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 6 Oct 2022 17:46:38 -0700 Subject: [EXTERNAL] RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: References: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> Message-ID: <9b7d753b-5f6b-446b-1391-129ad04a3551@oracle.com> On 10/6/22 5:09 PM, Liu, Xin wrote: > hi, Igor, > > You are right. Cloning the JVMState of original Allocation Node isn't > the correct behavior. I need the JVMState right at materialization. I > think it is available because we are in parser. For 2 places of > materialization: > 1) we are handling the bytecode which causes the object to escape. It's > probably putfield/return/invoke. Current JVMState it is. > 2) we are in MergeProcessor. We need to materialize a virtual object in > its predecessors. We can extract the exiting JVMState from the > predecessor Block. > > I just realize maybe that's the one of the reasons Graal saves > 'FrameState' at store nodes. Graal needs to revisit the 'FrameState' > when its PEA phase does materialization in high-tier. > > Apart from safepoint, there's one corner case bothering me. JLS says > that creation of a class instance may throw an > OOME.(https://docs.oracle.com/javase/specs/jls/se19/html/jls-15.html#jls-15.9.4) > > " > space is allocated for the new class instance. If there is insufficient > space to allocate the object, evaluation of the class instance creation > expression completes abruptly by throwing an OutOfMemoryError. > " > > and it's cross-referenced by bytecode new in JVMS > https://docs.oracle.com/javase/specs/jvms/se19/html/jvms-6.html#jvms-6.5.new > > If we have moved the Allocation Node and JVM happens to run out of > memory, the first frame of stacktrace will drift a little bit, right? > The bci and source linenum will be wrong. Does it matter?
I can't > imagine that user's programs rely on this information. This is not new [1]. C2 EA implementation has this OOM stacktrace "issue". Graal has it too. Thanks, Vladimir K [1] https://bugs.openjdk.org/browse/JDK-8063642 > > I think it's possible to amend this bci/line number in JVMState level. I > will leave it as an open question and revisit it later. > > Do I understand your concern? if it makes sense to you, I will update > the RFC doc. > > thanks, > --lx > > > > > On 10/6/22 3:00 PM, Igor Veresov wrote: >> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >> >> >> >> Hi, >> >> You say that when you materialize the clone you plan to have the same jvm state as the original allocation. How is that possible in a general case? There can be arbitrary changes of state between the original allocation point and where the clone materializes. >> >> Igor >> >>> On Oct 6, 2022, at 10:42 AM, Liu, Xin wrote: >>> >>> Hi, >>> >>> We would like to pursuit PEA in HotSpot. I spent time thinking how to >>> adapt Stadler's Partial Escape Analysis[1] to C2. I think there are 3 >>> elements in it. 1) flow-sensitive escape analysis 2) lazy code motion >>> for the allocation and initialization 3) on-the-fly scalar replacement. >>> The most complex part is 3) and it has done by C2. I'd like to leverage >>> that, so I come up an idea to focus only on escaped objects in the >>> algorithm and delegate others to the existing C2 phases. Here is my RFC. >>> May I get your precious time on this? >>> >>> https://gist.github.com/navyxliu/62a510a5c6b0245164569745d758935b#rfc-partial-escape-analysis-in-hotspot-c2 >>> >>> The idea is based on the following two observations. >>> >>> 1. Stadler's PEA can cooperate with C2 EA/SR. >>> >>> If an object moves to the place it is about to escape, it won't impact >>> C2 EA/SR later. It's because it will be marked as 'GlobalEscaped'. 
C2 EA >>> won't do anything for it anyway. >>> >>> If PEA don't touch a non-escaped object, it won't change its >>> escapability. It can punt it to C2 EA/SR and the result is still same. >>> >>> >>> 2. The original AllocationNode is either dead or scalar replaceable >>> after Stadler's PEA. >>> >>> Stadler's algorithm virtualizes an allocation Node and materializes it >>> on demand. There are 2 places to materialize it. 1) the virtual object >>> is about to escape 2) MergeProcessor needs to merge an object and at >>> least one of its predecessor has materialized. MergeProcessor has to >>> materialize all virtual objects in other predecessors([1] 5.3, Merge nodes). >>> >>> We can prove the observation 2 using 'proof of contradiction' here. >>> Assume the original Allocation node is neither dead nor Scalar Replaced >>> after Stadler's PEA, and program is still correct. >>> >>> Program must need the original allocation node somewhere. The algorithm >>> has deleted the original allocation node in virtualization step and >>> never bring it back. It contradicts that the program is still correct. QED. >>> >>> >>> If you're convinced, then we can leverage it. In my design, I don't >>> virtualize the original node but just leave it there. C2 MacroExpand >>> phase will take care of the original allocation node as long as it's >>> either dead or scalar-replaceable. It never get a chance to expand. >>> >>> If we restrain on-the-fly scalar replacement in Stadler's PEA, we can >>> delegate it to C2 EA/SR! There are 3 gains: >>> >>> 1) I don't think I can write bug-free Scalar Replacement... >>> 2) This approach can automatically pick up C2 EA/SR improvements in the >>> future, such as JDK-8289943. >>> 3) If we focus only on 'escaped objects', we even don't need to deal >>> with deoptimization. Only 'scalar replaceable' objects need to save >>> Object states for deoptimization. Escaped objects disqualify that. 
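To make the partially escaping case quoted above concrete, here is a minimal, hypothetical Java shape of the problem (class and field names are invented; this is not code from the RFC): the object escapes only on a rare branch, so flow-insensitive C2 EA marks it GlobalEscaped everywhere, while a flow-sensitive PEA could sink the allocation into the branch and keep the common path allocation-free.

```java
// Hypothetical illustration of a partially escaping allocation.
// 'sink' is an invented static field used only to force an escape.
public class PartialEscape {
    static Object sink;

    static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static int sum(boolean rare) {
        Point p = new Point(1, 2);  // allocated on every call today
        if (rare) {
            sink = p;               // p escapes only on this branch;
            return -1;              // PEA would materialize p here
        }
        return p.x + p.y;           // common path: p never escapes and
                                    // could be scalar replaced
    }
}
```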
>>> >>> [1]: Stadler, Lukas, Thomas W?rthinger, and Hanspeter M?ssenb?ck. >>> "Partial escape analysis and scalar replacement for Java." Proceedings >>> of Annual IEEE/ACM International Symposium on Code Generation and >>> Optimization. 2014. >>> >>> thanks, >>> --lx >>> From xgong at openjdk.org Fri Oct 7 05:40:30 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 7 Oct 2022 05:40:30 GMT Subject: RFR: 8292898: [vectorapi] Unify vector mask cast operation [v5] In-Reply-To: <5qCRwvjc4xVc8ub7swVXg2XibftsERW4kU_ElQnsFz0=.a79cd2cd-8dd6-4471-96cf-cd74922e085d@github.com> References: <9DHNZglc1nl35eN3euIu6naGNrE0TK8BC3Pqo8nO8-k=.b6453f1f-8af3-47d5-86fd-3268da5e5347@github.com> <5qCRwvjc4xVc8ub7swVXg2XibftsERW4kU_ElQnsFz0=.a79cd2cd-8dd6-4471-96cf-cd74922e085d@github.com> Message-ID: On Tue, 4 Oct 2022 06:05:27 GMT, Jatin Bhateja wrote: > > Hi @jatin-bhateja , the IR test has been added. Could you please help to review again? Thanks a lot! > > Some of the IR tests like testByte64ToLong512 are currently are failing on KNL due to following check https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L2484 > > since source and destination ideal types are different (TypeVect vs TypeVectMask), can you kindly change the feature check for relevant IR tests to avx512vl till we remove that limitation. Thanks for pointing out this issue. Sure, I will limit the feature check to "avx512vl" for all the 512 bits related casting. BTW, could you please show me how to run the test with KNL feature? So that I can have an internal test before pushing the changes. Thanks a lot! 
------------- PR: https://git.openjdk.org/jdk/pull/10192 From yadongwang at openjdk.org Fri Oct 7 09:04:33 2022 From: yadongwang at openjdk.org (Yadong Wang) Date: Fri, 7 Oct 2022 09:04:33 GMT Subject: RFR: 8290154: [JVMCI] partially implement JVMCI for RISC-V [v12] In-Reply-To: References: Message-ID: <6xoiOoUbDCCRPGfYz4RIcaKp2pQAVtCbhH4qQgNU-7M=.9df9466b-5145-4aa9-a810-e2cd23b79bf6@github.com> On Thu, 6 Oct 2022 17:18:36 GMT, Sacha Coppey wrote: >> This patch adds a partial JVMCI implementation for RISC-V, to allow using the GraalVM Native Image RISC-V LLVM backend, which does not use JVMCI for code emission. >> It creates the jdk.vm.ci.riscv64 and jdk.vm.ci.hotspot.riscv64 packages, as well as implements a part of jvmciCodeInstaller_riscv64.cpp. To check for correctness, it enables JVMCI code installation tests on RISC-V. More testing is performed in Native Image. > > Sacha Coppey has updated the pull request incrementally with one additional commit since the last revision: > > Replace RegisterImpl and FloatRegisterImpl uses by Register and FloatRegister lgtm(not a reviewer) ------------- Marked as reviewed by yadongwang (Author). PR: https://git.openjdk.org/jdk/pull/9587 From duke at openjdk.org Fri Oct 7 09:29:56 2022 From: duke at openjdk.org (Sacha Coppey) Date: Fri, 7 Oct 2022 09:29:56 GMT Subject: RFR: 8290154: [JVMCI] partially implement JVMCI for RISC-V [v13] In-Reply-To: References: Message-ID: > This patch adds a partial JVMCI implementation for RISC-V, to allow using the GraalVM Native Image RISC-V LLVM backend, which does not use JVMCI for code emission. > It creates the jdk.vm.ci.riscv64 and jdk.vm.ci.hotspot.riscv64 packages, as well as implements a part of jvmciCodeInstaller_riscv64.cpp. To check for correctness, it enables JVMCI code installation tests on RISC-V. More testing is performed in Native Image. 
Sacha Coppey has updated the pull request incrementally with one additional commit since the last revision: Update a copyright header ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9587/files - new: https://git.openjdk.org/jdk/pull/9587/files/225fee42..e7913fad Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9587&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9587&range=11-12 Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9587.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9587/head:pull/9587 PR: https://git.openjdk.org/jdk/pull/9587 From duke at openjdk.org Fri Oct 7 13:12:54 2022 From: duke at openjdk.org (Sacha Coppey) Date: Fri, 7 Oct 2022 13:12:54 GMT Subject: Integrated: 8290154: [JVMCI] partially implement JVMCI for RISC-V In-Reply-To: References: Message-ID: On Thu, 21 Jul 2022 10:18:05 GMT, Sacha Coppey wrote: > This patch adds a partial JVMCI implementation for RISC-V, to allow using the GraalVM Native Image RISC-V LLVM backend, which does not use JVMCI for code emission. > It creates the jdk.vm.ci.riscv64 and jdk.vm.ci.hotspot.riscv64 packages, as well as implements a part of jvmciCodeInstaller_riscv64.cpp. To check for correctness, it enables JVMCI code installation tests on RISC-V. More testing is performed in Native Image. This pull request has now been integrated. 
Changeset: 7a194d31 Author: Sacha Coppey Committer: Doug Simon URL: https://git.openjdk.org/jdk/commit/7a194d31a3f2f78211f035f4591845bf2b465aec Stats: 1722 lines in 20 files changed: 1700 ins; 0 del; 22 mod 8290154: [JVMCI] partially implement JVMCI for RISC-V Reviewed-by: ihse, dnsimon, yadongwang ------------- PR: https://git.openjdk.org/jdk/pull/9587 From qamai at openjdk.org Fri Oct 7 15:08:11 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 7 Oct 2022 15:08:11 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v4] In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 21:31:40 GMT, Dean Long wrote: >> While it seems to be a little boiler-plate, I think a general helper that removes, inserts an arbitrary number of nodes, and connects the graph correctly is hard to write. What do you think? Thanks a lot. > > How does it work when there is no peepprocedure? I would expect that all the boiler-plate details are taken care of by peepreplace. The code produced by `peepreplace` is really limited, it cannot be applied in this case because both the `MachSpillCopy` and `MachProj` are not match rules. Also, I think a generated code is more powerful than a helper function, so achieving something general with a helper function would be hard. I think we can refactor it into a helper if there are similarities arise later instead. What do you think? ------------- PR: https://git.openjdk.org/jdk/pull/8025 From qamai at openjdk.org Fri Oct 7 15:11:24 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 7 Oct 2022 15:11:24 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v4] In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 21:33:38 GMT, Dean Long wrote: >> `opto/peephole.hpp` is needed from the generated `ad_x86_peephole.cpp` so that `addI_rRegNode::peephole` can call the helper functions. 
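As background for readers following the peephole thread above: stripped of all AD-file machinery, a peephole pass scans adjacent instructions for a pattern and rewrites the window in place. A toy, self-contained sketch in Java (strings stand in for MachNodes; this is not the AD-generated matcher code being discussed):

```java
import java.util.ArrayList;
import java.util.List;

// Toy peephole: drop "mov b,a" when it immediately follows "mov a,b",
// since the pair leaves both registers unchanged after the first move.
public class Peephole {
    static List<String> run(List<String> code) {
        List<String> out = new ArrayList<>();
        for (String insn : code) {
            int n = out.size();
            if (n > 0 && isRedundantPair(out.get(n - 1), insn)) {
                continue; // second move of the pair is redundant
            }
            out.add(insn);
        }
        return out;
    }

    static boolean isRedundantPair(String a, String b) {
        String[] pa = parseMov(a), pb = parseMov(b);
        return pa != null && pb != null
            && pa[0].equals(pb[1]) && pa[1].equals(pb[0]);
    }

    // "mov dst,src" -> {dst, src}, or null if not a move.
    static String[] parseMov(String s) {
        if (!s.startsWith("mov ")) return null;
        String[] ops = s.substring(4).split(",");
        return ops.length == 2
            ? new String[] { ops[0].trim(), ops[1].trim() } : null;
    }
}
```

The real mechanism differs in that it must also re-wire graph edges and kill projections, which is exactly why the thread discusses hand-written `peepprocedure` bodies rather than purely declarative `peepreplace` rules.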
> > How about including `peephole_.hpp` only when "peepprocedure" is seen, and delete opto/peephole.hpp and empty `peephole_.hpp` files? What do you think if I include `peephole_x86.hpp` in `x86_64.ad` in a source hpp block? This will result in the include appearing in `ad_x86.hpp`, which will be transitively included in `ad_x86_peephole.cpp`. Thanks a lot. ------------- PR: https://git.openjdk.org/jdk/pull/8025 From xxinliu at amazon.com Fri Oct 7 17:28:20 2022 From: xxinliu at amazon.com (Liu, Xin) Date: Fri, 7 Oct 2022 10:28:20 -0700 Subject: RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: <9b7d753b-5f6b-446b-1391-129ad04a3551@oracle.com> References: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> <9b7d753b-5f6b-446b-1391-129ad04a3551@oracle.com> Message-ID: <5c06af38-5efb-338f-b3aa-4f356c2bdcd6@amazon.com> hi, Vladimir Kozlov, Thank you for the head-up. I don't have permission to access the issue. https://bugs.openjdk.org/browse/JDK-8063642 Maybe it is still confidential? I even can't find a clue in git log of openjdk repo. I will try to figure out how C2 EA handles this case. Update from yesterday: I guess it may be more than just OOME. JFR event AllocationInNewTLAB probably dumps the stacktrace at allocation-site in the same way. The drift may confuse Java developers who are profiling allocation. thanks, --lx On 10/6/22 5:46 PM, Vladimir Kozlov wrote: > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > On 10/6/22 5:09 PM, Liu, Xin wrote: >> hi, Ignor, >> >> You are right. Cloning the JVMState of original Allocation Node isn't >> the correct behavior. I need the JVMState right at materialization. I >> think it is available because we are in parser. For 2 places of >> materialization: >> 1) we are handling the bytecode which causes the object to escape. It's >> probably putfield/return/invoke. Current JVMState it is. 
>> 2) we are in MergeProcessor. We need to materialize a virtual object in >> its predecessors. We can extract the exiting JVMState from the >> predecessor Block. >> >> I just realize maybe that's the one of the reasons Graal saves >> 'FrameState' at store nodes. Graal needs to revisit the 'FrameState' >> when its PEA phase does materialization in high-tier. >> >> Apart from safepoint, there's one corner case bothering me. JLS says >> that creation of a class instance may throw an >> OOME.(https://docs.oracle.com/javase/specs/jls/se19/html/jls-15.html#jls-15.9.4) >> >> " >> space is allocated for the new class instance. If there is insufficient >> space to allocate the object, evaluation of the class instance creation >> expression completes abruptly by throwing an OutOfMemoryError. >> " >> >> and it's cross-referenced by bytecode new in JVMS >> https://docs.oracle.com/javase/specs/jvms/se19/html/jvms-6.html#jvms-6.5.new >> >> If we have moved the Allocation Node and JVM happens to run out of >> memory, the first frame of stacktrace will drift a little bit, right? >> The bci and source linenum will be wrong. Does it matter? I can't >> imagine that user's programs rely on this information. > > This is not new [1]. C2 EA implementation has this OOM stacktrace "issue". Graal has it too. > > Thanks, > Vladimir K > > [1] https://bugs.openjdk.org/browse/JDK-8063642 > >> >> I think it's possible to amend this bci/line number in JVMState level. I >> will leave it as an open question and revisit it later. >> >> Do I understand your concern? if it makes sense to you, I will update >> the RFC doc. >> >> thanks, >> --lx >> >> >> >> >> On 10/6/22 3:00 PM, Igor Veresov wrote: >>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >>> >>> >>> >>> Hi, >>> >>> You say that when you materialize the clone you plan to have the same jvm state as the original allocation. 
How is that possible in a general case? There can be arbitrary changes of state between the original allocation point and where the clone materializes. >>> >>> Igor >>> >>>> On Oct 6, 2022, at 10:42 AM, Liu, Xin wrote: >>>> >>>> Hi, >>>> >>>> We would like to pursuit PEA in HotSpot. I spent time thinking how to >>>> adapt Stadler's Partial Escape Analysis[1] to C2. I think there are 3 >>>> elements in it. 1) flow-sensitive escape analysis 2) lazy code motion >>>> for the allocation and initialization 3) on-the-fly scalar replacement. >>>> The most complex part is 3) and it has done by C2. I'd like to leverage >>>> that, so I come up an idea to focus only on escaped objects in the >>>> algorithm and delegate others to the existing C2 phases. Here is my RFC. >>>> May I get your precious time on this? >>>> >>>> https://gist.github.com/navyxliu/62a510a5c6b0245164569745d758935b#rfc-partial-escape-analysis-in-hotspot-c2 >>>> >>>> The idea is based on the following two observations. >>>> >>>> 1. Stadler's PEA can cooperate with C2 EA/SR. >>>> >>>> If an object moves to the place it is about to escape, it won't impact >>>> C2 EA/SR later. It's because it will be marked as 'GlobalEscaped'. C2 EA >>>> won't do anything for it anyway. >>>> >>>> If PEA don't touch a non-escaped object, it won't change its >>>> escapability. It can punt it to C2 EA/SR and the result is still same. >>>> >>>> >>>> 2. The original AllocationNode is either dead or scalar replaceable >>>> after Stadler's PEA. >>>> >>>> Stadler's algorithm virtualizes an allocation Node and materializes it >>>> on demand. There are 2 places to materialize it. 1) the virtual object >>>> is about to escape 2) MergeProcessor needs to merge an object and at >>>> least one of its predecessor has materialized. MergeProcessor has to >>>> materialize all virtual objects in other predecessors([1] 5.3, Merge nodes). >>>> >>>> We can prove the observation 2 using 'proof of contradiction' here. 
>>>> Assume the original Allocation node is neither dead nor Scalar Replaced >>>> after Stadler's PEA, and program is still correct. >>>> >>>> Program must need the original allocation node somewhere. The algorithm >>>> has deleted the original allocation node in virtualization step and >>>> never bring it back. It contradicts that the program is still correct. QED. >>>> >>>> >>>> If you're convinced, then we can leverage it. In my design, I don't >>>> virtualize the original node but just leave it there. C2 MacroExpand >>>> phase will take care of the original allocation node as long as it's >>>> either dead or scalar-replaceable. It never get a chance to expand. >>>> >>>> If we restrain on-the-fly scalar replacement in Stadler's PEA, we can >>>> delegate it to C2 EA/SR! There are 3 gains: >>>> >>>> 1) I don't think I can write bug-free Scalar Replacement... >>>> 2) This approach can automatically pick up C2 EA/SR improvements in the >>>> future, such as JDK-8289943. >>>> 3) If we focus only on 'escaped objects', we even don't need to deal >>>> with deoptimization. Only 'scalar replaceable' objects need to save >>>> Object states for deoptimization. Escaped objects disqualify that. >>>> >>>> [1]: Stadler, Lukas, Thomas W?rthinger, and Hanspeter M?ssenb?ck. >>>> "Partial escape analysis and scalar replacement for Java." Proceedings >>>> of Annual IEEE/ACM International Symposium on Code Generation and >>>> Optimization. 2014. >>>> >>>> thanks, >>>> --lx >>>> -------------- next part -------------- A non-text attachment was scrubbed... Name: OpenPGP_0xB9D934C61E047B0D.asc Type: application/pgp-keys Size: 3675 bytes Desc: OpenPGP public key URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: OpenPGP_signature Type: application/pgp-signature Size: 665 bytes Desc: OpenPGP digital signature URL: From igor.veresov at oracle.com Fri Oct 7 17:37:37 2022 From: igor.veresov at oracle.com (Igor Veresov) Date: Fri, 7 Oct 2022 17:37:37 +0000 Subject: [External] : Re: RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: References: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> Message-ID: <94A2737B-0FDE-4159-827F-15B6ACE1BC42@oracle.com> The major difference between Graal and C2 is that graal captures the state at side effects and C2 captures the state at deopt points. That allows Graal to deduce state at any time, including when it needs to insert a rematerializing allocation during PEA. So, with C2 you have to either do everything in the parser as you are proposing or do the same thing as Graal and at least capture the state for stores. Having a state different from the original allocation point is ok. Both Graal and C2 would throw OOMs from place that could be far from the original point because of the EA. I think you also have to track the values of all of the object components, right? So when you rematerialize the object, it consumes the current updated values to construct it. How to you intend to track those? igor > On Oct 6, 2022, at 5:09 PM, Liu, Xin wrote: > > hi, Ignor, > > You are right. Cloning the JVMState of original Allocation Node isn't > the correct behavior. I need the JVMState right at materialization. I > think it is available because we are in parser. For 2 places of > materialization: > 1) we are handling the bytecode which causes the object to escape. It's > probably putfield/return/invoke. Current JVMState it is. > 2) we are in MergeProcessor. We need to materialize a virtual object in > its predecessors. We can extract the exiting JVMState from the > predecessor Block. > > I just realize maybe that's the one of the reasons Graal saves > 'FrameState' at store nodes. 
Graal needs to revisit the 'FrameState' > when its PEA phase does materialization in high-tier. > > Apart from safepoint, there's one corner case bothering me. JLS says > that creation of a class instance may throw an > OOME.(https://docs.oracle.com/javase/specs/jls/se19/html/jls-15.html#jls-15.9.4) > > " > space is allocated for the new class instance. If there is insufficient > space to allocate the object, evaluation of the class instance creation > expression completes abruptly by throwing an OutOfMemoryError. > " > > and it's cross-referenced by bytecode new in JVMS > https://docs.oracle.com/javase/specs/jvms/se19/html/jvms-6.html#jvms-6.5.new > > If we have moved the Allocation Node and JVM happens to run out of > memory, the first frame of stacktrace will drift a little bit, right? > The bci and source linenum will be wrong. Does it matter? I can't > imagine that user's programs rely on this information. > > I think it's possible to amend this bci/line number in JVMState level. I > will leave it as an open question and revisit it later. > > Do I understand your concern? if it makes sense to you, I will update > the RFC doc. > > thanks, > --lx > > > > > On 10/6/22 3:00 PM, Igor Veresov wrote: >> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >> >> >> >> Hi, >> >> You say that when you materialize the clone you plan to have the same jvm state as the original allocation. How is that possible in a general case? There can be arbitrary changes of state between the original allocation point and where the clone materializes. >> >> Igor >> >>> On Oct 6, 2022, at 10:42 AM, Liu, Xin wrote: >>> >>> Hi, >>> >>> We would like to pursuit PEA in HotSpot. I spent time thinking how to >>> adapt Stadler's Partial Escape Analysis[1] to C2. I think there are 3 >>> elements in it. 
1) flow-sensitive escape analysis 2) lazy code motion >>> for the allocation and initialization 3) on-the-fly scalar replacement. >>> The most complex part is 3) and it has done by C2. I'd like to leverage >>> that, so I come up an idea to focus only on escaped objects in the >>> algorithm and delegate others to the existing C2 phases. Here is my RFC. >>> May I get your precious time on this? >>> >>> https://gist.github.com/navyxliu/62a510a5c6b0245164569745d758935b#rfc-partial-escape-analysis-in-hotspot-c2 >>> >>> The idea is based on the following two observations. >>> >>> 1. Stadler's PEA can cooperate with C2 EA/SR. >>> >>> If an object moves to the place it is about to escape, it won't impact >>> C2 EA/SR later. It's because it will be marked as 'GlobalEscaped'. C2 EA >>> won't do anything for it anyway. >>> >>> If PEA don't touch a non-escaped object, it won't change its >>> escapability. It can punt it to C2 EA/SR and the result is still same. >>> >>> >>> 2. The original AllocationNode is either dead or scalar replaceable >>> after Stadler's PEA. >>> >>> Stadler's algorithm virtualizes an allocation Node and materializes it >>> on demand. There are 2 places to materialize it. 1) the virtual object >>> is about to escape 2) MergeProcessor needs to merge an object and at >>> least one of its predecessor has materialized. MergeProcessor has to >>> materialize all virtual objects in other predecessors([1] 5.3, Merge nodes). >>> >>> We can prove the observation 2 using 'proof of contradiction' here. >>> Assume the original Allocation node is neither dead nor Scalar Replaced >>> after Stadler's PEA, and program is still correct. >>> >>> Program must need the original allocation node somewhere. The algorithm >>> has deleted the original allocation node in virtualization step and >>> never bring it back. It contradicts that the program is still correct. QED. >>> >>> >>> If you're convinced, then we can leverage it. 
In my design, I don't >>> virtualize the original node but just leave it there. C2 MacroExpand >>> phase will take care of the original allocation node as long as it's >>> either dead or scalar-replaceable. It never get a chance to expand. >>> >>> If we restrain on-the-fly scalar replacement in Stadler's PEA, we can >>> delegate it to C2 EA/SR! There are 3 gains: >>> >>> 1) I don't think I can write bug-free Scalar Replacement... >>> 2) This approach can automatically pick up C2 EA/SR improvements in the >>> future, such as JDK-8289943. >>> 3) If we focus only on 'escaped objects', we even don't need to deal >>> with deoptimization. Only 'scalar replaceable' objects need to save >>> Object states for deoptimization. Escaped objects disqualify that. >>> >>> [1]: Stadler, Lukas, Thomas W?rthinger, and Hanspeter M?ssenb?ck. >>> "Partial escape analysis and scalar replacement for Java." Proceedings >>> of Annual IEEE/ACM International Symposium on Code Generation and >>> Optimization. 2014. >>> >>> thanks, >>> --lx >>> > From dlong at openjdk.org Fri Oct 7 20:10:45 2022 From: dlong at openjdk.org (Dean Long) Date: Fri, 7 Oct 2022 20:10:45 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v4] In-Reply-To: References: Message-ID: On Fri, 7 Oct 2022 15:09:05 GMT, Quan Anh Mai wrote: >> How about including `peephole_.hpp` only when "peepprocedure" is seen, and delete opto/peephole.hpp and empty `peephole_.hpp` files? > > What do you think if I include `peephole_x86.hpp` in `x86_64.ad` in a source hpp block? This will result in the include appearing in `ad_x86.hpp`, which will be transitively included in `ad_x86_peephole.cpp`. Thanks a lot. Yes, good idea. >> How does it work when there is no peepprocedure? I would expect that all the boiler-plate details are taken care of by peepreplace. > > The code produced by `peepreplace` is really limited, it cannot be applied in this case because both the `MachSpillCopy` and `MachProj` are not match rules. 
Also, I think generated code is more powerful than a helper function, so achieving something general with a helper function would be hard. I think we can refactor it into a helper if similarities arise later instead. What do you think? OK, agreed. ------------- PR: https://git.openjdk.org/jdk/pull/8025 From vladimir.kozlov at oracle.com Fri Oct 7 20:21:44 2022 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 7 Oct 2022 13:21:44 -0700 Subject: [External] : Re: RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: <94A2737B-0FDE-4159-827F-15B6ACE1BC42@oracle.com> References: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> <94A2737B-0FDE-4159-827F-15B6ACE1BC42@oracle.com> Message-ID: On 10/7/22 10:37 AM, Igor Veresov wrote: > The major difference between Graal and C2 is that Graal captures the state at side effects and C2 captures the state at deopt points. That allows Graal to deduce state at any time, including when it needs to insert a rematerializing allocation during PEA. So, with C2 you have to either do everything in the parser as you are proposing or do the same thing as Graal and at least capture the state for stores. Having a state different from the original allocation point is ok. Both Graal and C2 would throw OOMs from a place that could be far from the original point because of the EA. > > I think you also have to track the values of all of the object components, right? So when you rematerialize the object, it consumes the current updated values to construct it. How do you intend to track those? Yes, you either track stores in the Parser or do what current C2 EA does and create unique memory slices for the VirtualObject. Current C2 EA [1] looks for the latest stores (or initial values) to the object (which has a unique Allocation node id) starting from the Safepoint memory input when we replace Allocate with SafePointScalarObject.
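The "track stores in the parser" option described above can be pictured as a small side table: for each virtual object, remember the latest value written to each field, so a later materialization (or a SafePointScalarObject) can consume the current values. A hypothetical sketch, not HotSpot code — string ids and field names stand in for node indices and field offsets:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of per-virtual-object store tracking during parsing.
// Each putfield to a virtual object updates its entry; materialization
// takes a snapshot of the latest values.
public class StoreTracker {
    // (virtual object id) -> (field -> latest stored value)
    private final Map<String, Map<String, Object>> state = new HashMap<>();

    void recordStore(String objId, String field, Object value) {
        state.computeIfAbsent(objId, k -> new HashMap<>()).put(field, value);
    }

    // Values needed to rematerialize the object at the current point.
    Map<String, Object> snapshot(String objId) {
        return Map.copyOf(state.getOrDefault(objId, Map.of()));
    }
}
```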
And you need to create separate memory slices for it as we do in EA for Allocation node. Vladimir K [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/macro.cpp#L452 > > igor > >> On Oct 6, 2022, at 5:09 PM, Liu, Xin wrote: >> >> hi, Ignor, >> >> You are right. Cloning the JVMState of original Allocation Node isn't >> the correct behavior. I need the JVMState right at materialization. I >> think it is available because we are in parser. For 2 places of >> materialization: >> 1) we are handling the bytecode which causes the object to escape. It's >> probably putfield/return/invoke. Current JVMState it is. >> 2) we are in MergeProcessor. We need to materialize a virtual object in >> its predecessors. We can extract the exiting JVMState from the >> predecessor Block. >> >> I just realize maybe that's the one of the reasons Graal saves >> 'FrameState' at store nodes. Graal needs to revisit the 'FrameState' >> when its PEA phase does materialization in high-tier. >> >> Apart from safepoint, there's one corner case bothering me. JLS says >> that creation of a class instance may throw an >> OOME.(https://docs.oracle.com/javase/specs/jls/se19/html/jls-15.html#jls-15.9.4) >> >> " >> space is allocated for the new class instance. If there is insufficient >> space to allocate the object, evaluation of the class instance creation >> expression completes abruptly by throwing an OutOfMemoryError. >> " >> >> and it's cross-referenced by bytecode new in JVMS >> https://docs.oracle.com/javase/specs/jvms/se19/html/jvms-6.html#jvms-6.5.new >> >> If we have moved the Allocation Node and JVM happens to run out of >> memory, the first frame of stacktrace will drift a little bit, right? >> The bci and source linenum will be wrong. Does it matter? I can't >> imagine that user's programs rely on this information. >> >> I think it's possible to amend this bci/line number in JVMState level. I >> will leave it as an open question and revisit it later. 
>> >> Do I understand your concern? if it makes sense to you, I will update >> the RFC doc. >> >> thanks, >> --lx >> >> >> >> >> On 10/6/22 3:00 PM, Igor Veresov wrote: >>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >>> >>> >>> >>> Hi, >>> >>> You say that when you materialize the clone you plan to have the same jvm state as the original allocation. How is that possible in a general case? There can be arbitrary changes of state between the original allocation point and where the clone materializes. >>> >>> Igor >>> >>>> On Oct 6, 2022, at 10:42 AM, Liu, Xin wrote: >>>> >>>> Hi, >>>> >>>> We would like to pursuit PEA in HotSpot. I spent time thinking how to >>>> adapt Stadler's Partial Escape Analysis[1] to C2. I think there are 3 >>>> elements in it. 1) flow-sensitive escape analysis 2) lazy code motion >>>> for the allocation and initialization 3) on-the-fly scalar replacement. >>>> The most complex part is 3) and it has done by C2. I'd like to leverage >>>> that, so I come up an idea to focus only on escaped objects in the >>>> algorithm and delegate others to the existing C2 phases. Here is my RFC. >>>> May I get your precious time on this? >>>> >>>> https://gist.github.com/navyxliu/62a510a5c6b0245164569745d758935b#rfc-partial-escape-analysis-in-hotspot-c2 >>>> >>>> The idea is based on the following two observations. >>>> >>>> 1. Stadler's PEA can cooperate with C2 EA/SR. >>>> >>>> If an object moves to the place it is about to escape, it won't impact >>>> C2 EA/SR later. It's because it will be marked as 'GlobalEscaped'. C2 EA >>>> won't do anything for it anyway. >>>> >>>> If PEA don't touch a non-escaped object, it won't change its >>>> escapability. It can punt it to C2 EA/SR and the result is still same. >>>> >>>> >>>> 2. The original AllocationNode is either dead or scalar replaceable >>>> after Stadler's PEA. 
>>>> >>>> Stadler's algorithm virtualizes an allocation Node and materializes it >>>> on demand. There are 2 places to materialize it. 1) the virtual object >>>> is about to escape 2) MergeProcessor needs to merge an object and at >>>> least one of its predecessor has materialized. MergeProcessor has to >>>> materialize all virtual objects in other predecessors([1] 5.3, Merge nodes). >>>> >>>> We can prove the observation 2 using 'proof of contradiction' here. >>>> Assume the original Allocation node is neither dead nor Scalar Replaced >>>> after Stadler's PEA, and program is still correct. >>>> >>>> Program must need the original allocation node somewhere. The algorithm >>>> has deleted the original allocation node in virtualization step and >>>> never bring it back. It contradicts that the program is still correct. QED. >>>> >>>> >>>> If you're convinced, then we can leverage it. In my design, I don't >>>> virtualize the original node but just leave it there. C2 MacroExpand >>>> phase will take care of the original allocation node as long as it's >>>> either dead or scalar-replaceable. It never get a chance to expand. >>>> >>>> If we restrain on-the-fly scalar replacement in Stadler's PEA, we can >>>> delegate it to C2 EA/SR! There are 3 gains: >>>> >>>> 1) I don't think I can write bug-free Scalar Replacement... >>>> 2) This approach can automatically pick up C2 EA/SR improvements in the >>>> future, such as JDK-8289943. >>>> 3) If we focus only on 'escaped objects', we even don't need to deal >>>> with deoptimization. Only 'scalar replaceable' objects need to save >>>> Object states for deoptimization. Escaped objects disqualify that. >>>> >>>> [1]: Stadler, Lukas, Thomas W?rthinger, and Hanspeter M?ssenb?ck. >>>> "Partial escape analysis and scalar replacement for Java." Proceedings >>>> of Annual IEEE/ACM International Symposium on Code Generation and >>>> Optimization. 2014. 
>>>> >>>> thanks, >>>> --lx >>>> >> > From vlivanov at openjdk.org Fri Oct 7 20:27:20 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 7 Oct 2022 20:27:20 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v4] In-Reply-To: References: Message-ID: On Fri, 7 Oct 2022 20:07:27 GMT, Dean Long wrote: >> What do you think if I include `peephole_x86.hpp` in `x86_64.ad` in a source hpp block? This will result in the include appearing in `ad_x86.hpp`, which will be transitively included in `ad_x86_peephole.cpp`. Thanks a lot. > > Yes, good idea. Considering `peephole_x86.cpp` contains only x64-specific code, I had a suggestion to rename it into `peephole_x86_64.cpp` and move `#ifdef _LP64` into `peephole_x86.hpp` to guard x64-specific declarations. But if you intend to include the header directly from `x86_64.ad`, you can rename the header to `peephole_x86_64.hpp` and get rid of `#ifdef _LP64` completely. ------------- PR: https://git.openjdk.org/jdk/pull/8025 From xxinliu at amazon.com Fri Oct 7 22:26:13 2022 From: xxinliu at amazon.com (Liu, Xin) Date: Fri, 7 Oct 2022 15:26:13 -0700 Subject: [External] : Re: RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: References: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> <94A2737B-0FDE-4159-827F-15B6ACE1BC42@oracle.com> Message-ID: <114af950-f6b6-7e4a-8ac0-3da99bd40297@amazon.com> Hi, Igor and Vladimir, I am not inventing anything new. All I am thinking is how to adapt Stadler's algorithm to C2. All innovation belong to the author. Figure-3 of my RFC is a copy of Listing-7 in his paper. Allow me to repeat his data structure here. I drop "class Id" because I think I can use AllocationNode pointer or even node idx instead. // this is per allocation, identified by 'Id'. 
class VirtualState extends ObjectState {
    int lockCount;
    Node[] entries;
}

// this is per basic-block
class State {
    Map state;
    Map alias;
}

In a basic block, PEA keeps tracking the allocation state of an object using VirtualState. In his paper, Figure-4 (b) and (e) depict how the algorithm tracks stores. To get flow-sensitive information, Stadler iterates over the scheduled nodes in a basic block. I propose to iterate over bytecodes within a basic block. > when you rematerialize the object, it consumes the current updated values to construct it. How do you intend to track those? >> Yes, you either track stores in Parser or do what current C2 EA does and create unique memory slices for VirtualObject. I plan to follow suit and track stores in the parser! I also need to create a unique memory slice when I have to materialize a virtual object. This is for InitializeNode and I need to initialize the object to the cumulative state. thanks, --lx On 10/7/22 1:21 PM, Vladimir Kozlov wrote: > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > On 10/7/22 10:37 AM, Igor Veresov wrote: >> The major difference between Graal and C2 is that Graal captures the state at side effects and C2 captures the state at deopt points. That allows Graal to deduce state at any time, including when it needs to insert a rematerializing allocation during PEA. So, with C2 you have to either do everything in the parser as you are proposing or do the same thing as Graal and at least capture the state for stores. Having a state different from the original allocation point is ok. Both Graal and C2 would throw OOMs from places that could be far from the original point because of the EA. >> >> I think you also have to track the values of all of the object components, right? So when you rematerialize the object, it consumes the current updated values to construct it.
How do you intend to track those? > > Yes, you either track stores in Parser or do what current C2 EA does and create unique memory slices for VirtualObject. > > Current C2 EA [1] looks for the latest stores (or initial values) to the object (which has a unique Allocation node id) > starting from the Safepoint memory input when we replace Allocate with SafePointScalarObject. > > You would need to use the VirtualObject node id as a unique instance id. And you need to create separate memory slices for it > as we do in EA for the Allocation node. > > Vladimir K > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/macro.cpp#L452 > >> >> igor >> >>> On Oct 6, 2022, at 5:09 PM, Liu, Xin wrote: >>> >>> hi, Igor, >>> >>> You are right. Cloning the JVMState of the original Allocation Node isn't >>> the correct behavior. I need the JVMState right at materialization. I >>> think it is available because we are in the parser. For the 2 places of >>> materialization: >>> 1) we are handling the bytecode which causes the object to escape. It's >>> probably putfield/return/invoke. Current JVMState it is. >>> 2) we are in MergeProcessor. We need to materialize a virtual object in >>> its predecessors. We can extract the exiting JVMState from the >>> predecessor Block. >>> >>> I just realized maybe that's one of the reasons Graal saves >>> 'FrameState' at store nodes. Graal needs to revisit the 'FrameState' >>> when its PEA phase does materialization in high-tier. >>> >>> Apart from safepoints, there's one corner case bothering me. The JLS says >>> that creation of a class instance may throw an >>> OOME (https://docs.oracle.com/javase/specs/jls/se19/html/jls-15.html#jls-15.9.4): >>> >>> " >>> space is allocated for the new class instance. If there is insufficient >>> space to allocate the object, evaluation of the class instance creation >>> expression completes abruptly by throwing an OutOfMemoryError.
>>> " >>> and it's cross-referenced by the bytecode new in the JVMS: >>> https://docs.oracle.com/javase/specs/jvms/se19/html/jvms-6.html#jvms-6.5.new >>> >>> If we have moved the Allocation Node and the JVM happens to run out of >>> memory, the first frame of the stacktrace will drift a little bit, right? >>> The bci and source line number will be wrong. Does it matter? I can't >>> imagine that users' programs rely on this information. >>> >>> I think it's possible to amend this bci/line number at the JVMState level. I >>> will leave it as an open question and revisit it later. >>> >>> Do I understand your concern? If it makes sense to you, I will update >>> the RFC doc. >>> >>> thanks, >>> --lx >>> >>> >>> >>> >>> On 10/6/22 3:00 PM, Igor Veresov wrote: >>>> >>>> Hi, >>>> >>>> You say that when you materialize the clone you plan to have the same JVM state as the original allocation. How is that possible in the general case? There can be arbitrary changes of state between the original allocation point and where the clone materializes. >>>> >>>> Igor >>>> >>>>> On Oct 6, 2022, at 10:42 AM, Liu, Xin wrote: >>>>> >>>>> Hi, >>>>> >>>>> We would like to pursue PEA in HotSpot. I spent time thinking about how to >>>>> adapt Stadler's Partial Escape Analysis [1] to C2. I think there are 3 >>>>> elements in it: 1) flow-sensitive escape analysis; 2) lazy code motion >>>>> for the allocation and initialization; 3) on-the-fly scalar replacement. >>>>> The most complex part is 3) and it has been done by C2. I'd like to leverage >>>>> that, so I came up with an idea to focus only on escaped objects in the >>>>> algorithm and delegate the rest to the existing C2 phases. Here is my RFC. >>>>> May I get your precious time on this?
>>>>> >>>>> https://gist.github.com/navyxliu/62a510a5c6b0245164569745d758935b#rfc-partial-escape-analysis-in-hotspot-c2 >>>>> >>>>> The idea is based on the following two observations. >>>>> >>>>> 1. Stadler's PEA can cooperate with C2 EA/SR. >>>>> >>>>> If an object moves to the place it is about to escape, it won't impact >>>>> C2 EA/SR later. That's because it will be marked as 'GlobalEscaped'. C2 EA >>>>> won't do anything for it anyway. >>>>> >>>>> If PEA doesn't touch a non-escaped object, it won't change its >>>>> escapability. It can punt it to C2 EA/SR and the result is still the same. >>>>> >>>>> >>>>> 2. The original AllocationNode is either dead or scalar replaceable >>>>> after Stadler's PEA. >>>>> >>>>> Stadler's algorithm virtualizes an allocation Node and materializes it >>>>> on demand. There are 2 places to materialize it: 1) the virtual object >>>>> is about to escape; 2) MergeProcessor needs to merge an object and at >>>>> least one of its predecessors has materialized. MergeProcessor has to >>>>> materialize all virtual objects in the other predecessors ([1] 5.3, Merge nodes). >>>>> >>>>> We can prove observation 2 using 'proof by contradiction' here. >>>>> Assume the original Allocation node is neither dead nor Scalar Replaced >>>>> after Stadler's PEA, and the program is still correct. >>>>> >>>>> The program must need the original allocation node somewhere. The algorithm >>>>> has deleted the original allocation node in the virtualization step and >>>>> never brings it back. That contradicts the program still being correct. QED. >>>>> >>>>> >>>>> If you're convinced, then we can leverage it. In my design, I don't >>>>> virtualize the original node but just leave it there. The C2 MacroExpand >>>>> phase will take care of the original allocation node as long as it's >>>>> either dead or scalar-replaceable. It never gets a chance to expand. >>>>> >>>>> If we restrain on-the-fly scalar replacement in Stadler's PEA, we can >>>>> delegate it to C2 EA/SR!
There are 3 gains: >>>>> >>>>> 1) I don't think I can write bug-free Scalar Replacement... >>>>> 2) This approach can automatically pick up C2 EA/SR improvements in the >>>>> future, such as JDK-8289943. >>>>> 3) If we focus only on 'escaped objects', we don't even need to deal >>>>> with deoptimization. Only 'scalar replaceable' objects need to save >>>>> Object states for deoptimization. Escaped objects disqualify that. >>>>> >>>>> [1]: Stadler, Lukas, Thomas Würthinger, and Hanspeter Mössenböck. >>>>> "Partial escape analysis and scalar replacement for Java." Proceedings >>>>> of the Annual IEEE/ACM International Symposium on Code Generation and >>>>> Optimization. 2014. >>>>> >>>>> thanks, >>>>> --lx >>>>> >>> >> From xgong at openjdk.org Sat Oct 8 15:36:32 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Sat, 8 Oct 2022 15:36:32 GMT Subject: RFR: 8293409: [vectorapi] Intrinsify VectorSupport.indexVector In-Reply-To: References: Message-ID: On Mon, 19 Sep 2022 08:51:24 GMT, Xiaohong Gong wrote: > "`VectorSupport.indexVector()`" is used to compute a vector that contains the index values based on a given vector and a scale value (i.e. `index = vec + iota * scale`). This function is widely used in other APIs like "`VectorMask.indexInRange`", which is useful for tail loop vectorization. And it can be easily implemented with the vector instructions. > > This patch adds the vector intrinsic implementation of it. The steps are: > > 1) Load the const "iota" vector. > > We extend the "`vector_iota_indices`" stubs from byte to other integral types.
For floating point vectors, it needs an additional vector cast to get the right iota values. > > 2) Compute indexes with "`vec + iota * scale`" > > Here is the performance result to the new added micro benchmark on ARM NEON: > > Benchmark Gain > IndexVectorBenchmark.byteIndexVector 1.477 > IndexVectorBenchmark.doubleIndexVector 5.031 > IndexVectorBenchmark.floatIndexVector 5.342 > IndexVectorBenchmark.intIndexVector 5.529 > IndexVectorBenchmark.longIndexVector 3.177 > IndexVectorBenchmark.shortIndexVector 5.841 > > > Please help to review and share the feedback! Thanks in advance! Ping again, could anyone please help to take a review at this PR? Thanks in advance! ------------- PR: https://git.openjdk.org/jdk/pull/10332 From xgong at openjdk.org Sat Oct 8 15:38:59 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Sat, 8 Oct 2022 15:38:59 GMT Subject: RFR: 8292898: [vectorapi] Unify vector mask cast operation [v7] In-Reply-To: References: Message-ID: > The current implementation of the vector mask cast operation is > complex that the compiler generates different patterns for different > scenarios. For architectures that do not support the predicate > feature, vector mask is represented the same as the normal vector. > So the vector mask cast is implemented by `VectorCast `node. But this > is not always needed. When two masks have the same element size (e.g. > int vs. float), their bits layout are the same. So casting between > them does not need to emit any instructions. > > Currently the compiler generates different patterns based on the > vector type of the input/output and the platforms. Normally the > "`VectorMaskCast`" op is only used for cases that doesn't emit any > instructions, and "`VectorCast`" op is used to implement the necessary > expand/narrow operations. This can avoid adding some duplicate rules > in the backend. 
However, this also has drawbacks: > > 1) The code is complex, especially when the compiler needs to > check whether the hardware supports the necessary IRs for the > vector mask cast. It needs to check different patterns for > different cases. > 2) The vector mask cast operation could be implemented with cheaper > instructions than the vector casting on some architectures. > > Instead of generating `VectorCast` or `VectorMaskCast` nodes for the different > cases of vector mask cast operations, this patch unifies the vector > mask cast implementation with the "`VectorMaskCast`" node for all vector types > and platforms. The missing backend rules are also added for it. > > This patch also simplifies the vector mask conversion that happens in > "`VectorUnbox::Ideal()`". Normally "`VectorUnbox (VectorBox vmask)`" can > be optimized to "`vmask`" if the unboxing type matches with the boxed > "`vmask`" type. Otherwise, it needs the type conversion. Currently the > "`VectorUnbox`" will be transformed to two different patterns to implement > the conversion: > > 1) If the element size is not changed, it is transformed to: > > "VectorMaskCast vmask" > > 2) Otherwise, it is transformed to: > > "VectorLoadMask (VectorStoreMask vmask)" > > It firstly converts the "`vmask`" to a boolean vector with "`VectorStoreMask`", > and then uses "`VectorLoadMask`" to convert the boolean vector to the > dst mask vector. Since this patch makes the "`VectorMaskCast`" op supported > for all types on all platforms, it doesn't need "`VectorLoadMask`" and > "`VectorStoreMask`" to do the conversion.
The existing transformation: > > VectorUnbox (VectorBox vmask) => VectorLoadMask (VectorStoreMask vmask) > > can be simplified to: > > VectorUnbox (VectorBox vmask) => VectorMaskCast vmask Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Change to use "avx512vl" cpu feature for some IR tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10192/files - new: https://git.openjdk.org/jdk/pull/10192/files/87f81b61..533a3445 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10192&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10192&range=05-06 Stats: 26 lines in 1 file changed: 0 ins; 0 del; 26 mod Patch: https://git.openjdk.org/jdk/pull/10192.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10192/head:pull/10192 PR: https://git.openjdk.org/jdk/pull/10192 From xgong at openjdk.org Sat Oct 8 15:39:02 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Sat, 8 Oct 2022 15:39:02 GMT Subject: RFR: 8292898: [vectorapi] Unify vector mask cast operation [v5] In-Reply-To: <5qCRwvjc4xVc8ub7swVXg2XibftsERW4kU_ElQnsFz0=.a79cd2cd-8dd6-4471-96cf-cd74922e085d@github.com> References: <9DHNZglc1nl35eN3euIu6naGNrE0TK8BC3Pqo8nO8-k=.b6453f1f-8af3-47d5-86fd-3268da5e5347@github.com> <5qCRwvjc4xVc8ub7swVXg2XibftsERW4kU_ElQnsFz0=.a79cd2cd-8dd6-4471-96cf-cd74922e085d@github.com> Message-ID: On Tue, 4 Oct 2022 06:05:27 GMT, Jatin Bhateja wrote: >> Hi @XiaohongGong , Thanks!, changes looks good to me, an IR framework test will complement the patch. > >> Hi @jatin-bhateja , the IR test has been added. Could you please help to review again? Thanks a lot! 
> > Some of the IR tests like testByte64ToLong512 are currently failing on KNL due to the following check > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L2484 > > since source and destination ideal types are different (TypeVect vs TypeVectMask), can you kindly change the feature check for the relevant IR tests to avx512vl till we remove that limitation. > > > Hi @jatin-bhateja , the IR test has been added. Could you please help to review again? Thanks a lot! > > > > > > Some of the IR tests like testByte64ToLong512 are currently failing on KNL due to the following check https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L2484 > > since source and destination ideal types are different (TypeVect vs TypeVectMask), can you kindly change the feature check for the relevant IR tests to avx512vl till we remove that limitation. > > Thanks for pointing out this issue. Sure, I will limit the feature check to "avx512vl" for all the 512-bit related casting. BTW, could you please show me how to run the test with the KNL feature? So that I can have an internal test before pushing the changes. Thanks a lot! Hi @jatin-bhateja , the test is updated. I tested it with `-XX:+UseKNLSetting` by adding the flag to `TestFramework.runWithFlags()` in the main function, and the tests pass. Could you please help to check whether it is ok for you? Thanks a lot! ------------- PR: https://git.openjdk.org/jdk/pull/10192 From qamai at openjdk.org Sat Oct 8 15:42:31 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Sat, 8 Oct 2022 15:42:31 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v5] In-Reply-To: References: Message-ID: > Hi, > > The current peephole mechanism has several drawbacks: > - Can only match and remove adjacent instructions. > - Cannot match machine ideal nodes (e.g. MachSpillCopyNode).
> - Can only replace 1 instruction, and the position of insertion is limited to the position at which the matched nodes reside. > - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. > > The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grained manner. > > The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: > > mov r1, r2 -> lea r1, [r2 + r3/i] > add r1, r3/i > > and > > mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 > shl r1, i > > On the added benchmarks, the transformations show positive results: > > Benchmark Mode Cnt Score Error Units > LeaPeephole.B_D_int avgt 5 1200.490 ± 104.662 ns/op > LeaPeephole.B_D_long avgt 5 1211.439 ± 30.196 ns/op > LeaPeephole.B_I_int avgt 5 1118.831 ± 7.995 ns/op > LeaPeephole.B_I_long avgt 5 1112.389 ± 15.838 ns/op > LeaPeephole.I_S_int avgt 5 1262.528 ± 7.293 ns/op > LeaPeephole.I_S_long avgt 5 1223.820 ± 17.777 ns/op > > Benchmark Mode Cnt Score Error Units > LeaPeephole.B_D_int avgt 5 860.889 ± 6.089 ns/op > LeaPeephole.B_D_long avgt 5 945.455 ± 21.603 ns/op > LeaPeephole.B_I_int avgt 5 849.109 ± 9.809 ns/op > LeaPeephole.B_I_long avgt 5 851.283 ± 16.921 ns/op > LeaPeephole.I_S_int avgt 5 976.594 ± 23.004 ns/op > LeaPeephole.I_S_long avgt 5 936.984 ± 9.601 ns/op > > A follow-up patch would add IR tests for these transformations, since the IR framework has not been able to parse the ideal scheduling yet, although printing the scheduling itself has been made possible recently. > > Thank you very much.
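The mov+add and mov+shl sequences that the peephole folds into a single lea typically come from very simple source-level patterns. Below is a hypothetical Java sketch of such patterns; the class, method, and variable names are illustrative and not taken from the actual LeaPeephole benchmark, and whether the peephole fires depends on the register allocator actually emitting the mov:

```java
public class LeaPatternSketch {
    // Without the peephole: mov dst, base; add dst, index
    // With the peephole:    lea dst, [base + index]
    static int addPattern(int base, int index) {
        int dst = base;   // mov dst, base
        dst += index;     // add dst, index
        return dst;
    }

    // Without the peephole: mov dst, src; shl dst, 3
    // With the peephole:    lea dst, [src * 8]  (scale = 1 << 3)
    static long shiftPattern(long src) {
        long dst = src;   // mov dst, src
        dst <<= 3;        // shl dst, 3
        return dst;
    }

    public static void main(String[] args) {
        System.out.println(addPattern(40, 2));  // 42
        System.out.println(shiftPattern(5));    // 40
    }
}
```

Note that if the register allocator assigns `dst` and `base` to the same register, there is no mov to fold and the peephole does not apply; the rule only pays off when the copy survives allocation.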
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: refactor includes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/8025/files - new: https://git.openjdk.org/jdk/pull/8025/files/d5928f02..566b8dd1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=8025&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=8025&range=03-04 Stats: 182 lines in 9 files changed: 6 ins; 175 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/8025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/8025/head:pull/8025 PR: https://git.openjdk.org/jdk/pull/8025 From qamai at openjdk.org Sat Oct 8 15:42:31 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Sat, 8 Oct 2022 15:42:31 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v4] In-Reply-To: References: Message-ID: On Fri, 7 Oct 2022 20:23:28 GMT, Vladimir Ivanov wrote: >> Yes, good idea. > > Considering `peephole_x86.cpp` contains only x64-specific code, I had a suggestion to rename it into `peephole_x86_64.cpp` and move `#ifdef _LP64` into `peephole_x86.hpp` to guard x64-specific declarations. But if you intend to include the header directly from `x86_64.ad`, you can rename the header to `peephole_x86_64.hpp` and get rid of `#ifdef _LP64` completely. I have refactored that and removed `opto/peephole.hpp` as well as other peephole headers. `peephole_x86.cpp` is also renamed to `peephole_x86_64.cpp`. Thanks very much. ------------- PR: https://git.openjdk.org/jdk/pull/8025 From epeter at openjdk.org Sun Oct 9 05:38:53 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Sun, 9 Oct 2022 05:38:53 GMT Subject: RFR: 8295005: compiler/loopopts/TestRemoveEmptyLoop.java fails with release VMs after JDK-8294839 In-Reply-To: References: Message-ID: On Sat, 8 Oct 2022 00:13:04 GMT, Jie Fu wrote: > Since we need to set `-XX:StressLongCountedLoop=0` to avoid timeout of the test. > So only run it with debug VMs. > > Thanks. 
> Best regards, > Jie @DamonFool First, sorry for the silly mistake and thanks for fixing it. If we still want to run this test on develop machines, we could alternatively simply use `-XX:+IgnoreUnrecognizedVMOptions`. On product builds the default is already `-XX:StressLongCountedLoop=0`. ------------- PR: https://git.openjdk.org/jdk/pull/10617 From jiefu at openjdk.org Sun Oct 9 06:11:18 2022 From: jiefu at openjdk.org (Jie Fu) Date: Sun, 9 Oct 2022 06:11:18 GMT Subject: RFR: 8295005: compiler/loopopts/TestRemoveEmptyLoop.java fails with release VMs after JDK-8294839 [v2] In-Reply-To: References: Message-ID: > Since we need to set `-XX:StressLongCountedLoop=0` to avoid timeout of the test. > So only run it with debug VMs. > > Thanks. > Best regards, > Jie Jie Fu has updated the pull request incrementally with one additional commit since the last revision: Use IgnoreUnrecognizedVMOptions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10617/files - new: https://git.openjdk.org/jdk/pull/10617/files/c7af3487..1f78147e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10617&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10617&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10617.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10617/head:pull/10617 PR: https://git.openjdk.org/jdk/pull/10617 From jiefu at openjdk.org Sun Oct 9 06:14:54 2022 From: jiefu at openjdk.org (Jie Fu) Date: Sun, 9 Oct 2022 06:14:54 GMT Subject: RFR: 8295005: compiler/loopopts/TestRemoveEmptyLoop.java fails with release VMs after JDK-8294839 In-Reply-To: References: Message-ID: On Sun, 9 Oct 2022 05:36:46 GMT, Emanuel Peter wrote: > If we still want to run this test on develop machines, we could alternatively simply use `-XX:+IgnoreUnrecognizedVMOptions`. On product builds the default is already `-XX:StressLongCountedLoop=0`. Thanks @eme64 for the review. Right. 
Since we can't change the default value of `StressLongCountedLoop` in product VMs, it's safe to do so. Updated. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10617 From jbhateja at openjdk.org Mon Oct 10 03:00:40 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 10 Oct 2022 03:00:40 GMT Subject: RFR: 8292898: [vectorapi] Unify vector mask cast operation [v5] In-Reply-To: <5qCRwvjc4xVc8ub7swVXg2XibftsERW4kU_ElQnsFz0=.a79cd2cd-8dd6-4471-96cf-cd74922e085d@github.com> References: <9DHNZglc1nl35eN3euIu6naGNrE0TK8BC3Pqo8nO8-k=.b6453f1f-8af3-47d5-86fd-3268da5e5347@github.com> <5qCRwvjc4xVc8ub7swVXg2XibftsERW4kU_ElQnsFz0=.a79cd2cd-8dd6-4471-96cf-cd74922e085d@github.com> Message-ID: <_RyMIzGdooywmMxf9o1A1zh3etNYpKScHPn5-EvoivA=.6c256471-3d7b-4987-b26f-097a282d977e@github.com> On Tue, 4 Oct 2022 06:05:27 GMT, Jatin Bhateja wrote: >> Hi @XiaohongGong , Thanks!, changes looks good to me, an IR framework test will complement the patch. > >> Hi @jatin-bhateja , the IR test has been added. Could you please help to review again? Thanks a lot! > > Some of the IR tests like testByte64ToLong512 are currently are failing on KNL due to following check > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L2484 > > since source and destination ideal types are different (TypeVect vs TypeVectMask), can you kindly change the feature check for relevant IR tests to avx512vl till we remove that limitation. > > > > Hi @jatin-bhateja , the IR test has been added. Could you please help to review again? Thanks a lot! > > > > > > > > > Some of the IR tests like testByte64ToLong512 are currently are failing on KNL due to following check https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L2484 > > > since source and destination ideal types are different (TypeVect vs TypeVectMask), can you kindly change the feature check for relevant IR tests to avx512vl till we remove that limitation. 
> > > > > > Thanks for pointing out this issue. Sure, I will limit the feature check to "avx512vl" for all the 512 bits related casting. BTW, could you please show me how to run the test with KNL feature? So that I can have an internal test before pushing the changes. Thanks a lot! > > Hi @jatin-bhateja , the test is updated. I tested it with `-XX:+UseKNLSetting` by adding the flag to `TestFramework.runWithFlags()` in the main function, and tests pass. Could you please help to check whether it is ok for you? Thanks!, we can also pass additional flag in JTREG_WHITELIST_FLAGS in TestFramework.java Thanks a lot! Hi @XiaohongGong , Thanks for addressing my comments, test now passes on KNL platform. Newly introduced @Warmup annotation in all the tests looks redundant since in NORMAL run-mode framework does the necessary warmup followed by compilation by "C2' (default compiler). ------------- PR: https://git.openjdk.org/jdk/pull/10192 From jbhateja at openjdk.org Mon Oct 10 03:06:52 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 10 Oct 2022 03:06:52 GMT Subject: RFR: 8292898: [vectorapi] Unify vector mask cast operation [v7] In-Reply-To: References: Message-ID: On Sat, 8 Oct 2022 15:38:59 GMT, Xiaohong Gong wrote: >> The current implementation of the vector mask cast operation is >> complex that the compiler generates different patterns for different >> scenarios. For architectures that do not support the predicate >> feature, vector mask is represented the same as the normal vector. >> So the vector mask cast is implemented by `VectorCast `node. But this >> is not always needed. When two masks have the same element size (e.g. >> int vs. float), their bits layout are the same. So casting between >> them does not need to emit any instructions. >> >> Currently the compiler generates different patterns based on the >> vector type of the input/output and the platforms. 
Normally the >> "`VectorMaskCast`" op is only used for cases that doesn't emit any >> instructions, and "`VectorCast`" op is used to implement the necessary >> expand/narrow operations. This can avoid adding some duplicate rules >> in the backend. However, this also has the drawbacks: >> >> 1) The codes are complex, especially when the compiler needs to >> check whether the hardware supports the necessary IRs for the >> vector mask cast. It needs to check different patterns for >> different cases. >> 2) The vector mask cast operation could be implemented with cheaper >> instructions than the vector casting on some architectures. >> >> Instead of generating `VectorCast `or `VectorMaskCast `nodes for different >> cases of vector mask cast operations, this patch unifies the vector >> mask cast implementation with "`VectorMaskCast`" node for all vector types >> and platforms. The missing backend rules are also added for it. >> >> This patch also simplies the vector mask conversion happened in >> "`VectorUnbox::Ideal()`". Normally "`VectorUnbox (VectorBox vmask)`" can >> be optimized to "`vmask`" if the unboxing type matches with the boxed >> "`vmask`" type. Otherwise, it needs the type conversion. Currently the >> "`VectorUnbox`" will be transformed to two different patterns to implement >> the conversion: >> >> 1) If the element size is not changed, it is transformed to: >> >> "VectorMaskCast vmask" >> >> 2) Otherwise, it is transformed to: >> >> "VectorLoadMask (VectorStoreMask vmask)" >> >> It firstly converts the "`vmask`" to a boolean vector with "`VectorStoreMask`", >> and then uses "`VectorLoadMask`" to convert the boolean vector to the >> dst mask vector. Since this patch makes "`VectorMaskCast`" op supported >> for all types on all platforms, it doesn't need the "`VectorLoadMask`" and >> "`VectorStoreMask`" to do the conversion. 
The existing transformation: >> >> VectorUnbox (VectorBox vmask) => VectorLoadMask (VectorStoreMask vmask) >> >> can be simplified to: >> >> VectorUnbox (VectorBox vmask) => VectorMaskCast vmask > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Change to use "avx512vl" cpu feature for some IR tests Rest of the common IR and X86 backend changes looks good to me, you may need a second review clearance. Please remove additional warmup introduced in tests. ------------- Marked as reviewed by jbhateja (Reviewer). PR: https://git.openjdk.org/jdk/pull/10192 From xgong at openjdk.org Mon Oct 10 03:13:53 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 10 Oct 2022 03:13:53 GMT Subject: RFR: 8292898: [vectorapi] Unify vector mask cast operation [v5] In-Reply-To: <_RyMIzGdooywmMxf9o1A1zh3etNYpKScHPn5-EvoivA=.6c256471-3d7b-4987-b26f-097a282d977e@github.com> References: <9DHNZglc1nl35eN3euIu6naGNrE0TK8BC3Pqo8nO8-k=.b6453f1f-8af3-47d5-86fd-3268da5e5347@github.com> <5qCRwvjc4xVc8ub7swVXg2XibftsERW4kU_ElQnsFz0=.a79cd2cd-8dd6-4471-96cf-cd74922e085d@github.com> <_RyMIzGdooywmMxf9o1A1zh3etNYpKScHPn5-EvoivA=.6c256471-3d7b-4987-b26f-097a282d977e@github.com> Message-ID: On Mon, 10 Oct 2022 02:57:17 GMT, Jatin Bhateja wrote: >>> Hi @jatin-bhateja , the IR test has been added. Could you please help to review again? Thanks a lot! >> >> Some of the IR tests like testByte64ToLong512 are currently are failing on KNL due to following check >> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L2484 >> >> since source and destination ideal types are different (TypeVect vs TypeVectMask), can you kindly change the feature check for relevant IR tests to avx512vl till we remove that limitation. > >> > > > Hi @jatin-bhateja , the IR test has been added. Could you please help to review again? Thanks a lot! 
>> > > >> > > >> > > Some of the IR tests like testByte64ToLong512 are currently are failing on KNL due to following check https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L2484 >> > > since source and destination ideal types are different (TypeVect vs TypeVectMask), can you kindly change the feature check for relevant IR tests to avx512vl till we remove that limitation. >> > >> > >> > Thanks for pointing out this issue. Sure, I will limit the feature check to "avx512vl" for all the 512 bits related casting. BTW, could you please show me how to run the test with KNL feature? So that I can have an internal test before pushing the changes. Thanks a lot! >> >> Hi @jatin-bhateja , the test is updated. I tested it with `-XX:+UseKNLSetting` by adding the flag to > `TestFramework.runWithFlags()` in the main function, and tests pass. Could you please help to check whether it is ok for you? > Thanks!, we can also pass additional flag in JTREG_WHITELIST_FLAGS in TestFramework.java > Thanks a lot! > > Hi @XiaohongGong , Thanks for addressing my comments, test now passes on KNL platform. > Newly introduced @Warmup annotation in all the tests looks redundant since in NORMAL run-mode framework does the necessary warmup followed by compilation by "C2' (default compiler). > > > > > Hi @jatin-bhateja , the IR test has been added. Could you please help to review again? Thanks a lot! > > > > > > > > > > > > Some of the IR tests like testByte64ToLong512 are currently are failing on KNL due to following check https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L2484 > > > > since source and destination ideal types are different (TypeVect vs TypeVectMask), can you kindly change the feature check for relevant IR tests to avx512vl till we remove that limitation. > > > > > > > > > Thanks for pointing out this issue. Sure, I will limit the feature check to "avx512vl" for all the 512 bits related casting. 
BTW, could you please show me how to run the test with KNL feature? So that I can have an internal test before pushing the changes. Thanks a lot! > > > Hi @jatin-bhateja , the test is updated. I tested it with `-XX:+UseKNLSetting` by adding the flag to > `TestFramework.runWithFlags()` in the main function, and tests pass. Could you please help to check whether it is ok for you? > Thanks!, we can also pass additional flag in JTREG_WHITELIST_FLAGS in TestFramework.java > Thanks a lot! > > Hi @XiaohongGong , Thanks for addressing my comments, test now passes on KNL platform. > Newly introduced @Warmup annotation in all the tests looks redundant since in NORMAL run-mode framework does the necessary warmup followed by compilation by "C2" (default compiler). Hi @jatin-bhateja , thanks for looking at the changes again! Yes, you are right that the framework has a default warmup (2000). But I'd like to keep the newly added 10000 here, because I met the IR check failing issues when I wrote another IR test and set the warmup as 5000 before. To be honest, I don't know why it fails since the method is compiled by C2, but the compiler shows it lost some information, which meant the expected IR was not generated. So adding the larger warmup is safe for me. WDYT? Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/10192 From eliu at openjdk.org Mon Oct 10 03:15:55 2022 From: eliu at openjdk.org (Eric Liu) Date: Mon, 10 Oct 2022 03:15:55 GMT Subject: RFR: 8292898: [vectorapi] Unify vector mask cast operation [v7] In-Reply-To: References: Message-ID: On Sat, 8 Oct 2022 15:38:59 GMT, Xiaohong Gong wrote: >> The current implementation of the vector mask cast operation is >> so complex that the compiler generates different patterns for different >> scenarios. For architectures that do not support the predicate >> feature, vector mask is represented the same as the normal vector. >> So the vector mask cast is implemented by the `VectorCast` node. But this >> is not always needed.
When two masks have the same element size (e.g. >> int vs. float), their bit layouts are the same. So casting between >> them does not need to emit any instructions. >> >> Currently the compiler generates different patterns based on the >> vector type of the input/output and the platforms. Normally the >> "`VectorMaskCast`" op is only used for cases that don't emit any >> instructions, and the "`VectorCast`" op is used to implement the necessary >> expand/narrow operations. This can avoid adding some duplicate rules >> in the backend. However, this also has drawbacks: >> >> 1) The code is complex, especially when the compiler needs to >> check whether the hardware supports the necessary IRs for the >> vector mask cast. It needs to check different patterns for >> different cases. >> 2) The vector mask cast operation could be implemented with cheaper >> instructions than the vector casting on some architectures. >> >> Instead of generating `VectorCast` or `VectorMaskCast` nodes for different >> cases of vector mask cast operations, this patch unifies the vector >> mask cast implementation with the "`VectorMaskCast`" node for all vector types >> and platforms. The missing backend rules are also added for it. >> >> This patch also simplifies the vector mask conversion that happens in >> "`VectorUnbox::Ideal()`". Normally "`VectorUnbox (VectorBox vmask)`" can >> be optimized to "`vmask`" if the unboxing type matches the boxed >> "`vmask`" type. Otherwise, it needs the type conversion. Currently the >> "`VectorUnbox`" will be transformed to two different patterns to implement >> the conversion: >> >> 1) If the element size is not changed, it is transformed to: >> >> "VectorMaskCast vmask" >> >> 2) Otherwise, it is transformed to: >> >> "VectorLoadMask (VectorStoreMask vmask)" >> >> It first converts the "`vmask`" to a boolean vector with "`VectorStoreMask`", >> and then uses "`VectorLoadMask`" to convert the boolean vector to the >> dst mask vector.
Since this patch makes "`VectorMaskCast`" op supported >> for all types on all platforms, it doesn't need the "`VectorLoadMask`" and >> "`VectorStoreMask`" to do the conversion. The existing transformation: >> >> VectorUnbox (VectorBox vmask) => VectorLoadMask (VectorStoreMask vmask) >> >> can be simplified to: >> >> VectorUnbox (VectorBox vmask) => VectorMaskCast vmask > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Change to use "avx512vl" cpu feature for some IR tests LGTM. ------------- Marked as reviewed by eliu (Committer). PR: https://git.openjdk.org/jdk/pull/10192 From yyang at openjdk.org Mon Oct 10 03:18:54 2022 From: yyang at openjdk.org (Yi Yang) Date: Mon, 10 Oct 2022 03:18:54 GMT Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain In-Reply-To: References: Message-ID: On Mon, 8 Aug 2022 05:23:05 GMT, Tobias Hartmann wrote: > With your fix, correctness still depends on the order in which nodes are processed by IGVN, right? Wouldn't this still reproduce with `-XX:+StressIGVN`? You are right, the correctness of LoadB#971(Ctrl Mem Addr) depends on idealization orders, i.e. Ideal Mem->Ideal Addr->Ideal Mem [Different alias idx, crash] Ideal Addr->Ideal Mem->Ideal Mem [OK] I checked the idealization of Addr but it seems it behaves well. 969 AddP === _ 250 250 585 969 AddP === _ 250 250 585 969 AddP === _ 335 335 585 473 AddP === _ 335 335 585 ![image](https://user-images.githubusercontent.com/5010047/194793775-5295297c-3a96-4c77-8414-09a085e5af89.png) AddP#969 was changed from `byte[int:>=0]:exact+any *` to `byte[int:8]:NotNull:exact[0] *,iid=177` because its input was changed from CastPP#250(`byte[int:>=0]:exact+any *`) to CheckCastPP#225(`byte[int:8]:NotNull:exact[0] *,iid=177`) whose alias type is. 
Idealization of Mem also behaves well: it steps through MergeMem by the alias type of Addr(`byte[int:>=0]:exact+any *` or `byte[int:8]:NotNull:exact[0] *,iid=177`) and it is changed to 1109 or 119 accordingly. ------------- PR: https://git.openjdk.org/jdk/pull/9777 From qamai at openjdk.org Mon Oct 10 03:22:51 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 10 Oct 2022 03:22:51 GMT Subject: RFR: 8292898: [vectorapi] Unify vector mask cast operation [v5] In-Reply-To: References: <9DHNZglc1nl35eN3euIu6naGNrE0TK8BC3Pqo8nO8-k=.b6453f1f-8af3-47d5-86fd-3268da5e5347@github.com> <5qCRwvjc4xVc8ub7swVXg2XibftsERW4kU_ElQnsFz0=.a79cd2cd-8dd6-4471-96cf-cd74922e085d@github.com> <_RyMIzGdooywmMxf9o1A1zh3etNYpKScHPn5-EvoivA=.6c256471-3d7b-4987-b26f-097a282d977e@github.com> Message-ID: <4a3bVCBRGd55nwMccEIp1GPukSjjeidEauiWdBHv2ew=.48a2c0d0-f72b-4bb5-bb3a-f8e4841ff40a@github.com> On Mon, 10 Oct 2022 03:11:45 GMT, Xiaohong Gong wrote: >>> > > > Hi @jatin-bhateja , the IR test has been added. Could you please help to review again? Thanks a lot! >>> > > >>> > > >>> > > Some of the IR tests like testByte64ToLong512 are currently failing on KNL due to the following check https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L2484 >>> > > since source and destination ideal types are different (TypeVect vs TypeVectMask), can you kindly change the feature check for relevant IR tests to avx512vl till we remove that limitation. >>> > >>> > >>> > Thanks for pointing out this issue. Sure, I will limit the feature check to "avx512vl" for all the 512 bits related casting. BTW, could you please show me how to run the test with KNL feature? So that I can have an internal test before pushing the changes. Thanks a lot! >>> >>> Hi @jatin-bhateja , the test is updated. I tested it with `-XX:+UseKNLSetting` by adding the flag to >> `TestFramework.runWithFlags()` in the main function, and tests pass. Could you please help to check whether it is ok for you?
>> Thanks!, we can also pass additional flag in JTREG_WHITELIST_FLAGS in TestFramework.java >> Thanks a lot! >> >> Hi @XiaohongGong , Thanks for addressing my comments, test now passes on KNL platform. >> Newly introduced @Warmup annotation in all the tests looks redundant since in NORMAL run-mode framework does the necessary warmup followed by compilation by "C2" (default compiler). > >> > > > > Hi @jatin-bhateja , the IR test has been added. Could you please help to review again? Thanks a lot! >> > > > >> > > > >> > > > Some of the IR tests like testByte64ToLong512 are currently failing on KNL due to the following check https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L2484 >> > > > since source and destination ideal types are different (TypeVect vs TypeVectMask), can you kindly change the feature check for relevant IR tests to avx512vl till we remove that limitation. >> > > >> > > >> > > Thanks for pointing out this issue. Sure, I will limit the feature check to "avx512vl" for all the 512 bits related casting. BTW, could you please show me how to run the test with KNL feature? So that I can have an internal test before pushing the changes. Thanks a lot! >> > >> > >> > Hi @jatin-bhateja , the test is updated. I tested it with `-XX:+UseKNLSetting` by adding the flag to >> > `TestFramework.runWithFlags()` in the main function, and tests pass. Could you please help to check whether it is ok for you? >> > Thanks!, we can also pass additional flag in JTREG_WHITELIST_FLAGS in TestFramework.java >> > Thanks a lot! >> >> Hi @XiaohongGong , Thanks for addressing my comments, test now passes on KNL platform. Newly introduced @Warmup annotation in all the tests looks redundant since in NORMAL run-mode framework does the necessary warmup followed by compilation by "C2" (default compiler). > > Hi @jatin-bhateja , thanks for looking at the changes again! Yes, you are right that the framework has a default warmup (2000).
But I'd like to keep the newly added 10000 here, because I met the IR check failing issues when I wrote another IR test and set the warmup as 5000 before. To be honest, I don't know why it fails since the method is compiled by C2, but the compiler shows it lost some information, which meant the expected IR was not generated. So adding the larger warmup is safe for me. WDYT? Thanks! @XiaohongGong You can set default warmup iterations using `TestFramework::setDefaultWarmup` instead of annotating all methods. ------------- PR: https://git.openjdk.org/jdk/pull/10192 From xgong at openjdk.org Mon Oct 10 03:27:53 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 10 Oct 2022 03:27:53 GMT Subject: RFR: 8292898: [vectorapi] Unify vector mask cast operation [v5] In-Reply-To: References: <9DHNZglc1nl35eN3euIu6naGNrE0TK8BC3Pqo8nO8-k=.b6453f1f-8af3-47d5-86fd-3268da5e5347@github.com> <5qCRwvjc4xVc8ub7swVXg2XibftsERW4kU_ElQnsFz0=.a79cd2cd-8dd6-4471-96cf-cd74922e085d@github.com> <_RyMIzGdooywmMxf9o1A1zh3etNYpKScHPn5-EvoivA=.6c256471-3d7b-4987-b26f-097a282d977e@github.com> Message-ID: On Mon, 10 Oct 2022 03:11:45 GMT, Xiaohong Gong wrote: >>> > > > Hi @jatin-bhateja , the IR test has been added. Could you please help to review again? Thanks a lot! >>> > > >>> > > >>> > > Some of the IR tests like testByte64ToLong512 are currently failing on KNL due to the following check https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L2484 >>> > > since source and destination ideal types are different (TypeVect vs TypeVectMask), can you kindly change the feature check for relevant IR tests to avx512vl till we remove that limitation. >>> > >>> > >>> > Thanks for pointing out this issue. Sure, I will limit the feature check to "avx512vl" for all the 512 bits related casting. BTW, could you please show me how to run the test with KNL feature? So that I can have an internal test before pushing the changes. Thanks a lot!
>>> >>> Hi @jatin-bhateja , the test is updated. I tested it with `-XX:+UseKNLSetting` by adding the flag to >> `TestFramework.runWithFlags()` in the main function, and tests pass. Could you please help to check whether it is ok for you? >> Thanks!, we can also pass additional flag in JTREG_WHITELIST_FLAGS in TestFramework.java >> Thanks a lot! >> >> Hi @XiaohongGong , Thanks for addressing my comments, test now passes on KNL platform. >> Newly introduced @Warmup annotation in all the tests looks redundant since in NORMAL run-mode framework does the necessary warmup followed by compilation by "C2" (default compiler). > >> > > > > Hi @jatin-bhateja , the IR test has been added. Could you please help to review again? Thanks a lot! >> > > > >> > > > >> > > > Some of the IR tests like testByte64ToLong512 are currently failing on KNL due to the following check https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L2484 >> > > > since source and destination ideal types are different (TypeVect vs TypeVectMask), can you kindly change the feature check for relevant IR tests to avx512vl till we remove that limitation. >> > > >> > > >> > > Thanks for pointing out this issue. Sure, I will limit the feature check to "avx512vl" for all the 512 bits related casting. BTW, could you please show me how to run the test with KNL feature? So that I can have an internal test before pushing the changes. Thanks a lot! >> > >> > >> > Hi @jatin-bhateja , the test is updated. I tested it with `-XX:+UseKNLSetting` by adding the flag to >> > `TestFramework.runWithFlags()` in the main function, and tests pass. Could you please help to check whether it is ok for you? >> > Thanks!, we can also pass additional flag in JTREG_WHITELIST_FLAGS in TestFramework.java >> > Thanks a lot! >> >> Hi @XiaohongGong , Thanks for addressing my comments, test now passes on KNL platform.
Newly introduced @Warmup annotation in all the tests looks redundant since in NORMAL run-mode framework does the necessary warmup followed by compilation by "C2" (default compiler). > > Hi @jatin-bhateja , thanks for looking at the changes again! Yes, you are right that the framework has a default warmup (2000). But I'd like to keep the newly added 10000 here, because I met the IR check failing issues when I wrote another IR test and set the warmup as 5000 before. To be honest, I don't know why it fails since the method is compiled by C2, but the compiler shows it lost some information, which meant the expected IR was not generated. So adding the larger warmup is safe for me. WDYT? Thanks! > @XiaohongGong You can set default warmup iterations using `TestFramework::setDefaultWarmup` instead of annotating all methods. Good idea. I will change it this way and try to set a smaller warmup. Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/10192 From xgong at openjdk.org Mon Oct 10 04:14:57 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 10 Oct 2022 04:14:57 GMT Subject: RFR: 8292898: [vectorapi] Unify vector mask cast operation [v8] In-Reply-To: References: Message-ID: > The current implementation of the vector mask cast operation is > so complex that the compiler generates different patterns for different > scenarios. For architectures that do not support the predicate > feature, vector mask is represented the same as the normal vector. > So the vector mask cast is implemented by the `VectorCast` node. But this > is not always needed. When two masks have the same element size (e.g. > int vs. float), their bit layouts are the same. So casting between > them does not need to emit any instructions. > > Currently the compiler generates different patterns based on the > vector type of the input/output and the platforms.
Normally the > "`VectorMaskCast`" op is only used for cases that don't emit any > instructions, and the "`VectorCast`" op is used to implement the necessary > expand/narrow operations. This can avoid adding some duplicate rules > in the backend. However, this also has drawbacks: > > 1) The code is complex, especially when the compiler needs to > check whether the hardware supports the necessary IRs for the > vector mask cast. It needs to check different patterns for > different cases. > 2) The vector mask cast operation could be implemented with cheaper > instructions than the vector casting on some architectures. > > Instead of generating `VectorCast` or `VectorMaskCast` nodes for different > cases of vector mask cast operations, this patch unifies the vector > mask cast implementation with the "`VectorMaskCast`" node for all vector types > and platforms. The missing backend rules are also added for it. > > This patch also simplifies the vector mask conversion that happens in > "`VectorUnbox::Ideal()`". Normally "`VectorUnbox (VectorBox vmask)`" can > be optimized to "`vmask`" if the unboxing type matches the boxed > "`vmask`" type. Otherwise, it needs the type conversion. Currently the > "`VectorUnbox`" will be transformed to two different patterns to implement > the conversion: > > 1) If the element size is not changed, it is transformed to: > > "VectorMaskCast vmask" > > 2) Otherwise, it is transformed to: > > "VectorLoadMask (VectorStoreMask vmask)" > > It first converts the "`vmask`" to a boolean vector with "`VectorStoreMask`", > and then uses "`VectorLoadMask`" to convert the boolean vector to the > dst mask vector. Since this patch makes the "`VectorMaskCast`" op supported > for all types on all platforms, it doesn't need the "`VectorLoadMask`" and > "`VectorStoreMask`" to do the conversion.
The existing transformation: > > VectorUnbox (VectorBox vmask) => VectorLoadMask (VectorStoreMask vmask) > > can be simplified to: > > VectorUnbox (VectorBox vmask) => VectorMaskCast vmask Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 17 commits: - Use "setDefaultWarmup" instead of adding the annotation for each test - Merge branch 'jdk:master' into JDK-8292898 - Change to use "avx512vl" cpu feature for some IR tests - Add the IR test and fix review comments on x86 backend - Remove untaken code paths on x86 match rules - Add assertion to the elem num for mast cast - Merge branch 'jdk:master' into JDK-8292898 - 8292898: [vectorapi] Unify vector mask cast operation - Merge branch 'jdk:master' into JDK-8291600 - Address review comments - ... and 7 more: https://git.openjdk.org/jdk/compare/8713dfa6...3845f926 ------------- Changes: https://git.openjdk.org/jdk/pull/10192/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10192&range=07 Stats: 596 lines in 10 files changed: 289 ins; 141 del; 166 mod Patch: https://git.openjdk.org/jdk/pull/10192.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10192/head:pull/10192 PR: https://git.openjdk.org/jdk/pull/10192 From jbhateja at openjdk.org Mon Oct 10 06:04:31 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 10 Oct 2022 06:04:31 GMT Subject: RFR: 8292898: [vectorapi] Unify vector mask cast operation [v5] In-Reply-To: <_RyMIzGdooywmMxf9o1A1zh3etNYpKScHPn5-EvoivA=.6c256471-3d7b-4987-b26f-097a282d977e@github.com> References: <9DHNZglc1nl35eN3euIu6naGNrE0TK8BC3Pqo8nO8-k=.b6453f1f-8af3-47d5-86fd-3268da5e5347@github.com> <5qCRwvjc4xVc8ub7swVXg2XibftsERW4kU_ElQnsFz0=.a79cd2cd-8dd6-4471-96cf-cd74922e085d@github.com> <_RyMIzGdooywmMxf9o1A1zh3etNYpKScHPn5-EvoivA=.6c256471-3d7b-4987-b26f-097a282d977e@github.com> Message-ID: On Mon, 10 Oct 2022 02:57:17 GMT, Jatin Bhateja wrote: >>> Hi @jatin-bhateja , the IR test has been added. 
Could you please help to review again? Thanks a lot! >> >> Some of the IR tests like testByte64ToLong512 are currently failing on KNL due to the following check >> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L2484 >> >> since source and destination ideal types are different (TypeVect vs TypeVectMask), can you kindly change the feature check for relevant IR tests to avx512vl till we remove that limitation. > >> > > > Hi @jatin-bhateja , the IR test has been added. Could you please help to review again? Thanks a lot! >> > > >> > > >> > > Some of the IR tests like testByte64ToLong512 are currently failing on KNL due to the following check https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L2484 >> > > since source and destination ideal types are different (TypeVect vs TypeVectMask), can you kindly change the feature check for relevant IR tests to avx512vl till we remove that limitation. >> > >> > >> > Thanks for pointing out this issue. Sure, I will limit the feature check to "avx512vl" for all the 512 bits related casting. BTW, could you please show me how to run the test with KNL feature? So that I can have an internal test before pushing the changes. Thanks a lot! >> >> Hi @jatin-bhateja , the test is updated. I tested it with `-XX:+UseKNLSetting` by adding the flag to > `TestFramework.runWithFlags()` in the main function, and tests pass. Could you please help to check whether it is ok for you? > Thanks!, we can also pass additional flag in JTREG_WHITELIST_FLAGS in TestFramework.java > Thanks a lot! > > Hi @XiaohongGong , Thanks for addressing my comments, test now passes on KNL platform. > Newly introduced @Warmup annotation in all the tests looks redundant since in NORMAL run-mode framework does the necessary warmup followed by compilation by "C2" (default compiler). > > > > > > Hi @jatin-bhateja , the IR test has been added. Could you please help to review again? Thanks a lot!
> > > > > > > > > > > > > > > Some of the IR tests like testByte64ToLong512 are currently failing on KNL due to the following check https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L2484 > > > > > since source and destination ideal types are different (TypeVect vs TypeVectMask), can you kindly change the feature check for relevant IR tests to avx512vl till we remove that limitation. > > > > > > > > > Thanks for pointing out this issue. Sure, I will limit the feature check to "avx512vl" for all the 512 bits related casting. BTW, could you please show me how to run the test with KNL feature? So that I can have an internal test before pushing the changes. Thanks a lot! > > > > > > > > > > > > Hi @jatin-bhateja , the test is updated. I tested it with `-XX:+UseKNLSetting` by adding the flag to > > > `TestFramework.runWithFlags()` in the main function, and tests pass. Could you please help to check whether it is ok for you? > > > Thanks!, we can also pass additional flag in JTREG_WHITELIST_FLAGS in TestFramework.java > > > Thanks a lot! > > > > > > Hi @XiaohongGong , Thanks for addressing my comments, test now passes on KNL platform. Newly introduced @Warmup annotation in all the tests looks redundant since in NORMAL run-mode framework does the necessary warmup followed by compilation by "C2" (default compiler). > > Hi @jatin-bhateja , thanks for looking at the changes again! Yes, you are right that the framework has a default warmup (2000). But I'd like to keep the newly added 10000 here, because I met the IR check failing issues when I wrote another IR test and set the warmup as 5000 before. To be honest, I don't know why it fails since the method is compiled by C2, but the compiler shows it lost some information, which meant the expected IR was not generated. So adding the larger warmup is safe for me. WDYT? Thanks!
The framework is using white box APIs to enqueue test methods to compile queues from which they will be picked up by the respective compilers, so test method compilation here is agnostic to the warmup invocation count, but warmup will ensure that some of the closed-world assumptions needed for intrinsification are met. Changes still look good to me. Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/10192 From xgong at openjdk.org Mon Oct 10 06:15:56 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 10 Oct 2022 06:15:56 GMT Subject: RFR: 8292898: [vectorapi] Unify vector mask cast operation [v5] In-Reply-To: References: <9DHNZglc1nl35eN3euIu6naGNrE0TK8BC3Pqo8nO8-k=.b6453f1f-8af3-47d5-86fd-3268da5e5347@github.com> <5qCRwvjc4xVc8ub7swVXg2XibftsERW4kU_ElQnsFz0=.a79cd2cd-8dd6-4471-96cf-cd74922e085d@github.com> <_RyMIzGdooywmMxf9o1A1zh3etNYpKScHPn5-EvoivA=.6c256471-3d7b-4987-b26f-097a282d977e@github.com> Message-ID: <55fN4Wpe8aav4Fz8kwxe62I-1zozT8VMtBgWDMKHw2k=.c728aa35-ec2e-4d0b-a8ac-7112db0793f5@github.com> On Mon, 10 Oct 2022 06:00:38 GMT, Jatin Bhateja wrote: > > > > > > > Hi @jatin-bhateja , the IR test has been added. Could you please help to review again? Thanks a lot! > > > > > > > > > > > > > > > > > > Some of the IR tests like testByte64ToLong512 are currently failing on KNL due to the following check https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L2484 > > > > > > since source and destination ideal types are different (TypeVect vs TypeVectMask), can you kindly change the feature check for relevant IR tests to avx512vl till we remove that limitation. > > > > > > > > > > > > > > > Thanks for pointing out this issue. Sure, I will limit the feature check to "avx512vl" for all the 512 bits related casting. BTW, could you please show me how to run the test with KNL feature? So that I can have an internal test before pushing the changes. Thanks a lot! > > > > > > > > > > > > Hi @jatin-bhateja , the test is updated.
I tested it with `-XX:+UseKNLSetting` by adding the flag to > > > > `TestFramework.runWithFlags()` in the main function, and tests pass. Could you please help to check whether it is ok for you? > > > > Thanks!, we can also pass additional flag in JTREG_WHITELIST_FLAGS in TestFramework.java > > > > Thanks a lot! > > > > > > > > > Hi @XiaohongGong , Thanks for addressing my comments, test now passes on KNL platform. Newly introduced @Warmup annotation in all the tests looks redundant since in NORMAL run-mode framework does the necessary warmup followed by compilation by "C2" (default compiler). > > > > > > Hi @jatin-bhateja , thanks for looking at the changes again! Yes, you are right that the framework has a default warmup (2000). But I'd like to keep the newly added 10000 here, because I met the IR check failing issues when I wrote another IR test and set the warmup as 5000 before. To be honest, I don't know why it fails since the method is compiled by C2, but the compiler shows it lost some information, which meant the expected IR was not generated. So adding the larger warmup is safe for me. WDYT? Thanks! > > The framework is using white box APIs to enqueue test methods to compile queues from which they will be picked up by the respective compilers, so test method compilation here is agnostic to the warmup invocation count, but warmup will ensure that some of the closed-world assumptions needed for intrinsification are met. > > Changes still look good to me. Thanks! I see. Thanks a lot for the clarification and reviewing!
------------- PR: https://git.openjdk.org/jdk/pull/10192 From fgao at openjdk.org Mon Oct 10 06:26:16 2022 From: fgao at openjdk.org (Fei Gao) Date: Mon, 10 Oct 2022 06:26:16 GMT Subject: RFR: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov Message-ID: After JDK-8139340, JDK-8192846 and JDK-8289422, we can vectorize the case below by enabling -XX:+UseCMoveUnconditionally and -XX:+UseVectorCmov:

    // double[] a, double[] b, double[] c;
    for (int i = 0; i < a.length; i++) {
        c[i] = (a[i] > b[i]) ? a[i] : b[i];
    }

But we don't support a case like:

    // double[] a;
    // int seed;
    for (int i = 0; i < a.length; i++) {
        a[i] = (i % 2 == 0) ? seed + i : seed - i;
    }

because the IR nodes for the CMoveD in the loop are:

    AddI AndI  AddD SubD
      \   /    /   /
      CmpI    /   /
        \    /   /
       Bool /   /
         \ /   /
        CMoveD

and it is not our target pattern, which requires that the inputs of the Cmp node be the same as the inputs of the CMove node, as commented in CMoveKit::make_cmovevd_pack(). Because we can't vectorize the CMoveD pack, we shouldn't vectorize its inputs, AddD and SubD, either. But the current function CMoveKit::make_cmovevd_pack() doesn't clear the unqualified CMoveD pack from the packset. In this way, superword wrongly vectorizes AddD and SubD. Finally, we get a scalar CMoveD node with two vector inputs, AddVD and SubVD, which illegally mixes scalar and vector types, and the assertion fails. To fix it, we need to remove the unvectorized CMoveD pack from the packset and clear the related map info.
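For reference, the unsupported loop shape can be reproduced with a self-contained snippet; this is only an illustrative sketch (the class and method names are invented here, not taken from the patch or its test):

```java
import java.util.Arrays;

// Standalone sketch of the problematic loop shape from the report: the
// CMove's condition (i % 2 == 0) shares no inputs with the selected values
// (seed + i vs. seed - i), so the CMoveD pack is unqualified for
// vectorization and its AddD/SubD inputs must not be vectorized either.
public class CmoveShape {
    static double[] fill(int n, int seed) {
        double[] a = new double[n];
        for (int i = 0; i < n; i++) {
            // Condition inputs (i, 2) differ from the value inputs.
            a[i] = (i % 2 == 0) ? seed + i : seed - i;
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fill(4, 10)));
        // prints [10.0, 9.0, 12.0, 7.0]
    }
}
```

Running a loop like fill() with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov is what exposes the mixed scalar/vector CMoveD described above.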
------------- Commit messages: - 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov Changes: https://git.openjdk.org/jdk/pull/10627/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10627&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8293833 Stats: 61 lines in 4 files changed: 42 ins; 5 del; 14 mod Patch: https://git.openjdk.org/jdk/pull/10627.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10627/head:pull/10627 PR: https://git.openjdk.org/jdk/pull/10627 From dzhang at openjdk.org Mon Oct 10 06:39:21 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 10 Oct 2022 06:39:21 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src Message-ID: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> While cross-compiling, I built hsdis from the binutils source code with the following parameters:

    --with-hsdis=binutils \
    --with-binutils-src=/home/dingli/jdk-tools/binutils-2.38

But configure will exit with the following error:

    checking whether we are cross compiling... configure: error: in `/home/dingli/jdk-tools/binutils-2.38-src':
    configure: error: cannot run C compiled programs.
    If you meant to cross compile, use `--host'.
    See `config.log' for more details
    configure: Automatic building of binutils failed on configure. Try building it manually
    configure: error: Cannot continue
    configure exiting with result code 1

The reason for the error is that binutils wants to be configured with --host during cross-compilation.
So we can determine if we are currently cross-compiling and add the --host parameter to binutils_target: diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 index d72bbf6df32..dddc1cf6a4d 100644 --- a/make/autoconf/lib-hsdis.m4 +++ b/make/autoconf/lib-hsdis.m4 @@ -175,7 +175,11 @@ AC_DEFUN([LIB_BUILD_BINUTILS], fi else binutils_cc="$CC $SYSROOT_CFLAGS" - binutils_target="" + if test "x$host" = "x$build"; then + binutils_target="" + else + binutils_target="--host=$host" + fi fi binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" In the meantime, I removed some useless code about hsdis-demo because hsdis-demo.c was removed in [JDK-8275128](https://bugs.openjdk.org/browse/JDK-8275128) . ## Testing: - cross compile for RISC-V on x86_64 ------------- Commit messages: - Remove useless code related to hsdis-demo.c in Makefile - Add hsdis-src support for cross-compile Changes: https://git.openjdk.org/jdk/pull/10628/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10628&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295033 Stats: 14 lines in 2 files changed: 4 ins; 8 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10628.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10628/head:pull/10628 PR: https://git.openjdk.org/jdk/pull/10628 From chagedorn at openjdk.org Mon Oct 10 06:47:38 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 10 Oct 2022 06:47:38 GMT Subject: RFR: 8295005: compiler/loopopts/TestRemoveEmptyLoop.java fails with release VMs after JDK-8294839 [v2] In-Reply-To: References: Message-ID: On Sun, 9 Oct 2022 06:11:18 GMT, Jie Fu wrote: >> Since we need to set `-XX:StressLongCountedLoop=0` to avoid timeout of the test. >> So only run it with debug VMs. >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Use IgnoreUnrecognizedVMOptions Looks good. 
------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10617 From yyang at openjdk.org Mon Oct 10 06:56:28 2022 From: yyang at openjdk.org (Yi Yang) Date: Mon, 10 Oct 2022 06:56:28 GMT Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v2] In-Reply-To: References: Message-ID: > Hi can I have a review for this fix? LoadBNode::Ideal crashes after performing GVN right after EA. The bad IR is as follows: > > ![image](https://user-images.githubusercontent.com/5010047/183106710-3a518e5e-0b59-4c3c-aba4-8b6fcade3519.png) > > The memory input of Load#971 is Phi#1109 and the address input of Load#971 is AddP whose object base is CheckCastPP#335: > > The type of Phi#1109 is `byte[int:>=0]:exact+any *` while `byte[int:8]:NotNull:exact+any *,iid=177` is the type of CheckCastPP#335 due to EA, they have different alias index, that's why we hit the assertion at L226: > > https://github.com/openjdk/jdk/blob/b17a745d7f55941f02b0bdde83866aa5d32cce07/src/hotspot/share/opto/memnode.cpp#L207-L226 > (t is `byte[int:>=0]:exact+any *`, t_adr is `byte[int:8]:NotNull:exact+any *,iid=177`). > > There is a long story. 
In the beginning, LoadB#971 is generated at array_copy_forward, and GVN transformed it iteratively: > > 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1109 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > ... 
> > In this case, we get alias index 5 from address input AddP#969, and step it through MergeMem#1046, we found Phi#1109 then, that's why LoadB->in(Mem) is changed from MergeMem#1046 to Phi#1109 (Which finally leads to crash). > > 1046 MergeMem === _ 1 160 389 389 1109 1 1 389 1 1 1 1 1 1 1 1 1 1 1 1 1 709 709 709 709 882 888 894 190 190 912 191 [[ 1025 1021 1017 1013 1009 1005 1002 1001 998 996 991 986 981 976 971 966 962 961 960 121 122 123 124 1027 ]] > > > After applying this patch, some related nodes are pushed into the GVN worklist, before stepping through MergeMem#1046, the address input is already changed to AddP#473. i.e., we get alias index 32 from address input AddP#473, and step it through MergeMem#1046, we found StoreB#191 then,LoadB->in(Mem) is changed from MergeMem#1046 to StoreB#191. > > 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1046 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) 
DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 
[[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 390 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > ... > > The well-formed IR looks like this: > ![image](https://user-images.githubusercontent.com/5010047/183239456-7096ea66-6fca-4c84-8f46-8c42d10b686a.png) > > Thanks for your patience. 
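As a rough illustration of the lookup described above, here is a toy model of how a load picks its memory slice out of a MergeMem by alias index. This is plain Java, not HotSpot code; the class and its methods are invented for this sketch, and only the node names and the indices 5/32 are borrowed from the log:

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for C2's MergeMem: one defining node per alias index,
// plus a generic "bottom" slice for indices with no specific entry.
public class MergeMemSketch {
    private final Map<Integer, String> slices = new HashMap<>();
    private final String bottom;

    public MergeMemSketch(String bottom) { this.bottom = bottom; }

    public void setSlice(int aliasIdx, String def) { slices.put(aliasIdx, def); }

    // A load steps through the MergeMem with the alias index derived from
    // its *address* input, so a stale address yields a stale memory slice.
    public String sliceFor(int aliasIdx) {
        return slices.getOrDefault(aliasIdx, bottom);
    }

    public static void main(String[] args) {
        MergeMemSketch mem = new MergeMemSketch("Phi#1109");  // generic byte[] memory
        mem.setSlice(32, "StoreB#191");                       // iid=177 instance slice
        System.out.println(mem.sliceFor(5));   // stale index 5 -> Phi#1109 (the crash path)
        System.out.println(mem.sliceFor(32));  // refined index 32 -> StoreB#191 (the fix)
    }
}
```

In these terms, the patch ensures the address input is already updated to AddP#473 before the memory input is stepped, so the lookup happens with index 32 rather than 5.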
Yi Yang has updated the pull request incrementally with one additional commit since the last revision: fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9777/files - new: https://git.openjdk.org/jdk/pull/9777/files/cecb86f8..063d2468 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9777&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9777&range=00-01 Stats: 31 lines in 4 files changed: 18 ins; 8 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/9777.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9777/head:pull/9777 PR: https://git.openjdk.org/jdk/pull/9777 From epeter at openjdk.org Mon Oct 10 06:55:38 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 10 Oct 2022 06:55:38 GMT Subject: RFR: 8295005: compiler/loopopts/TestRemoveEmptyLoop.java fails with release VMs after JDK-8294839 [v2] In-Reply-To: References: Message-ID: On Sun, 9 Oct 2022 06:11:18 GMT, Jie Fu wrote: >> Since we need to set `-XX:StressLongCountedLoop=0` to avoid timeout of the test. >> So only run it with debug VMs. >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Use IgnoreUnrecognizedVMOptions Looks good, thanks for fixing it! (FYI: I'm not a reviewer yet) ------------- Marked as reviewed by epeter (Committer). PR: https://git.openjdk.org/jdk/pull/10617 From jiefu at openjdk.org Mon Oct 10 07:06:04 2022 From: jiefu at openjdk.org (Jie Fu) Date: Mon, 10 Oct 2022 07:06:04 GMT Subject: RFR: 8295005: compiler/loopopts/TestRemoveEmptyLoop.java fails with release VMs after JDK-8294839 [v2] In-Reply-To: References: Message-ID: On Mon, 10 Oct 2022 06:44:16 GMT, Christian Hagedorn wrote: >> Jie Fu has updated the pull request incrementally with one additional commit since the last revision: >> >> Use IgnoreUnrecognizedVMOptions > > Looks good. Thanks @chhagedorn and @eme64 . 
------------- PR: https://git.openjdk.org/jdk/pull/10617 From jiefu at openjdk.org Mon Oct 10 07:10:21 2022 From: jiefu at openjdk.org (Jie Fu) Date: Mon, 10 Oct 2022 07:10:21 GMT Subject: Integrated: 8295005: compiler/loopopts/TestRemoveEmptyLoop.java fails with release VMs after JDK-8294839 In-Reply-To: References: Message-ID: On Sat, 8 Oct 2022 00:13:04 GMT, Jie Fu wrote: > Since we need to set `-XX:StressLongCountedLoop=0` to avoid timeout of the test. > So only run it with debug VMs. > > Thanks. > Best regards, > Jie This pull request has now been integrated. Changeset: 6ed74ef6 Author: Jie Fu URL: https://git.openjdk.org/jdk/commit/6ed74ef654f0b3e5c748895654d6925e2b832732 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8295005: compiler/loopopts/TestRemoveEmptyLoop.java fails with release VMs after JDK-8294839 Reviewed-by: chagedorn, epeter ------------- PR: https://git.openjdk.org/jdk/pull/10617 From qamai at openjdk.org Mon Oct 10 08:20:58 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 10 Oct 2022 08:20:58 GMT Subject: RFR: 8292898: [vectorapi] Unify vector mask cast operation [v8] In-Reply-To: References: Message-ID: On Mon, 10 Oct 2022 04:14:57 GMT, Xiaohong Gong wrote: >> The current implementation of the vector mask cast operation is >> so complex that the compiler generates different patterns for different >> scenarios. For architectures that do not support the predicate >> feature, a vector mask is represented the same as a normal vector. >> So the vector mask cast is implemented by a `VectorCast` node. But this >> is not always needed. When two masks have the same element size (e.g. >> int vs. float), their bit layouts are the same. So casting between >> them does not need to emit any instructions. >> >> Currently the compiler generates different patterns based on the >> vector type of the input/output and the platforms. 
Normally the >> "`VectorMaskCast`" op is only used for cases that don't emit any >> instructions, and the "`VectorCast`" op is used to implement the necessary >> expand/narrow operations. This can avoid adding some duplicate rules >> in the backend. However, this also has drawbacks: >> >> 1) The code is complex, especially when the compiler needs to >> check whether the hardware supports the necessary IRs for the >> vector mask cast. It needs to check different patterns for >> different cases. >> 2) The vector mask cast operation could be implemented with cheaper >> instructions than the vector casting on some architectures. >> >> Instead of generating `VectorCast` or `VectorMaskCast` nodes for different >> cases of vector mask cast operations, this patch unifies the vector >> mask cast implementation with the "`VectorMaskCast`" node for all vector types >> and platforms. The missing backend rules are also added for it. >> >> This patch also simplifies the vector mask conversion that happens in >> "`VectorUnbox::Ideal()`". Normally "`VectorUnbox (VectorBox vmask)`" can >> be optimized to "`vmask`" if the unboxing type matches the boxed >> "`vmask`" type. Otherwise, it needs a type conversion. Currently the >> "`VectorUnbox`" will be transformed to two different patterns to implement >> the conversion: >> >> 1) If the element size is not changed, it is transformed to: >> >> "VectorMaskCast vmask" >> >> 2) Otherwise, it is transformed to: >> >> "VectorLoadMask (VectorStoreMask vmask)" >> >> It first converts the "`vmask`" to a boolean vector with "`VectorStoreMask`", >> and then uses "`VectorLoadMask`" to convert the boolean vector to the >> dst mask vector. Since this patch makes the "`VectorMaskCast`" op supported >> for all types on all platforms, it doesn't need the "`VectorLoadMask`" and >> "`VectorStoreMask`" to do the conversion. 
The existing transformation: >> >> VectorUnbox (VectorBox vmask) => VectorLoadMask (VectorStoreMask vmask) >> >> can be simplified to: >> >> VectorUnbox (VectorBox vmask) => VectorMaskCast vmask > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 17 commits: > > - Use "setDefaultWarmup" instead of adding the annotation for each test > - Merge branch 'jdk:master' into JDK-8292898 > - Change to use "avx512vl" cpu feature for some IR tests > - Add the IR test and fix review comments on x86 backend > - Remove untaken code paths on x86 match rules > - Add assertion to the elem num for mast cast > - Merge branch 'jdk:master' into JDK-8292898 > - 8292898: [vectorapi] Unify vector mask cast operation > - Merge branch 'jdk:master' into JDK-8291600 > - Address review comments > - ... and 7 more: https://git.openjdk.org/jdk/compare/8713dfa6...3845f926 Actually I also encountered intrinsification failures while working on [JDK-8259610](https://bugs.openjdk.java.net/browse/JDK-8259610) when setting the warmup iterations too low (the `INVOCATIONS` is set to 10000 in those tests). The cause is unknown to me, probably because some information fails to be propagated through the inlining. This can be seen frequently using `-XX:+PrintIntrinsics`, although the compiler will eventually manage to get the required constant information. As a result, I think setting a warmup iterations of 10000 is alright here. Thanks. 
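The layout argument quoted above ("when two masks have the same element size, their bit layouts are the same") can be made concrete with a small sketch. This models a mask on hardware without predicate registers, with each lane stored as elemBits copies of the lane's boolean value; it is purely illustrative, not the Vector API or C2 code:

```java
public class MaskLayoutSketch {
    // Encode a mask the way a non-predicate target represents it: each lane
    // becomes elemBits bits of all-ones (true) or all-zeros (false).
    static String encode(boolean[] lanes, int elemBits) {
        StringBuilder sb = new StringBuilder();
        for (boolean lane : lanes) {
            sb.append((lane ? "1" : "0").repeat(elemBits));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        boolean[] mask = {true, false, true, true};
        // int -> float: both are 32-bit lanes, so the encodings are identical
        // and a VectorMaskCast between them needs no instructions.
        System.out.println(encode(mask, 32).equals(encode(mask, 32)));  // true
        // int -> long: lanes widen from 32 to 64 bits, so the cast really
        // has to expand each lane.
        System.out.println(encode(mask, 32).equals(encode(mask, 64)));  // false
    }
}
```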
------------- PR: https://git.openjdk.org/jdk/pull/10192 From chagedorn at openjdk.org Mon Oct 10 08:58:32 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 10 Oct 2022 08:58:32 GMT Subject: RFR: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov In-Reply-To: References: Message-ID: On Mon, 10 Oct 2022 06:12:11 GMT, Fei Gao wrote: > After JDK-8139340, JDK-8192846 and JDK-8289422, we can vectorize > the case below by enabling -XX:+UseCMoveUnconditionally and > -XX:+UseVectorCmov: > > // double[] a, double[] b, double[] c; > for (int i = 0; i < a.length; i++) { > c[i] = (a[i] > b[i]) ? a[i] : b[i]; > } > > > But we don't support the case like: > > // double[] a; > // int seed; > for (int i = 0; i < a.length; i++) { > a[i] = (i % 2 == 0) ? seed + i : seed - i; > } > > because the IR nodes for the CMoveD in the loop is: > > AddI AndI AddD SubD > \ / / / > CmpI / / > \ / / > Bool / / > \ / / > CMoveD > > > and it is not our target pattern, which requires that the inputs > of Cmp node must be the same as the inputs of CMove node > as commented in CMoveKit::make_cmovevd_pack(). Because > we can't vectorize the CMoveD pack, we shouldn't vectorize > its inputs, AddD and SubD. But the current function > CMoveKit::make_cmovevd_pack() doesn't clear the unqualified > CMoveD pack from the packset. In this way, superword wrongly > vectorizes AddD and SubD. Finally, we get a scalar CMoveD > node with two vector inputs, AddVD and SubVD, which has > wrong mixing types, then the assertion fails. > > To fix it, we need to remove the unvectorized CMoveD pack > from the packset and clear related map info. Otherwise, the fix looks reasonable to me! src/hotspot/share/opto/superword.cpp line 1981: > 1979: } > 1980: > 1981: Node_List* new_cmpd_pk = new Node_List(); The following suggestion is just an idea as I was a little bit confused by how you use the return value of `make_cmovevd_pack` to remove the cmove pack and its related packs. 
Intuitively, I would have expected that this "make method" returns the newly created pack instead. Maybe it's cleaner if you split this method into a "should merge" method with the check if ((cmovd->Opcode() != Op_CMoveF && cmovd->Opcode() != Op_CMoveD) || pack(cmovd) != NULL /* already in the cmov pack */) { return NULL; } a "can merge" method that checks all the other constraints and an actual "make pack" method with the code starting at this line. Then you could use these methods in `merge_packs_to_cmovd` like that in pseudo-code: void SuperWord::merge_packs_to_cmovd() { for (int i = _packset.length() - 1; i >= 0; i--) { Node_List* pack = _packset.at(i); if (_cmovev_kit.should_merge(pack)) { if (_cmovev_kit.can_merge(pack)) { _cmovev_kit.make_cmovevd_pack(pack) } else { remove_cmove_and_related_packs(pack); } } } ... ------------- PR: https://git.openjdk.org/jdk/pull/10627 From chagedorn at openjdk.org Mon Oct 10 09:26:57 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 10 Oct 2022 09:26:57 GMT Subject: RFR: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov In-Reply-To: References: Message-ID: On Mon, 10 Oct 2022 06:12:11 GMT, Fei Gao wrote: > After JDK-8139340, JDK-8192846 and JDK-8289422, we can vectorize > the case below by enabling -XX:+UseCMoveUnconditionally and > -XX:+UseVectorCmov: > > // double[] a, double[] b, double[] c; > for (int i = 0; i < a.length; i++) { > c[i] = (a[i] > b[i]) ? a[i] : b[i]; > } > > > But we don't support the case like: > > // double[] a; > // int seed; > for (int i = 0; i < a.length; i++) { > a[i] = (i % 2 == 0) ? seed + i : seed - i; > } > > because the IR nodes for the CMoveD in the loop is: > > AddI AndI AddD SubD > \ / / / > CmpI / / > \ / / > Bool / / > \ / / > CMoveD > > > and it is not our target pattern, which requires that the inputs > of Cmp node must be the same as the inputs of CMove node > as commented in CMoveKit::make_cmovevd_pack(). 
Because > we can't vectorize the CMoveD pack, we shouldn't vectorize > its inputs, AddD and SubD. But the current function > CMoveKit::make_cmovevd_pack() doesn't clear the unqualified > CMoveD pack from the packset. In this way, superword wrongly > vectorizes AddD and SubD. Finally, we get a scalar CMoveD > node with two vector inputs, AddVD and SubVD, which has > wrong mixing types, then the assertion fails. > > To fix it, we need to remove the unvectorized CMoveD pack > from the packset and clear related map info. test/hotspot/jtreg/compiler/c2/TestCondAddDeadBranch.java line 32: > 30: * @run main/othervm -Xcomp -XX:-TieredCompilation -XX:CompileOnly=TestCondAddDeadBranch TestCondAddDeadBranch > 31: * @run main/othervm -Xcomp -XX:-TieredCompilation -XX:CompileOnly=TestCondAddDeadBranch > 32: * -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov -XX:MaxVectorSize=32 TestCondAddDeadBranch As the cmove flags are C2 specific, we should also add a `@requires vm.compiler2.enabled`. Same for the other test. ------------- PR: https://git.openjdk.org/jdk/pull/10627 From qamai at openjdk.org Mon Oct 10 09:36:55 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 10 Oct 2022 09:36:55 GMT Subject: RFR: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov In-Reply-To: References: Message-ID: <3L3jBCCbTRD4b1tmu1m3L6kxmZJYWs6S7x9frmO87-k=.0a86f24e-7f81-41b5-9ac8-9e039b3cf240@github.com> On Mon, 10 Oct 2022 06:12:11 GMT, Fei Gao wrote: > After JDK-8139340, JDK-8192846 and JDK-8289422, we can vectorize > the case below by enabling -XX:+UseCMoveUnconditionally and > -XX:+UseVectorCmov: > > // double[] a, double[] b, double[] c; > for (int i = 0; i < a.length; i++) { > c[i] = (a[i] > b[i]) ? a[i] : b[i]; > } > > > But we don't support the case like: > > // double[] a; > // int seed; > for (int i = 0; i < a.length; i++) { > a[i] = (i % 2 == 0) ? 
seed + i : seed - i; > } > > because the IR nodes for the CMoveD in the loop is: > > AddI AndI AddD SubD > \ / / / > CmpI / / > \ / / > Bool / / > \ / / > CMoveD > > > and it is not our target pattern, which requires that the inputs > of Cmp node must be the same as the inputs of CMove node > as commented in CMoveKit::make_cmovevd_pack(). Because > we can't vectorize the CMoveD pack, we shouldn't vectorize > its inputs, AddD and SubD. But the current function > CMoveKit::make_cmovevd_pack() doesn't clear the unqualified > CMoveD pack from the packset. In this way, superword wrongly > vectorizes AddD and SubD. Finally, we get a scalar CMoveD > node with two vector inputs, AddVD and SubVD, which has > wrong mixing types, then the assertion fails. > > To fix it, we need to remove the unvectorized CMoveD pack > from the packset and clear related map info. May I ask if we can vectorise `Bool -> Cmp` into `VectorMaskCmp` and `CMove` into `VectorBlend`, this would help vectorise the pattern you mention in the description instead of bailing out? Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10627 From rcastanedalo at openjdk.org Mon Oct 10 12:03:49 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 10 Oct 2022 12:03:49 GMT Subject: RFR: 8294356: IGV: scheduled graphs contain duplicated elements [v2] In-Reply-To: References: Message-ID: On Thu, 29 Sep 2022 07:55:31 GMT, Roberto Casta?eda Lozano wrote: >> This changeset removes duplicated nodes and edges from graph dumps that include a control-flow graph: >> ![cfg-before-after](https://user-images.githubusercontent.com/8792647/192294554-73ca3927-dab3-4d8f-9503-0904a4da7434.png) >> This is achieved by ensuring that HotSpot only visits each node once when dumping IGV graphs. >> >> #### Testing >> - Tested that tens of thousands of graphs do not contain duplicated nodes or edges by instrumenting IGV and running `java -Xcomp -XX:PrintIdealGraphLevel=4`. 
>> - Tested manually that unscheduled graphs are not affected by this changeset. >> - Tested that running compiler tests with `-XX:PrintIdealGraphLevel=3` does not trigger any failure. > > Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Use nullptr everywhere, remove empty line May I get a second review for this one? The changes are small and debug-only. Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/10423 From erikj at openjdk.org Mon Oct 10 12:53:54 2022 From: erikj at openjdk.org (Erik Joelsson) Date: Mon, 10 Oct 2022 12:53:54 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src In-Reply-To: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: On Mon, 10 Oct 2022 06:32:09 GMT, Dingli Zhang wrote: > I built hsdis with the following parameters from source code of binutils while cross-compiling: > > --with-hsdis=binutils \ > --with-binutils-src=/home/dingli/jdk-tools/binutils-2.38 > > > But configure will exit with the following error: > > checking whether we are cross compiling... configure: error: in `/home/dingli/jdk-tools/binutils-2.38-src': > configure: error: cannot run C compiled programs. > If you meant to cross compile, use `--host'. > See `config.log' for more details > configure: Automatic building of binutils failed on configure. Try building it manually > configure: error: Cannot continue > configure exiting with result code 1 > > > The reason for the error is that binutils wants to be configured with --host during cross-compilation. 
So we can determine if we are currently cross-compiling and add the --host parameter to binutils_target: > > diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 > index d72bbf6df32..dddc1cf6a4d 100644 > --- a/make/autoconf/lib-hsdis.m4 > +++ b/make/autoconf/lib-hsdis.m4 > @@ -175,7 +175,11 @@ AC_DEFUN([LIB_BUILD_BINUTILS], > fi > else > binutils_cc="$CC $SYSROOT_CFLAGS" > - binutils_target="" > + if test "x$host" = "x$build"; then > + binutils_target="" > + else > + binutils_target="--host=$host" > + fi > fi > binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" > > > > In the meantime, I removed some useless code about hsdis-demo because hsdis-demo.c was removed in [JDK-8275128](https://bugs.openjdk.org/browse/JDK-8275128) . > > ## Testing: > > - cross compile for RISC-V on x86_64 Build changes look ok to me. ------------- Marked as reviewed by erikj (Reviewer). PR: https://git.openjdk.org/jdk/pull/10628 From aph at openjdk.org Mon Oct 10 13:34:57 2022 From: aph at openjdk.org (Andrew Haley) Date: Mon, 10 Oct 2022 13:34:57 GMT Subject: RFR: 8294262: AArch64: compiler/vectorapi/TestReverseByteTransforms.java test failed on SVE machine [v2] In-Reply-To: References: Message-ID: On Sat, 1 Oct 2022 00:21:24 GMT, Eric Liu wrote: >> This test failed at cases test_reversebytes_short/int/long_transform2, which expected the ReversBytesV node, but nothing was finally found. On SVE system, we have a specific optimization, `ReverseBytesV (ReverseBytesV X MASK) MASK => X`, which eliminates both ReverseBytesV nodes. This optimization rule is specifically on hardware with native predicate support. See https://github.com/openjdk/jdk/pull/9623 for more details. >> >> As there is an SVE specific case TestReverseByteTransformsSVE.java, this patch simply marks TestReverseByteTransforms.java as non-SVE only. 
>> >> [TEST] >> jdk/incubator/vector, hotspot/compiler/vectorapi pass on SVE machine > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > add comment > > Change-Id: I4c17256ff656528bbcfcacd2ee2380df6ae14bf1 Marked as reviewed by aph (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10442 From dzhang at openjdk.org Mon Oct 10 13:36:46 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 10 Oct 2022 13:36:46 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src In-Reply-To: References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: On Mon, 10 Oct 2022 12:50:08 GMT, Erik Joelsson wrote: >> I built hsdis with the following parameters from source code of binutils while cross-compiling: >> >> --with-hsdis=binutils \ >> --with-binutils-src=/home/dingli/jdk-tools/binutils-2.38 >> >> >> But configure will exit with the following error: >> >> checking whether we are cross compiling... configure: error: in `/home/dingli/jdk-tools/binutils-2.38-src': >> configure: error: cannot run C compiled programs. >> If you meant to cross compile, use `--host'. >> See `config.log' for more details >> configure: Automatic building of binutils failed on configure. Try building it manually >> configure: error: Cannot continue >> configure exiting with result code 1 >> >> >> The reason for the error is that binutils wants to be configured with --host during cross-compilation. 
So we can determine if we are currently cross-compiling and add the --host parameter to binutils_target: >> >> diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 >> index d72bbf6df32..dddc1cf6a4d 100644 >> --- a/make/autoconf/lib-hsdis.m4 >> +++ b/make/autoconf/lib-hsdis.m4 >> @@ -175,7 +175,11 @@ AC_DEFUN([LIB_BUILD_BINUTILS], >> fi >> else >> binutils_cc="$CC $SYSROOT_CFLAGS" >> - binutils_target="" >> + if test "x$host" = "x$build"; then >> + binutils_target="" >> + else >> + binutils_target="--host=$host" >> + fi >> fi >> binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" >> >> >> >> In the meantime, I removed some useless code about hsdis-demo because hsdis-demo.c was removed in [JDK-8275128](https://bugs.openjdk.org/browse/JDK-8275128) . >> >> ## Testing: >> >> - cross compile for RISC-V on x86_64 > > Build changes look ok to me. @erikj79 Thanks for the review! ------------- PR: https://git.openjdk.org/jdk/pull/10628 From thartmann at openjdk.org Mon Oct 10 14:06:54 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 10 Oct 2022 14:06:54 GMT Subject: RFR: 8294356: IGV: scheduled graphs contain duplicated elements [v2] In-Reply-To: References: Message-ID: On Thu, 29 Sep 2022 07:55:31 GMT, Roberto Casta?eda Lozano wrote: >> This changeset removes duplicated nodes and edges from graph dumps that include a control-flow graph: >> ![cfg-before-after](https://user-images.githubusercontent.com/8792647/192294554-73ca3927-dab3-4d8f-9503-0904a4da7434.png) >> This is achieved by ensuring that HotSpot only visits each node once when dumping IGV graphs. >> >> #### Testing >> - Tested that tens of thousands of graphs do not contain duplicated nodes or edges by instrumenting IGV and running `java -Xcomp -XX:PrintIdealGraphLevel=4`. >> - Tested manually that unscheduled graphs are not affected by this changeset. >> - Tested that running compiler tests with `-XX:PrintIdealGraphLevel=3` does not trigger any failure. 
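The build/host check in the m4 hunk quoted above reduces to one string comparison. As a sketch (a hypothetical helper, not part of the actual build system), the same decision looks like this in Java:

```java
public class BinutilsHostFlag {
    // Pass --host to binutils' configure only when the build and host
    // triplets differ, i.e. only when we are actually cross-compiling.
    static String hostFlag(String build, String host) {
        return build.equals(host) ? "" : "--host=" + host;
    }

    public static void main(String[] args) {
        // Native build: no flag needed.
        System.out.println(hostFlag("x86_64-pc-linux-gnu", "x86_64-pc-linux-gnu"));
        // Cross build for RISC-V, as in the testing section above.
        System.out.println(hostFlag("x86_64-pc-linux-gnu", "riscv64-unknown-linux-gnu"));
    }
}
```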
> > Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Use nullptr everywhere, remove empty line Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10423 From rcastanedalo at openjdk.org Mon Oct 10 14:18:45 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 10 Oct 2022 14:18:45 GMT Subject: RFR: 8294356: IGV: scheduled graphs contain duplicated elements [v2] In-Reply-To: References: Message-ID: On Mon, 10 Oct 2022 14:03:19 GMT, Tobias Hartmann wrote: > Looks good to me. Thanks for reviewing, Tobias! ------------- PR: https://git.openjdk.org/jdk/pull/10423 From jbhateja at openjdk.org Mon Oct 10 17:16:47 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 10 Oct 2022 17:16:47 GMT Subject: RFR: 8288043: Optimize FP to word/sub-word integral type conversion on X86 AVX2 platforms [v8] In-Reply-To: References: Message-ID: > Hi All, > > This patch extends conversion optimizations added with [JDK-8287835](https://bugs.openjdk.org/browse/JDK-8287835) to optimize following floating point to integral conversions for X86 AVX2 targets:- > * D2I , D2S, D2B, F2I , F2S, F2B > > In addition, it also optimizes following wide vector (64 bytes) double to integer and sub-type conversions for AVX512 targets which do not support AVX512DQ feature. > * D2I, D2S, D2B > > Following are the JMH micro performance results with and without patch. 
>
> System configuration: 40C 2S Icelake server (Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz)
>
> BENCHMARK | SIZE | BASELINE (ops/ms) | WITHOPT (ops/ms) | PERF GAIN FACTOR
> -- | -- | -- | -- | --
> VectorFPtoIntCastOperations.microDouble128ToByte128 | 1024 | 90.603 | 92.797 | 1.024215534
> VectorFPtoIntCastOperations.microDouble128ToByte256 | 1024 | 81.909 | 82.3 | 1.00477359
> VectorFPtoIntCastOperations.microDouble128ToByte512 | 1024 | 26.181 | 26.244 | 1.002406325
> VectorFPtoIntCastOperations.microDouble128ToInteger128 | 1024 | 90.74 | 2537.958 | 27.96956138
> VectorFPtoIntCastOperations.microDouble128ToInteger256 | 1024 | 81.586 | 2429.599 | 29.7796068
> VectorFPtoIntCastOperations.microDouble128ToInteger512 | 1024 | 19.406 | 19.61 | 1.010512213
> VectorFPtoIntCastOperations.microDouble128ToLong128 | 1024 | 91.723 | 90.754 | 0.989435583
> VectorFPtoIntCastOperations.microDouble128ToShort128 | 1024 | 91.766 | 1984.577 | 21.62649565
> VectorFPtoIntCastOperations.microDouble128ToShort256 | 1024 | 81.949 | 1940.599 | 23.68056962
> VectorFPtoIntCastOperations.microDouble128ToShort512 | 1024 | 16.468 | 16.56 | 1.005586592
> VectorFPtoIntCastOperations.microDouble256ToByte128 | 1024 | 163.331 | 3018.351 | 18.479964
> VectorFPtoIntCastOperations.microDouble256ToByte256 | 1024 | 148.878 | 3082.034 | 20.70174237
> VectorFPtoIntCastOperations.microDouble256ToByte512 | 1024 | 50.108 | 51.629 | 1.030354434
> VectorFPtoIntCastOperations.microDouble256ToInteger128 | 1024 | 159.805 | 4619.421 | 28.90661118
> VectorFPtoIntCastOperations.microDouble256ToInteger256 | 1024 | 143.876 | 4649.642 | 32.31700909
> VectorFPtoIntCastOperations.microDouble256ToInteger512 | 1024 | 38.127 | 38.188 | 1.001599916
> VectorFPtoIntCastOperations.microDouble256ToLong128 | 1024 | 160.322 | 162.442 | 1.013223388
> VectorFPtoIntCastOperations.microDouble256ToLong256 | 1024 | 141.252 | 143.01 | 1.012445841
> VectorFPtoIntCastOperations.microDouble256ToShort128 | 1024 | 157.717 | 3757.471 | 23.82413437
> VectorFPtoIntCastOperations.microDouble256ToShort256 | 1024 | 143.876 | 3830.971 | 26.62689399
> VectorFPtoIntCastOperations.microDouble256ToShort512 | 1024 | 32.061 | 32.911 | 1.026511962
> VectorFPtoIntCastOperations.microFloat128ToByte128 | 1024 | 146.599 | 4002.967 | 27.30555461
> VectorFPtoIntCastOperations.microFloat128ToByte256 | 1024 | 136.99 | 3938.799 | 28.75245638
> VectorFPtoIntCastOperations.microFloat128ToByte512 | 1024 | 51.561 | 50.284 | 0.975233219
> VectorFPtoIntCastOperations.microFloat128ToInteger128 | 1024 | 5933.565 | 5361.472 | 0.903583596
> VectorFPtoIntCastOperations.microFloat128ToInteger256 | 1024 | 5079.564 | 5062.046 | 0.996551279
> VectorFPtoIntCastOperations.microFloat128ToInteger512 | 1024 | 37.101 | 38.419 | 1.035524649
> VectorFPtoIntCastOperations.microFloat128ToLong128 | 1024 | 145.863 | 145.362 | 0.99656527
> VectorFPtoIntCastOperations.microFloat128ToLong256 | 1024 | 131.159 | 133.154 | 1.015210546
> VectorFPtoIntCastOperations.microFloat128ToShort128 | 1024 | 145.966 | 4150.039 | 28.4315457
> VectorFPtoIntCastOperations.microFloat128ToShort256 | 1024 | 134.703 | 4566.589 | 33.90116775
> VectorFPtoIntCastOperations.microFloat128ToShort512 | 1024 | 31.878 | 30.867 | 0.968285338
> VectorFPtoIntCastOperations.microFloat256ToByte128 | 1024 | 237.841 | 6292.051 | 26.4548627
> VectorFPtoIntCastOperations.microFloat256ToByte256 | 1024 | 222.041 | 6292.748 | 28.34047766
> VectorFPtoIntCastOperations.microFloat256ToByte512 | 1024 | 92.073 | 88.981 | 0.966417951
> VectorFPtoIntCastOperations.microFloat256ToInteger128 | 1024 | 11471.121 | 10269.636 | 0.895260019
> VectorFPtoIntCastOperations.microFloat256ToInteger256 | 1024 | 10729.816 | 10105.92 | 0.941853989
> VectorFPtoIntCastOperations.microFloat256ToInteger512 | 1024 | 68.328 | 70.005 | 1.024543379
> VectorFPtoIntCastOperations.microFloat256ToLong128 | 1024 | 247.101 | 248.571 | 1.005948984
> VectorFPtoIntCastOperations.microFloat256ToLong256 | 1024 | 225.74 | 223.987 |
0.992234429 > VectorFPtoIntCastOperations.microFloat256ToLong512 | 1024 | 76.39 | 76.187 | 0.997342584 > VectorFPtoIntCastOperations.microFloat256ToShort128 | 1024 | 233.196 | 8202.179 | 35.17289748 > VectorFPtoIntCastOperations.microFloat256ToShort256 | 1024 | 220.75 | 7781.073 | 35.24834881 > VectorFPtoIntCastOperations.microFloat256ToShort512 | 1024 | 58.143 | 55.633 | 0.956830573 > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8288043: Review comments resolutions. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9748/files - new: https://git.openjdk.org/jdk/pull/9748/files/f54ea603..5fb99e81 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9748&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9748&range=06-07 Stats: 290 lines in 2 files changed: 49 ins; 0 del; 241 mod Patch: https://git.openjdk.org/jdk/pull/9748.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9748/head:pull/9748 PR: https://git.openjdk.org/jdk/pull/9748 From jbhateja at openjdk.org Mon Oct 10 17:16:51 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 10 Oct 2022 17:16:51 GMT Subject: RFR: 8288043: Optimize FP to word/sub-word integral type conversion on X86 AVX2 platforms [v7] In-Reply-To: References: Message-ID: On Sat, 1 Oct 2022 00:28:16 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8288043: Adding descriptive comments. 
> > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4611: > >> 4609: void C2_MacroAssembler::vector_castF2L_evex(XMMRegister dst, XMMRegister src, XMMRegister xtmp1, XMMRegister xtmp2, >> 4610: KRegister ktmp1, KRegister ktmp2, AddressLiteral double_sign_flip, >> 4611: Register rscratch, int vec_enc) { > > Need an assert here: > assert(rscratch != noreg || always_reachable(double_sign_flip), "missing"); Hi @sviswa7, assertions are part of leaf level macro assembly routine which is vector_cast_float_to_long_special_cases_evex in this case. ------------- PR: https://git.openjdk.org/jdk/pull/9748 From sviswanathan at openjdk.org Mon Oct 10 18:12:01 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Mon, 10 Oct 2022 18:12:01 GMT Subject: RFR: 8288043: Optimize FP to word/sub-word integral type conversion on X86 AVX2 platforms [v8] In-Reply-To: References: Message-ID: On Mon, 10 Oct 2022 17:16:47 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch extends conversion optimizations added with [JDK-8287835](https://bugs.openjdk.org/browse/JDK-8287835) to optimize following floating point to integral conversions for X86 AVX2 targets:- >> * D2I , D2S, D2B, F2I , F2S, F2B >> >> In addition, it also optimizes following wide vector (64 bytes) double to integer and sub-type conversions for AVX512 targets which do not support AVX512DQ feature. >> * D2I, D2S, D2B >> >> Following are the JMH micro performance results with and without patch. 
>> >> System configuration: 40C 2S Icelake server (Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz) >> >> BENCHMARK | SIZE | BASELINE (ops/ms) | WITHOPT (ops/ms) | PERF GAIN FACTOR >> -- | -- | -- | -- | -- >> VectorFPtoIntCastOperations.microDouble128ToByte128 | 1024 | 90.603 | 92.797 | 1.024215534 >> VectorFPtoIntCastOperations.microDouble128ToByte256 | 1024 | 81.909 | 82.3 | 1.00477359 >> VectorFPtoIntCastOperations.microDouble128ToByte512 | 1024 | 26.181 | 26.244 | 1.002406325 >> VectorFPtoIntCastOperations.microDouble128ToInteger128 | 1024 | 90.74 | 2537.958 | 27.96956138 >> VectorFPtoIntCastOperations.microDouble128ToInteger256 | 1024 | 81.586 | 2429.599 | 29.7796068 >> VectorFPtoIntCastOperations.microDouble128ToInteger512 | 1024 | 19.406 | 19.61 | 1.010512213 >> VectorFPtoIntCastOperations.microDouble128ToLong128 | 1024 | 91.723 | 90.754 | 0.989435583 >> VectorFPtoIntCastOperations.microDouble128ToShort128 | 1024 | 91.766 | 1984.577 | 21.62649565 >> VectorFPtoIntCastOperations.microDouble128ToShort256 | 1024 | 81.949 | 1940.599 | 23.68056962 >> VectorFPtoIntCastOperations.microDouble128ToShort512 | 1024 | 16.468 | 16.56 | 1.005586592 >> VectorFPtoIntCastOperations.microDouble256ToByte128 | 1024 | 163.331 | 3018.351 | 18.479964 >> VectorFPtoIntCastOperations.microDouble256ToByte256 | 1024 | 148.878 | 3082.034 | 20.70174237 >> VectorFPtoIntCastOperations.microDouble256ToByte512 | 1024 | 50.108 | 51.629 | 1.030354434 >> VectorFPtoIntCastOperations.microDouble256ToInteger128 | 1024 | 159.805 | 4619.421 | 28.90661118 >> VectorFPtoIntCastOperations.microDouble256ToInteger256 | 1024 | 143.876 | 4649.642 | 32.31700909 >> VectorFPtoIntCastOperations.microDouble256ToInteger512 | 1024 | 38.127 | 38.188 | 1.001599916 >> VectorFPtoIntCastOperations.microDouble256ToLong128 | 1024 | 160.322 | 162.442 | 1.013223388 >> VectorFPtoIntCastOperations.microDouble256ToLong256 | 1024 | 141.252 | 143.01 | 1.012445841 >> VectorFPtoIntCastOperations.microDouble256ToShort128 | 1024 | 
157.717 | 3757.471 | 23.82413437 >> VectorFPtoIntCastOperations.microDouble256ToShort256 | 1024 | 143.876 | 3830.971 | 26.62689399 >> VectorFPtoIntCastOperations.microDouble256ToShort512 | 1024 | 32.061 | 32.911 | 1.026511962 >> VectorFPtoIntCastOperations.microFloat128ToByte128 | 1024 | 146.599 | 4002.967 | 27.30555461 >> VectorFPtoIntCastOperations.microFloat128ToByte256 | 1024 | 136.99 | 3938.799 | 28.75245638 >> VectorFPtoIntCastOperations.microFloat128ToByte512 | 1024 | 51.561 | 50.284 | 0.975233219 >> VectorFPtoIntCastOperations.microFloat128ToInteger128 | 1024 | 5933.565 | 5361.472 | 0.903583596 >> VectorFPtoIntCastOperations.microFloat128ToInteger256 | 1024 | 5079.564 | 5062.046 | 0.996551279 >> VectorFPtoIntCastOperations.microFloat128ToInteger512 | 1024 | 37.101 | 38.419 | 1.035524649 >> VectorFPtoIntCastOperations.microFloat128ToLong128 | 1024 | 145.863 | 145.362 | 0.99656527 >> VectorFPtoIntCastOperations.microFloat128ToLong256 | 1024 | 131.159 | 133.154 | 1.015210546 >> VectorFPtoIntCastOperations.microFloat128ToShort128 | 1024 | 145.966 | 4150.039 | 28.4315457 >> VectorFPtoIntCastOperations.microFloat128ToShort256 | 1024 | 134.703 | 4566.589 | 33.90116775 >> VectorFPtoIntCastOperations.microFloat128ToShort512 | 1024 | 31.878 | 30.867 | 0.968285338 >> VectorFPtoIntCastOperations.microFloat256ToByte128 | 1024 | 237.841 | 6292.051 | 26.4548627 >> VectorFPtoIntCastOperations.microFloat256ToByte256 | 1024 | 222.041 | 6292.748 | 28.34047766 >> VectorFPtoIntCastOperations.microFloat256ToByte512 | 1024 | 92.073 | 88.981 | 0.966417951 >> VectorFPtoIntCastOperations.microFloat256ToInteger128 | 1024 | 11471.121 | 10269.636 | 0.895260019 >> VectorFPtoIntCastOperations.microFloat256ToInteger256 | 1024 | 10729.816 | 10105.92 | 0.941853989 >> VectorFPtoIntCastOperations.microFloat256ToInteger512 | 1024 | 68.328 | 70.005 | 1.024543379 >> VectorFPtoIntCastOperations.microFloat256ToLong128 | 1024 | 247.101 | 248.571 | 1.005948984 >> 
VectorFPtoIntCastOperations.microFloat256ToLong256 | 1024 | 225.74 | 223.987 | 0.992234429 >> VectorFPtoIntCastOperations.microFloat256ToLong512 | 1024 | 76.39 | 76.187 | 0.997342584 >> VectorFPtoIntCastOperations.microFloat256ToShort128 | 1024 | 233.196 | 8202.179 | 35.17289748 >> VectorFPtoIntCastOperations.microFloat256ToShort256 | 1024 | 220.75 | 7781.073 | 35.24834881 >> VectorFPtoIntCastOperations.microFloat256ToShort512 | 1024 | 58.143 | 55.633 | 0.956830573 >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8288043: Review comments resolutions. Marked as reviewed by sviswanathan (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9748 From kvn at openjdk.org Mon Oct 10 20:06:57 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 10 Oct 2022 20:06:57 GMT Subject: RFR: 8288043: Optimize FP to word/sub-word integral type conversion on X86 AVX2 platforms [v4] In-Reply-To: References: Message-ID: On Mon, 19 Sep 2022 14:15:19 GMT, Jatin Bhateja wrote: >> Could you please enable the compiler/vectorapi/VectorFPtoIntCastTest.java for AVX2 platforms? >> Currently they are only run for AVX512DQ platforms. > >> Could you please enable the compiler/vectorapi/VectorFPtoIntCastTest.java for AVX2 platforms? Currently they are only run for AVX512DQ platforms. > > I have added missing casting cases AVX/AVX2 and AVX512 targets in existing comprehensive test for [casting](test/hotspot/jtreg/compiler/vectorapi/reshape/utils/TestCastMethods.java.) @jatin-bhateja, please merge latest JDK and I will start re-testing. 
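As background for the conversions reviewed in this thread: the vectorized casts must match Java's scalar narrowing semantics, which is what the special-case fixup paths (for NaN and out-of-range inputs) exist to preserve. A minimal scalar sketch of those rules follows — illustrative only, not code from the patch, and the class name is hypothetical:

```java
public class FpToIntSemantics {
    static void check(boolean cond, String what) {
        if (!cond) throw new AssertionError(what);
    }

    public static void main(String[] args) {
        // NaN narrows to zero rather than to an arbitrary bit pattern
        check((int) Double.NaN == 0, "NaN narrows to 0");
        // Out-of-range values saturate at the target type's extremes
        check((int) Double.POSITIVE_INFINITY == Integer.MAX_VALUE, "+Inf saturates to MAX_VALUE");
        check((int) Double.NEGATIVE_INFINITY == Integer.MIN_VALUE, "-Inf saturates to MIN_VALUE");
        check((long) Float.POSITIVE_INFINITY == Long.MAX_VALUE, "+Inf to long saturates");
        // In-range values truncate toward zero
        check((int) 3.9d == 3, "positive truncation");
        check((int) -3.9d == -3, "negative truncation");
        System.out.println("all FP->int special cases hold");
    }
}
```

These are the rules of the Java Language Specification's narrowing primitive conversion, so any vectorized D2I/D2S/D2B/F2I/F2S/F2B sequence has to produce the same results lane by lane.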
------------- PR: https://git.openjdk.org/jdk/pull/9748 From svkamath at openjdk.org Mon Oct 10 20:15:54 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Mon, 10 Oct 2022 20:15:54 GMT Subject: RFR: 8289552: Make intrinsic conversions between bit representations of half precision values and floats [v8] In-Reply-To: References: <8LEJXqdKQPQe3lNuMSQql9YLgbcESJzfupkgORdvsFc=.807157d6-4506-4f04-ba20-a032d6ba973c@github.com> Message-ID: <8Hz7TtN3qVWn324XlTdBZCCdUbSQfBFwNudrf65mMIs=.9b78e1af-2a21-469e-961b-607e36884637@github.com> On Fri, 30 Sep 2022 17:24:31 GMT, Vladimir Kozlov wrote: >> @vnkozlov I spoke too soon. All the GHA tests passed in the dummy draft PR I created using Smita's patch: >> https://github.com/openjdk/jdk/pull/10500 >> Please take a look. No build failures reported and all tier1 tests passed. > >> @sviswa7 The failure is due to [JDK-8293618](https://bugs.openjdk.org/browse/JDK-8293618), @smita-kamath please merge with master. Thanks. > > Yes, I tested with latest JDK sources which includes JDK-8293618. @vnkozlov, I have implemented all of the reviewers comments. Could you kindly test this patch? Thanks a lot for your help. ------------- PR: https://git.openjdk.org/jdk/pull/9781 From kvn at openjdk.org Mon Oct 10 21:10:10 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 10 Oct 2022 21:10:10 GMT Subject: RFR: 8289552: Make intrinsic conversions between bit representations of half precision values and floats [v13] In-Reply-To: References: Message-ID: <66be8SJdxPOqmqsQ1YIwS4zM4GwPerypGIf8IbfxhRs=.1d03c94a-f3e5-40ae-999e-bdd5f328170d@github.com> On Thu, 6 Oct 2022 06:28:04 GMT, Smita Kamath wrote: >> 8289552: Make intrinsic conversions between bit representations of half precision values and floats > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Updated instruct to use kmovw I started new testing. 
------------- PR: https://git.openjdk.org/jdk/pull/9781 From dlong at openjdk.org Mon Oct 10 21:18:52 2022 From: dlong at openjdk.org (Dean Long) Date: Mon, 10 Oct 2022 21:18:52 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v5] In-Reply-To: References: Message-ID: On Sat, 8 Oct 2022 15:42:31 GMT, Quan Anh Mai wrote: >> Hi, >> >> The current peephole mechanism has several drawbacks: >> - Can only match and remove adjacent instructions. >> - Cannot match machine ideal nodes (e.g. MachSpillCopyNode). >> - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. >> - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. >> >> The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grained manner. >> >> The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: >> >> mov r1, r2 -> lea r1, [r2 + r3/i] >> add r1, r3/i >> >> and >> >> mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 >> shl r1, i >> >> On the added benchmarks, the transformations show positive results: >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 1200.490 ± 104.662 ns/op >> LeaPeephole.B_D_long avgt 5 1211.439 ± 30.196 ns/op >> LeaPeephole.B_I_int avgt 5 1118.831 ± 7.995 ns/op >> LeaPeephole.B_I_long avgt 5 1112.389 ± 15.838 ns/op >> LeaPeephole.I_S_int avgt 5 1262.528 ± 7.293 ns/op >> LeaPeephole.I_S_long avgt 5 1223.820 ± 17.777 ns/op >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 860.889 ± 6.089 ns/op >> LeaPeephole.B_D_long avgt 5 945.455 ± 21.603 ns/op >> LeaPeephole.B_I_int avgt 5 849.109 ± 9.809 ns/op >> LeaPeephole.B_I_long avgt 5 851.283 ± 16.921 ns/op >> LeaPeephole.I_S_int avgt 5 976.594 ± 23.004 ns/op >> LeaPeephole.I_S_long avgt 5 936.984 ± 9.601 ns/op >> >> A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet, although printing the scheduling itself has been made possible recently. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > refactor includes Marked as reviewed by dlong (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/8025 From kvn at openjdk.org Mon Oct 10 22:43:58 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 10 Oct 2022 22:43:58 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v5] In-Reply-To: References: Message-ID: On Sat, 8 Oct 2022 15:42:31 GMT, Quan Anh Mai wrote: >> Hi, >> >> The current peephole mechanism has several drawbacks: >> - Can only match and remove adjacent instructions. >> - Cannot match machine ideal nodes (e.g. MachSpillCopyNode). >> - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. >> - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. >> >> The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grained manner.
>> >> The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: >> >> mov r1, r2 -> lea r1, [r2 + r3/i] >> add r1, r3/i >> >> and >> >> mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 >> shl r1, i >> >> On the added benchmarks, the transformations show positive results: >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 1200.490 ± 104.662 ns/op >> LeaPeephole.B_D_long avgt 5 1211.439 ± 30.196 ns/op >> LeaPeephole.B_I_int avgt 5 1118.831 ± 7.995 ns/op >> LeaPeephole.B_I_long avgt 5 1112.389 ± 15.838 ns/op >> LeaPeephole.I_S_int avgt 5 1262.528 ± 7.293 ns/op >> LeaPeephole.I_S_long avgt 5 1223.820 ± 17.777 ns/op >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 860.889 ± 6.089 ns/op >> LeaPeephole.B_D_long avgt 5 945.455 ± 21.603 ns/op >> LeaPeephole.B_I_int avgt 5 849.109 ± 9.809 ns/op >> LeaPeephole.B_I_long avgt 5 851.283 ± 16.921 ns/op >> LeaPeephole.I_S_int avgt 5 976.594 ± 23.004 ns/op >> LeaPeephole.I_S_long avgt 5 936.984 ± 9.601 ns/op >> >> A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > refactor includes I started build testing. But I can only verify 64-bit. @merykitty Can you verify 32 build too?
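To make the quoted rewrite concrete for readers outside the .ad file: a peephole scans adjacent instructions and fuses a matched pair into one. The string-level toy below models only the `mov` + `add` -> `lea` case; it is a hypothetical illustration (names invented here), not the actual C2 code, which matches Mach nodes within a basic block and must also respect register and flags constraints:

```java
import java.util.ArrayList;
import java.util.List;

public class ToyLeaPeephole {
    // Fuse "mov d s" immediately followed by "add d t" into "lea d [s + t]".
    static List<String> run(List<String> code) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < code.size(); i++) {
            if (i + 1 < code.size()) {
                String[] a = code.get(i).split(" ");
                String[] b = code.get(i + 1).split(" ");
                // Pattern: a "mov" whose destination is the "add"'s destination.
                if (a[0].equals("mov") && b[0].equals("add") && a[1].equals(b[1])) {
                    out.add("lea " + a[1] + " [" + a[2] + " + " + b[2] + "]");
                    i++; // skip the fused add
                    continue;
                }
            }
            out.add(code.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> in = List.of("mov r1 r2", "add r1 r3", "ret");
        System.out.println(run(in)); // fused into a single lea, then ret
    }
}
```

The real mechanism proposed in the PR generalizes this: instead of a fixed adjacent-pair pattern, a peep rule calls a function that is free to inspect and rewrite the whole basic block.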
------------- PR: https://git.openjdk.org/jdk/pull/8025 From vlivanov at openjdk.org Mon Oct 10 23:10:11 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Mon, 10 Oct 2022 23:10:11 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v5] In-Reply-To: References: Message-ID: On Sat, 8 Oct 2022 15:42:31 GMT, Quan Anh Mai wrote: >> Hi, >> >> The current peephole mechanism has several drawbacks: >> - Can only match and remove adjacent instructions. >> - Cannot match machine ideal nodes (e.g. MachSpillCopyNode). >> - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. >> - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. >> >> The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grained manner. >> >> The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: >> >> mov r1, r2 -> lea r1, [r2 + r3/i] >> add r1, r3/i >> >> and >> >> mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 >> shl r1, i >> >> On the added benchmarks, the transformations show positive results: >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 1200.490 ± 104.662 ns/op >> LeaPeephole.B_D_long avgt 5 1211.439 ± 30.196 ns/op >> LeaPeephole.B_I_int avgt 5 1118.831 ± 7.995 ns/op >> LeaPeephole.B_I_long avgt 5 1112.389 ± 15.838 ns/op >> LeaPeephole.I_S_int avgt 5 1262.528 ± 7.293 ns/op >> LeaPeephole.I_S_long avgt 5 1223.820 ± 17.777 ns/op >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 860.889 ± 6.089 ns/op >> LeaPeephole.B_D_long avgt 5 945.455 ± 21.603 ns/op >> LeaPeephole.B_I_int avgt 5 849.109 ± 9.809 ns/op >> LeaPeephole.B_I_long avgt 5 851.283 ± 16.921 ns/op >> LeaPeephole.I_S_int avgt 5 976.594 ± 23.004 ns/op >> LeaPeephole.I_S_long avgt 5 936.984 ± 9.601 ns/op >> >> A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet, although printing the scheduling itself has been made possible recently. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > refactor includes src/hotspot/cpu/x86/x86_64.ad line 324: > 322: > 323: source_hpp %{ > 324: #include CPU_HEADER(peephole) Why don't you simply include `peephole_x86_64.hpp` here? ------------- PR: https://git.openjdk.org/jdk/pull/8025 From vlivanov at openjdk.org Mon Oct 10 23:34:23 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Mon, 10 Oct 2022 23:34:23 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v13] In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 16:50:28 GMT, Cesar Soares Lucas wrote: >> Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? >> >> The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: >> 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). >> 2) Scalar Replace the incoming allocations to the RAM node. >> 3) Scalar Replace the RAM node itself. >> >> There are a few conditions for doing the replacement of the Phi by a RAM node though - although I plan to work on removing them in subsequent PRs: >> >> - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN.
>> >> These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: >> >> - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. >> - The way I check if there is an incoming Allocate node to the original Phi node. >> - The way I check if there is no store to the merged objects after they are merged. >> >> Testing: >> - Windows/Linux/MAC fastdebug/release >> - hotspot_all >> - tier1 >> - Renaissance >> - dacapo >> - new IR-based tests > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Fix x86 tests. Very nice work, Cesar. Most notably, I'm happy to see the test with so many non-trivial cases enumerated. Speaking of the proposed patch itself, I'm not done yet reviewing it. As of now, I don't fully grasp what's the purpose and motivation to introduce `ReducedAllocationMerge`. I would be grateful for additional information about how you ended up with the current design. (I went through the email thread, but it didn't help me.) In particular, I still don't understand how it interacts with existing scalar replacement logic when it comes to unique (per-allocation) memory slices. Also, on bailouts when the new analysis fails: I instrumented all possible failure modes with asserts and all the test failures I saw were in 2 places: src/hotspot/share/opto/callnode.cpp:1912 } else if (input->bottom_type()->base() == Type::Memory) { // Somehow the base was eliminated and we still have a memory reference left assert(false, ""); return NULL; } src/hotspot/share/opto/macro.cpp:2612 // In some cases the region controlling the RAM might go away due to some simplification // of the IR graph. For now, we'll just bail out if this happens. 
if (n->in(0) == NULL || !n->in(0)->is_Region()) { assert(false, ""); C->record_failure(C2Compiler::retry_no_reduce_allocation_merges()); return; } How hard would it be to extend the test with cases which demonstrate existing limitations? src/hotspot/share/opto/escape.hpp line 545: > 543: bool split_AddP(Node *addp, Node *base); > 544: > 545: PhiNode *create_split_phi(PhiNode *orig_phi, int alias_idx, GrowableArray *orig_phi_worklist, bool &new_created); What's the point of converting `orig_phi_worklist` into a pointer? ------------- PR: https://git.openjdk.org/jdk/pull/9073 From qamai at openjdk.org Tue Oct 11 00:24:30 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 11 Oct 2022 00:24:30 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v6] In-Reply-To: References: Message-ID: > Hi, > > The current peephole mechanism has several drawbacks: > - Can only match and remove adjacent instructions. > - Cannot match machine ideal nodes (e.g. MachSpillCopyNode). > - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. > - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. > > The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grained manner. > > The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: > > mov r1, r2 -> lea r1, [r2 + r3/i] > add r1, r3/i > > and > > mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 > shl r1, i > > On the added benchmarks, the transformations show positive results: > > Benchmark Mode Cnt Score Error Units > LeaPeephole.B_D_int avgt 5 1200.490 ± 104.662 ns/op > LeaPeephole.B_D_long avgt 5 1211.439 ± 30.196 ns/op > LeaPeephole.B_I_int avgt 5 1118.831 ± 7.995 ns/op > LeaPeephole.B_I_long avgt 5 1112.389 ± 15.838 ns/op > LeaPeephole.I_S_int avgt 5 1262.528 ± 7.293 ns/op > LeaPeephole.I_S_long avgt 5 1223.820 ± 17.777 ns/op > > Benchmark Mode Cnt Score Error Units > LeaPeephole.B_D_int avgt 5 860.889 ± 6.089 ns/op > LeaPeephole.B_D_long avgt 5 945.455 ± 21.603 ns/op > LeaPeephole.B_I_int avgt 5 849.109 ± 9.809 ns/op > LeaPeephole.B_I_long avgt 5 851.283 ± 16.921 ns/op > LeaPeephole.I_S_int avgt 5 976.594 ± 23.004 ns/op > LeaPeephole.I_S_long avgt 5 936.984 ± 9.601 ns/op > > A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet, although printing the scheduling itself has been made possible recently. > > Thank you very much. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: rename header to x64 specific ------------- Changes: - all: https://git.openjdk.org/jdk/pull/8025/files - new: https://git.openjdk.org/jdk/pull/8025/files/566b8dd1..72a9499c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=8025&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=8025&range=04-05 Stats: 4 lines in 3 files changed: 2 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/8025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/8025/head:pull/8025 PR: https://git.openjdk.org/jdk/pull/8025 From qamai at openjdk.org Tue Oct 11 00:24:32 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 11 Oct 2022 00:24:32 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v5] In-Reply-To: References: Message-ID: On Mon, 10 Oct 2022 23:03:05 GMT, Vladimir Ivanov wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> refactor includes > > src/hotspot/cpu/x86/x86_64.ad line 324: > >> 322: >> 323:
source_hpp %{ >> 324: #include CPU_HEADER(peephole) > Why don't you simply include `peephole_x86_64.hpp` here? Done ------------- PR: https://git.openjdk.org/jdk/pull/8025 From qamai at openjdk.org Tue Oct 11 00:26:22 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 11 Oct 2022 00:26:22 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v4] In-Reply-To: References: Message-ID: On Tue, 4 Oct 2022 17:26:04 GMT, Vladimir Kozlov wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> check index > > You need a second review. @vnkozlov I saw GHA being happy with x86_32, I also tried cross-compiling locally for 32-bit build. ------------- PR: https://git.openjdk.org/jdk/pull/8025 From njian at openjdk.org Tue Oct 11 01:16:21 2022 From: njian at openjdk.org (Ningsheng Jian) Date: Tue, 11 Oct 2022 01:16:21 GMT Subject: RFR: 8294262: AArch64: compiler/vectorapi/TestReverseByteTransforms.java test failed on SVE machine [v2] In-Reply-To: References: Message-ID: On Sat, 1 Oct 2022 00:21:24 GMT, Eric Liu wrote: >> This test failed at cases test_reversebytes_short/int/long_transform2, which expected a ReverseBytesV node, but none was found. On SVE systems, we have a specific optimization, `ReverseBytesV (ReverseBytesV X MASK) MASK => X`, which eliminates both ReverseBytesV nodes. This optimization rule applies only on hardware with native predicate support. See https://github.com/openjdk/jdk/pull/9623 for more details. >> >> As there is an SVE-specific case, TestReverseByteTransformsSVE.java, this patch simply marks TestReverseByteTransforms.java as non-SVE only. >> >> [TEST] >> jdk/incubator/vector, hotspot/compiler/vectorapi pass on SVE machine > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > add comment > > Change-Id: I4c17256ff656528bbcfcacd2ee2380df6ae14bf1 Looks good.
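A note on why the quoted rule is sound: byte reversal is an involution, so two back-to-back reversals of the same lanes cancel out, and on hardware with native predication the masked vector form enjoys the same property. The scalar analogue can be checked directly — an illustrative sketch, with a hypothetical class name, not code from the patch:

```java
public class ReverseBytesInvolution {
    public static void main(String[] args) {
        int x = 0x12345678;
        // A single reversal flips the byte order...
        if (Integer.reverseBytes(x) != 0x78563412)
            throw new AssertionError("single reversal");
        // ...so applying it twice restores the original value,
        // which is why back-to-back ReverseBytesV nodes can be elided.
        if (Integer.reverseBytes(Integer.reverseBytes(x)) != x)
            throw new AssertionError("double reversal");
        long y = 0x0102030405060708L;
        if (Long.reverseBytes(Long.reverseBytes(y)) != y)
            throw new AssertionError("long double reversal");
        System.out.println("byte reversal is an involution");
    }
}
```

This is exactly the pattern the test expected to survive: without the SVE rule the two ReverseBytesV nodes remain in the IR, while with it they cancel and the test finds none.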
------------- Marked as reviewed by njian (Committer). PR: https://git.openjdk.org/jdk/pull/10442 From njian at openjdk.org Tue Oct 11 01:17:25 2022 From: njian at openjdk.org (Ningsheng Jian) Date: Tue, 11 Oct 2022 01:17:25 GMT Subject: RFR: 8294186: AArch64: VectorMaskToLong failed on SVE2 machine with -XX:UseSVE=1 [v2] In-Reply-To: <8DwuwmReKGKRgl34NleQUXepGoosyWlRxEsQxtj_vbE=.ca094a2e-8e40-41cf-918f-828e15a72799@github.com> References: <8DwuwmReKGKRgl34NleQUXepGoosyWlRxEsQxtj_vbE=.ca094a2e-8e40-41cf-918f-828e15a72799@github.com> Message-ID: <6JuUfdSBd4gycq-5nhCf6D_RcULzh2FOq6XIMB6VjWI=.d961773e-4060-483a-8445-598d6fb4392a@github.com> On Wed, 28 Sep 2022 14:31:21 GMT, Eric Liu wrote: >> C2_MacroAssembler::sve_vmask_tolong would fail on a BITPERM-supported SVE2 machine with "-XX:UseSVE=1". >> >> `BITPERM` is an optional feature in SVE2. With this feature, VectorMaskToLong has a more efficient implementation. For other cases, it should generate SVE1 code. >> >> [TEST] >> jdk/incubator/vector, hotspot/compiler/vectorapi passed on a BITPERM-supported SVE2 machine, with option -XX:UseSVE=(0, 1, 2). > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > Refine comment > > Change-Id: I785817a0068098e9c48221cb391ef776186ef5de Marked as reviewed by njian (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/10443 From qamai at openjdk.org Tue Oct 11 01:19:45 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 11 Oct 2022 01:19:45 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v7] In-Reply-To: References: Message-ID: <1RH0_k_aEHU61kz0Razfek_Kj4OLUMDN5P3j_pmyBYI=.95fe56d7-2253-468e-a85e-16286f5d1c74@github.com> > Hi, > > The current peephole mechanism has several drawbacks: > - Can only match and remove adjacent instructions. > - Cannot match machine ideal nodes (e.g. MachSpillCopyNode).
> - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. > - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. > > The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grained manner. > > The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: > > mov r1, r2 -> lea r1, [r2 + r3/i] > add r1, r3/i > > and > > mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 > shl r1, i > > On the added benchmarks, the transformations show positive results: > > Benchmark Mode Cnt Score Error Units > LeaPeephole.B_D_int avgt 5 1200.490 ± 104.662 ns/op > LeaPeephole.B_D_long avgt 5 1211.439 ± 30.196 ns/op > LeaPeephole.B_I_int avgt 5 1118.831 ± 7.995 ns/op > LeaPeephole.B_I_long avgt 5 1112.389 ± 15.838 ns/op > LeaPeephole.I_S_int avgt 5 1262.528 ± 7.293 ns/op > LeaPeephole.I_S_long avgt 5 1223.820 ± 17.777 ns/op > > Benchmark Mode Cnt Score Error Units > LeaPeephole.B_D_int avgt 5 860.889 ± 6.089 ns/op > LeaPeephole.B_D_long avgt 5 945.455 ± 21.603 ns/op > LeaPeephole.B_I_int avgt 5 849.109 ± 9.809 ns/op > LeaPeephole.B_I_long avgt 5 851.283 ± 16.921 ns/op > LeaPeephole.I_S_int avgt 5 976.594 ± 23.004 ns/op > LeaPeephole.I_S_long avgt 5 936.984 ± 9.601 ns/op > > A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet, although printing the scheduling itself has been made possible recently. > > Thank you very much.
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: rename macro guard ------------- Changes: - all: https://git.openjdk.org/jdk/pull/8025/files - new: https://git.openjdk.org/jdk/pull/8025/files/72a9499c..9a0b65fe Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=8025&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=8025&range=05-06 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/8025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/8025/head:pull/8025 PR: https://git.openjdk.org/jdk/pull/8025 From eliu at openjdk.org Tue Oct 11 01:41:27 2022 From: eliu at openjdk.org (Eric Liu) Date: Tue, 11 Oct 2022 01:41:27 GMT Subject: Integrated: 8294262: AArch64: compiler/vectorapi/TestReverseByteTransforms.java test failed on SVE machine In-Reply-To: References: Message-ID: On Tue, 27 Sep 2022 09:31:43 GMT, Eric Liu wrote: > This test failed at the cases test_reversebytes_short/int/long_transform2, which expected a ReverseBytesV node, but none was found. On an SVE system, we have a specific optimization, `ReverseBytesV (ReverseBytesV X MASK) MASK => X`, which eliminates both ReverseBytesV nodes. This optimization rule applies specifically to hardware with native predicate support. See https://github.com/openjdk/jdk/pull/9623 for more details. > > As there is an SVE-specific case, TestReverseByteTransformsSVE.java, this patch simply marks TestReverseByteTransforms.java as non-SVE only. > > [TEST] > jdk/incubator/vector, hotspot/compiler/vectorapi pass on an SVE machine This pull request has now been integrated. 
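The eliminated pattern has a simple scalar analogue: byte reversal is an involution, so applying it twice yields the original value. A minimal sketch using the scalar JDK helpers; the real rule matches masked ReverseBytesV vector nodes, and the MASK operand is why it is restricted to hardware with native predicate support, a subtlety elided here:

```java
// Scalar analogue of ReverseBytesV (ReverseBytesV X MASK) MASK => X:
// reversing the bytes of a value twice yields the value itself.
public class ReverseBytesInvolution {
    public static void main(String[] args) {
        int i = 0x12345678;
        long l = 0x1122334455667788L;
        short s = (short) 0x1234;

        if (Integer.reverseBytes(Integer.reverseBytes(i)) != i) throw new AssertionError();
        if (Long.reverseBytes(Long.reverseBytes(l)) != l) throw new AssertionError();
        if (Short.reverseBytes(Short.reverseBytes(s)) != s) throw new AssertionError();

        System.out.println("double byte reversal is the identity");
    }
}
```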
Changeset: 9d116ec1 Author: Eric Liu URL: https://git.openjdk.org/jdk/commit/9d116ec147a3182a9c831ffdce02c98da8c5031d Stats: 5 lines in 1 file changed: 4 ins; 0 del; 1 mod 8294262: AArch64: compiler/vectorapi/TestReverseByteTransforms.java test failed on SVE machine Reviewed-by: aph, njian ------------- PR: https://git.openjdk.org/jdk/pull/10442 From kvn at openjdk.org Tue Oct 11 02:07:23 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Oct 2022 02:07:23 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v4] In-Reply-To: References: Message-ID: On Tue, 4 Oct 2022 17:26:04 GMT, Vladimir Kozlov wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> check index > > You need second review. > @vnkozlov I saw GHA being happy with x86_32, I also tried cross-compiling locally for 32-bit build. Good. My builds in tier1 also passed. I don't have any more comments. ------------- PR: https://git.openjdk.org/jdk/pull/8025 From kvn at openjdk.org Tue Oct 11 02:09:31 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Oct 2022 02:09:31 GMT Subject: RFR: 8289552: Make intrinsic conversions between bit representations of half precision values and floats [v13] In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 06:28:04 GMT, Smita Kamath wrote: >> 8289552: Make intrinsic conversions between bit representations of half precision values and floats > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Updated instruct to use kmovw Latest version v12 passed my tier1-4 testing. Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9781 From qamai at openjdk.org Tue Oct 11 04:24:26 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 11 Oct 2022 04:24:26 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v7] In-Reply-To: <1RH0_k_aEHU61kz0Razfek_Kj4OLUMDN5P3j_pmyBYI=.95fe56d7-2253-468e-a85e-16286f5d1c74@github.com> References: <1RH0_k_aEHU61kz0Razfek_Kj4OLUMDN5P3j_pmyBYI=.95fe56d7-2253-468e-a85e-16286f5d1c74@github.com> Message-ID: On Tue, 11 Oct 2022 01:19:45 GMT, Quan Anh Mai wrote: >> Hi, >> >> The current peephole mechanism has several drawbacks: >> - Can only match and remove adjacent instructions. >> - Cannot match machine ideal nodes (e.g. MachSpillCopyNode). >> - Can only replace 1 instruction; the position of insertion is limited to the position at which the matched nodes reside. >> - Is actually broken, since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. >> >> The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grained manner. >> >> The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: >> >> mov r1, r2 -> lea r1, [r2 + r3/i] >> add r1, r3/i >> >> and >> >> mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 >> shl r1, i >> >> On the added benchmarks, the transformations show positive results: >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 1200.490 ± 104.662 ns/op >> LeaPeephole.B_D_long avgt 5 1211.439 ± 30.196 ns/op >> LeaPeephole.B_I_int avgt 5 1118.831 ± 7.995 ns/op >> LeaPeephole.B_I_long avgt 5 1112.389 ± 15.838 ns/op >> LeaPeephole.I_S_int avgt 5 1262.528 ± 7.293 ns/op >> LeaPeephole.I_S_long avgt 5 1223.820 ± 
17.777 ns/op >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 860.889 ± 6.089 ns/op >> LeaPeephole.B_D_long avgt 5 945.455 ± 21.603 ns/op >> LeaPeephole.B_I_int avgt 5 849.109 ± 9.809 ns/op >> LeaPeephole.B_I_long avgt 5 851.283 ± 16.921 ns/op >> LeaPeephole.I_S_int avgt 5 976.594 ± 23.004 ns/op >> LeaPeephole.I_S_long avgt 5 936.984 ± 9.601 ns/op >> >> A follow-up patch would add IR tests for these transformations, since the IR framework has not been able to parse the ideal scheduling yet, although printing the scheduling itself has been made possible recently. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > rename macro guard Thank you very much for your reviews! ------------- PR: https://git.openjdk.org/jdk/pull/8025 From rcastanedalo at openjdk.org Tue Oct 11 07:15:28 2022 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 11 Oct 2022 07:15:28 GMT Subject: Integrated: 8294356: IGV: scheduled graphs contain duplicated elements In-Reply-To: References: Message-ID: On Mon, 26 Sep 2022 13:55:56 GMT, Roberto Castañeda Lozano wrote: > This changeset removes duplicated nodes and edges from graph dumps that include a control-flow graph: > ![cfg-before-after](https://user-images.githubusercontent.com/8792647/192294554-73ca3927-dab3-4d8f-9503-0904a4da7434.png) > This is achieved by ensuring that HotSpot only visits each node once when dumping IGV graphs. > > #### Testing > - Tested that tens of thousands of graphs do not contain duplicated nodes or edges by instrumenting IGV and running `java -Xcomp -XX:PrintIdealGraphLevel=4`. > - Tested manually that unscheduled graphs are not affected by this changeset. > - Tested that running compiler tests with `-XX:PrintIdealGraphLevel=3` does not trigger any failure. This pull request has now been integrated. 
Changeset: 97f1321c Author: Roberto Castañeda Lozano URL: https://git.openjdk.org/jdk/commit/97f1321cb455b536f1e4e056dec693c24f39d641 Stats: 18 lines in 1 file changed: 4 ins; 7 del; 7 mod 8294356: IGV: scheduled graphs contain duplicated elements Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/10423 From xgong at openjdk.org Tue Oct 11 07:25:56 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 11 Oct 2022 07:25:56 GMT Subject: RFR: 8292898: [vectorapi] Unify vector mask cast operation [v9] In-Reply-To: References: Message-ID: > The current implementation of the vector mask cast operation is > so complex that the compiler generates different patterns for different > scenarios. For architectures that do not support the predicate > feature, a vector mask is represented the same as a normal vector. > So the vector mask cast is implemented by a `VectorCast` node. But this > is not always needed. When two masks have the same element size (e.g. > int vs. float), their bit layouts are the same. So casting between > them does not need to emit any instructions. > > Currently the compiler generates different patterns based on the > vector type of the input/output and the platforms. Normally the > "`VectorMaskCast`" op is only used for cases that don't emit any > instructions, and the "`VectorCast`" op is used to implement the necessary > expand/narrow operations. This can avoid adding some duplicate rules > in the backend. However, this also has the following drawbacks: > > 1) The code is complex, especially when the compiler needs to > check whether the hardware supports the necessary IRs for the > vector mask cast. It needs to check different patterns for > different cases. > 2) The vector mask cast operation could be implemented with cheaper > instructions than the vector casting on some architectures. 
> > Instead of generating `VectorCast` or `VectorMaskCast` nodes for different > cases of vector mask cast operations, this patch unifies the vector > mask cast implementation with the "`VectorMaskCast`" node for all vector types > and platforms. The missing backend rules are also added for it. > > This patch also simplifies the vector mask conversion that happens in > "`VectorUnbox::Ideal()`". Normally "`VectorUnbox (VectorBox vmask)`" can > be optimized to "`vmask`" if the unboxing type matches with the boxed > "`vmask`" type. Otherwise, it needs the type conversion. Currently the > "`VectorUnbox`" will be transformed to two different patterns to implement > the conversion: > > 1) If the element size is not changed, it is transformed to: > > "VectorMaskCast vmask" > > 2) Otherwise, it is transformed to: > > "VectorLoadMask (VectorStoreMask vmask)" > > It first converts the "`vmask`" to a boolean vector with "`VectorStoreMask`", > and then uses "`VectorLoadMask`" to convert the boolean vector to the > dst mask vector. Since this patch makes the "`VectorMaskCast`" op supported > for all types on all platforms, it doesn't need the "`VectorLoadMask`" and > "`VectorStoreMask`" to do the conversion. The existing transformation: > > VectorUnbox (VectorBox vmask) => VectorLoadMask (VectorStoreMask vmask) > > can be simplified to: > > VectorUnbox (VectorBox vmask) => VectorMaskCast vmask Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 18 commits: - Merge latest 'jdk:master' - Use "setDefaultWarmup" instead of adding the annotation for each test - Merge branch 'jdk:master' into JDK-8292898 - Change to use "avx512vl" cpu feature for some IR tests - Add the IR test and fix review comments on x86 backend - Remove untaken code paths on x86 match rules - Add assertion to the elem num for mast cast - Merge branch 'jdk:master' into JDK-8292898 - 8292898: [vectorapi] Unify vector mask cast operation - Merge branch 'jdk:master' into JDK-8291600 - ... and 8 more: https://git.openjdk.org/jdk/compare/9d116ec1...5aab47d5 ------------- Changes: https://git.openjdk.org/jdk/pull/10192/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10192&range=08 Stats: 596 lines in 10 files changed: 289 ins; 141 del; 166 mod Patch: https://git.openjdk.org/jdk/pull/10192.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10192/head:pull/10192 PR: https://git.openjdk.org/jdk/pull/10192 From haosun at openjdk.org Tue Oct 11 07:50:28 2022 From: haosun at openjdk.org (Hao Sun) Date: Tue, 11 Oct 2022 07:50:28 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension In-Reply-To: References: Message-ID: On Fri, 23 Sep 2022 11:13:40 GMT, Bhavana Kilambi wrote: > Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - > > eor a, a, b > eor a, a, c > > can be optimized to single instruction - `eor3 a, b, c` > > This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. 
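The fusion described above rests on XOR associativity: two dependent eor instructions compute the same value as one three-way eor3. A scalar Java sketch of that equivalence; the bodies of the test1/test2 benchmarks are not shown in this thread, so the shapes below are assumptions:

```java
// Scalar sketch of the eor/eor3 rewrite: (a ^ b) ^ c computed in two
// dependent steps equals the single three-way exclusive OR.
public class Eor3Sketch {
    // eor a, a, b ; eor a, a, c
    static int twoEors(int a, int b, int c) {
        a = a ^ b;   // eor a, a, b
        a = a ^ c;   // eor a, a, c
        return a;
    }

    // eor3 a, b, c, a single instruction on SHA3-capable hardware
    static int eor3(int a, int b, int c) {
        return a ^ b ^ c;
    }

    public static void main(String[] args) {
        int a = 0xDEADBEEF, b = 0x12345678, c = 0x0F0F0F0F;
        if (twoEors(a, b, c) != eor3(a, b, c)) throw new AssertionError();
        System.out.println("two-eor and eor3 forms agree");
    }
}
```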
Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - > > > Benchmark gain > TestEor3.test1Int 10.87% > TestEor3.test1Long 8.84% > TestEor3.test2Int 21.68% > TestEor3.test2Long 21.04% > > > The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. test/hotspot/jtreg/compiler/vectorization/TestEor3AArch64.java line 87: > 85: @Test > 86: @IR(counts = {"veor3_neon", "> 0"}, applyIf = {"MaxVectorSize", "16"}) > 87: @IR(counts = {"veor3_sve", "> 0"}, applyIfAnd = {"UseSVE", "2", "MaxVectorSize", "> 16"}) Suggestion: @IR(counts = {"veor3_sve", "> 0"}, applyIfAnd = {"UseSVE", "2", "MaxVectorSize", "> 16"}, applyIfCPUFeature = {"svesha3", "true"}) After this PR(https://github.com/openjdk/jdk/pull/10402), `applyIf` and `applyIfCPUFeature` are evaluated as a logical conjunction. We can check CPU features and VM options at the same time now. Of course, the comment at line 79 should be removed. ------------- PR: https://git.openjdk.org/jdk/pull/10407 From aturbanov at openjdk.org Tue Oct 11 08:07:59 2022 From: aturbanov at openjdk.org (Andrey Turbanov) Date: Tue, 11 Oct 2022 08:07:59 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v18] In-Reply-To: References: Message-ID: On Tue, 4 Oct 2022 13:39:36 GMT, Tobias Holenstein wrote: >> Cleanup of the code in IGV without changing the functionality. 
>> >> - removed dead code (unused classes, functions, variables) from the IGV code base >> - merged (and removed) redundant functions >> - added explicit position arguments to `layer.xml` - This avoids the position warning during building of IGV >> - ordered the inputs alphabetically, and used wildcards if >= 5 imports of a particular package >> - made class variables `final` whenever possible >> - removed `this.` in `this.function()` function calls when it was not needed >> - used lambdas instead of anonymous classes if possible >> - fixed whitespace issues (e.g. double whitespace) >> - removed an unneeded copy of `RangeSliderModel tempModel` in `RangeSliderModel.java` >> - changed `EditorTopComponent` to take `InputGraph` as argument in constructor instead of `Diagram` and moved the creation of the `Diagram` to `DiagramViewModel.java` to increase encapsulation/modularity > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > re-add hideDuplicates.png src/utils/IdealGraphVisualizer/Data/src/main/java/com/sun/hotspot/igv/data/InputEdge.java line 43: > 41: > 42: public static final Comparator OUTGOING_COMPARATOR = (o1, o2) -> { > 43: if(o1.getFromIndex() == o2.getFromIndex()) { Suggestion: if (o1.getFromIndex() == o2.getFromIndex()) { src/utils/IdealGraphVisualizer/Data/src/main/java/com/sun/hotspot/igv/data/InputEdge.java line 50: > 48: > 49: public static final Comparator INGOING_COMPARATOR = (o1, o2) -> { > 50: if(o1.getToIndex() == o2.getToIndex()) { Suggestion: if (o1.getToIndex() == o2.getToIndex()) { src/utils/IdealGraphVisualizer/Graph/src/main/java/com/sun/hotspot/igv/graph/Diagram.java line 73: > 71: > 72: for (InputBlock b : graph.getBlocks()) { > 73: blocks.put(b, new Block(b, this)); Suggestion: blocks.put(b, new Block(b, this)); ------------- PR: https://git.openjdk.org/jdk/pull/10197 From shade at openjdk.org Tue Oct 11 09:55:25 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 
11 Oct 2022 09:55:25 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src In-Reply-To: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: On Mon, 10 Oct 2022 06:32:09 GMT, Dingli Zhang wrote: > I built hsdis with the following parameters from source code of binutils while cross-compiling: > > --with-hsdis=binutils \ > --with-binutils-src=/home/dingli/jdk-tools/binutils-2.38 > > > But configure will exit with the following error: > > checking whether we are cross compiling... configure: error: in `/home/dingli/jdk-tools/binutils-2.38-src': > configure: error: cannot run C compiled programs. > If you meant to cross compile, use `--host'. > See `config.log' for more details > configure: Automatic building of binutils failed on configure. Try building it manually > configure: error: Cannot continue > configure exiting with result code 1 > > > The reason for the error is that binutils wants to be configured with --host during cross-compilation. So we can determine if we are currently cross-compiling and add the --host parameter to binutils_target: > > diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 > index d72bbf6df32..dddc1cf6a4d 100644 > --- a/make/autoconf/lib-hsdis.m4 > +++ b/make/autoconf/lib-hsdis.m4 > @@ -175,7 +175,11 @@ AC_DEFUN([LIB_BUILD_BINUTILS], > fi > else > binutils_cc="$CC $SYSROOT_CFLAGS" > - binutils_target="" > + if test "x$host" = "x$build"; then > + binutils_target="" > + else > + binutils_target="--host=$host" > + fi > fi > binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" > > > > In the meantime, I removed some useless code about hsdis-demo because hsdis-demo.c was removed in [JDK-8275128](https://bugs.openjdk.org/browse/JDK-8275128) . 
> > ## Testing: > > - cross compile for RISC-V on x86_64 I think it requires a bit more fiddling for other arches: $ CXX=aarch64-linux-gnu-g++ CC=aarch64-linux-gnu-gcc sh ./configure --with-debug-level=fastdebug --openjdk-target=aarch64-linux-gnu --with-sysroot=/chroots/arm64 --with-boot-jdk=/home/shade/Install/jdk19u-ea --with-hsdis=binutils --with-binutils-src=binutils-2.39 ... ------------- PR: https://git.openjdk.org/jdk/pull/10628 From shade at openjdk.org Tue Oct 11 10:11:26 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 11 Oct 2022 10:11:26 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src In-Reply-To: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: On Mon, 10 Oct 2022 06:32:09 GMT, Dingli Zhang wrote: > I built hsdis with the following parameters from source code of binutils while cross-compiling: > > --with-hsdis=binutils \ > --with-binutils-src=/home/dingli/jdk-tools/binutils-2.38 > > > But configure will exit with the following error: > > checking whether we are cross compiling... configure: error: in `/home/dingli/jdk-tools/binutils-2.38-src': > configure: error: cannot run C compiled programs. > If you meant to cross compile, use `--host'. > See `config.log' for more details > configure: Automatic building of binutils failed on configure. Try building it manually > configure: error: Cannot continue > configure exiting with result code 1 > > > The reason for the error is that binutils wants to be configured with --host during cross-compilation. 
So we can determine if we are currently cross-compiling and add the --host parameter to binutils_target: > > diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 > index d72bbf6df32..dddc1cf6a4d 100644 > --- a/make/autoconf/lib-hsdis.m4 > +++ b/make/autoconf/lib-hsdis.m4 > @@ -175,7 +175,11 @@ AC_DEFUN([LIB_BUILD_BINUTILS], > fi > else > binutils_cc="$CC $SYSROOT_CFLAGS" > - binutils_target="" > + if test "x$host" = "x$build"; then > + binutils_target="" > + else > + binutils_target="--host=$host" > + fi > fi > binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" > > > > In the meantime, I removed some useless code about hsdis-demo because hsdis-demo.c was removed in [JDK-8275128](https://bugs.openjdk.org/browse/JDK-8275128) . > > ## Testing: > > - cross compile for RISC-V on x86_64 I think checking it like this would be more robust? if test "x$conf_openjdk_target" != "x"; then binutils_target="--host=$conf_openjdk_target" else binutils_target="" fi This allows me to produce AArch64 hsdis: $ file ./build/linux-aarch64-server-fastdebug/support/hsdis/libhsdis.so ./build/linux-aarch64-server-fastdebug/support/hsdis/libhsdis.so: ELF 64-bit LSB shared object, ARM aarch64, version 1 (SYSV), dynamically linked, BuildID[sha1]=c709135f1f2ef2bee329550044fabea35e33ebb3, not stripped ------------- PR: https://git.openjdk.org/jdk/pull/10628 From shade at openjdk.org Tue Oct 11 10:43:23 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 11 Oct 2022 10:43:23 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src In-Reply-To: References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: On Tue, 11 Oct 2022 10:09:02 GMT, Aleksey Shipilev wrote: > I think checking it like this would be more robust? 
> > ``` > if test "x$conf_openjdk_target" != "x"; then > binutils_target="--host=$conf_openjdk_target" > else > binutils_target="" > fi > ``` Also need to pass `AR` to binutils build and configure with `AR=riscv64-linux-gnu-ar` to get the RISC-V cross-build back: diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 index d72bbf6df32..7be49fbf778 100644 --- a/make/autoconf/lib-hsdis.m4 +++ b/make/autoconf/lib-hsdis.m4 @@ -175,12 +175,16 @@ AC_DEFUN([LIB_BUILD_BINUTILS], fi else binutils_cc="$CC $SYSROOT_CFLAGS" - binutils_target="" + if test "x$conf_openjdk_target" != "x"; then + binutils_target="--host=$conf_openjdk_target" + else + binutils_target="" + fi fi binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" AC_MSG_NOTICE([Running binutils configure]) - AC_MSG_NOTICE([configure command line: ./configure --disable-nls CFLAGS="$binutils_cflags" CC="$binutils_cc" $binutils_target]) + AC_MSG_NOTICE([configure command line: ./configure --disable-nls CFLAGS="$binutils_cflags" CC="$binutils_cc" AR="$AR" $binutils_target]) saved_dir=`pwd` cd "$BINUTILS_SRC" ./configure --disable-nls CFLAGS="$binutils_cflags" CC="$binutils_cc" $binutils_target This allows building hsdis on following arches with server ports: i686-linux-gnu x86_64-linux-gnu aarch64-linux-gnu powerpc64le-linux-gnu s390x-linux-gnu arm-linux-gnueabihf riscv64-linux-gnu ------------- PR: https://git.openjdk.org/jdk/pull/10628 From ihse at openjdk.org Tue Oct 11 10:51:29 2022 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Tue, 11 Oct 2022 10:51:29 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src In-Reply-To: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: On Mon, 10 Oct 2022 06:32:09 GMT, Dingli Zhang wrote: > I built hsdis with the following parameters 
from source code of binutils while cross-compiling: > > --with-hsdis=binutils \ > --with-binutils-src=/home/dingli/jdk-tools/binutils-2.38 > > > But configure will exit with the following error: > > checking whether we are cross compiling... configure: error: in `/home/dingli/jdk-tools/binutils-2.38-src': > configure: error: cannot run C compiled programs. > If you meant to cross compile, use `--host'. > See `config.log' for more details > configure: Automatic building of binutils failed on configure. Try building it manually > configure: error: Cannot continue > configure exiting with result code 1 > > > The reason for the error is that binutils wants to be configured with --host during cross-compilation. So we can determine if we are currently cross-compiling and add the --host parameter to binutils_target: > > diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 > index d72bbf6df32..dddc1cf6a4d 100644 > --- a/make/autoconf/lib-hsdis.m4 > +++ b/make/autoconf/lib-hsdis.m4 > @@ -175,7 +175,11 @@ AC_DEFUN([LIB_BUILD_BINUTILS], > fi > else > binutils_cc="$CC $SYSROOT_CFLAGS" > - binutils_target="" > + if test "x$host" = "x$build"; then > + binutils_target="" > + else > + binutils_target="--host=$host" > + fi > fi > binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" > > > > In the meantime, I removed some useless code about hsdis-demo because hsdis-demo.c was removed in [JDK-8275128](https://bugs.openjdk.org/browse/JDK-8275128) . > > ## Testing: > > - cross compile for RISC-V on x86_64 Looks good to me. ------------- Marked as reviewed by ihse (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10628 From shade at openjdk.org Tue Oct 11 11:15:23 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 11 Oct 2022 11:15:23 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src In-Reply-To: References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: On Tue, 11 Oct 2022 10:41:07 GMT, Aleksey Shipilev wrote: > This allows building hsdis on following arches with server ports: Also these Zero ports produce `hsdis` binaries as well (they are not as useful there, though, because no JIT compilers are done there): alpha-linux-gnu arm-linux-gnueabi m68k-linux-gnu mips64el-linux-gnuabi64 mipsel-linux-gnu powerpc-linux-gnu sh4-linux-gnu sparc64-linux-gnu ------------- PR: https://git.openjdk.org/jdk/pull/10628 From ihse at openjdk.org Tue Oct 11 11:37:04 2022 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Tue, 11 Oct 2022 11:37:04 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src In-Reply-To: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: <7RkL4CT3ovVO7aMm9y08k0sLDg1mK2TouRRGRHc97M8=.60c4178e-0e2d-439e-8342-0e5603004e17@github.com> On Mon, 10 Oct 2022 06:32:09 GMT, Dingli Zhang wrote: > I built hsdis with the following parameters from source code of binutils while cross-compiling: > > --with-hsdis=binutils \ > --with-binutils-src=/home/dingli/jdk-tools/binutils-2.38 > > > But configure will exit with the following error: > > checking whether we are cross compiling... configure: error: in `/home/dingli/jdk-tools/binutils-2.38-src': > configure: error: cannot run C compiled programs. > If you meant to cross compile, use `--host'. 
> See `config.log' for more details > configure: Automatic building of binutils failed on configure. Try building it manually > configure: error: Cannot continue > configure exiting with result code 1 > > > The reason for the error is that binutils wants to be configured with --host during cross-compilation. So we can determine if we are currently cross-compiling and add the --host parameter to binutils_target: > > diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 > index d72bbf6df32..dddc1cf6a4d 100644 > --- a/make/autoconf/lib-hsdis.m4 > +++ b/make/autoconf/lib-hsdis.m4 > @@ -175,7 +175,11 @@ AC_DEFUN([LIB_BUILD_BINUTILS], > fi > else > binutils_cc="$CC $SYSROOT_CFLAGS" > - binutils_target="" > + if test "x$host" = "x$build"; then > + binutils_target="" > + else > + binutils_target="--host=$host" > + fi > fi > binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" > > > > In the meantime, I removed some useless code about hsdis-demo because hsdis-demo.c was removed in [JDK-8275128](https://bugs.openjdk.org/browse/JDK-8275128) . > > ## Testing: > > - cross compile for RISC-V on x86_64 Changing my review; we need to take this for another spin. ------------- Changes requested by ihse (Reviewer). PR: https://git.openjdk.org/jdk/pull/10628 From ihse at openjdk.org Tue Oct 11 11:37:07 2022 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Tue, 11 Oct 2022 11:37:07 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src In-Reply-To: References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: On Tue, 11 Oct 2022 10:09:02 GMT, Aleksey Shipilev wrote: > I think checking it like this would be more robust? 
> > ``` > if test "x$conf_openjdk_target" != "x"; then > binutils_target="--host=$conf_openjdk_target" > else > binutils_target="" > fi > ``` > > This allows me to produce AArch64 hsdis: > > ``` > $ file ./build/linux-aarch64-server-fastdebug/support/hsdis/libhsdis.so > ./build/linux-aarch64-server-fastdebug/support/hsdis/libhsdis.so: ELF 64-bit LSB shared object, ARM aarch64, version 1 (SYSV), dynamically linked, BuildID[sha1]=c709135f1f2ef2bee329550044fabea35e33ebb3, not stripped > ``` Yeah, you're onto something. We should not check the autoconf variables. I'd even recommend testing for cross compilation like this: if test "x$COMPILE_TYPE" = xcross; then # ... We also have a `CROSS_COMPILE_ARCH` which is set to `$OPENJDK_$1_CPU_LEGACY`, but right now I can't say how it relates to `$conf_openjdk_target`. But I'd rather not use the `$conf_...` variables outside the option testing code. If we do need it, and it is not already available, we should "export" it by giving it a separate, uppercase variable name. ------------- PR: https://git.openjdk.org/jdk/pull/10628 From ihse at openjdk.org Tue Oct 11 11:40:14 2022 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Tue, 11 Oct 2022 11:40:14 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src In-Reply-To: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: On Mon, 10 Oct 2022 06:32:09 GMT, Dingli Zhang wrote: > I built hsdis with the following parameters from source code of binutils while cross-compiling: > > --with-hsdis=binutils \ > --with-binutils-src=/home/dingli/jdk-tools/binutils-2.38 > > > But configure will exit with the following error: > > checking whether we are cross compiling... configure: error: in `/home/dingli/jdk-tools/binutils-2.38-src': > configure: error: cannot run C compiled programs. 
> If you meant to cross compile, use `--host'. > See `config.log' for more details > configure: Automatic building of binutils failed on configure. Try building it manually > configure: error: Cannot continue > configure exiting with result code 1 > > > The reason for the error is that binutils wants to be configured with --host during cross-compilation. So we can determine if we are currently cross-compiling and add the --host parameter to binutils_target: > > diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 > index d72bbf6df32..dddc1cf6a4d 100644 > --- a/make/autoconf/lib-hsdis.m4 > +++ b/make/autoconf/lib-hsdis.m4 > @@ -175,7 +175,11 @@ AC_DEFUN([LIB_BUILD_BINUTILS], > fi > else > binutils_cc="$CC $SYSROOT_CFLAGS" > - binutils_target="" > + if test "x$host" = "x$build"; then > + binutils_target="" > + else > + binutils_target="--host=$host" > + fi > fi > binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" > > > > In the meantime, I removed some useless code about hsdis-demo because hsdis-demo.c was removed in [JDK-8275128](https://bugs.openjdk.org/browse/JDK-8275128) . > > ## Testing: > > - cross compile for RISC-V on x86_64 Ok, we already have an exported value for `$host`, which is `$OPENJDK_TARGET_AUTOCONF_NAME`. Also, `$conf_openjdk_target` is used in the wrapper configure script. It is probably leaking into the main generated autoconf script, but it is definitely not supposed to be used there. Instead, it should only be used to setup the `--host=` option to autoconf. So looking for `$host` is fine I suppose, but we should do it using the OPENJDK_TARGET_AUTOCONF_NAME variable. 
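Condensing the thread: the binutils configure run needs a `--host` flag exactly when the JDK build is a cross build. A minimal shell sketch of that decision; the triplet values are examples, and whether the exported variable to test is `$host`, `$OPENJDK_TARGET_AUTOCONF_NAME`, or `$COMPILE_TYPE` is precisely what is being discussed above:

```shell
#!/bin/sh
# Sketch: pass --host to the binutils configure only when cross-compiling.
build_triplet="x86_64-linux-gnu"     # machine the build runs on (example)
host_triplet="riscv64-linux-gnu"     # machine the resulting hsdis will run on (example)

if [ "x$host_triplet" = "x$build_triplet" ]; then
  binutils_target=""
else
  binutils_target="--host=$host_triplet"
fi

echo "binutils configure flags: $binutils_target"
```

For a native build the two triplets match and `binutils_target` stays empty, reproducing the current behavior.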
------------- PR: https://git.openjdk.org/jdk/pull/10628 From jbhateja at openjdk.org Tue Oct 11 12:25:45 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 11 Oct 2022 12:25:45 GMT Subject: RFR: 8293531: C2: some vectorapi tests fail assert "Not monotonic" with flag -XX:TypeProfileLevel=222 Message-ID: The problem occurs in the iterative dataflow analysis during the CCP optimization: the meet operation drops the speculative types before the participating lattice values converge, since the [include_speculative argument it receives is always set to false](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/type.hpp#L231), whereas the [equality check](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/phaseX.cpp#L1751) in the failing assertion is performed against the original type, which still carries the speculative type. To fix this, the type comparison in the assertion should also be done after stripping the speculative type. With this change, the intermittent assertion failures in several vector API tests reported in the bug are no longer seen. Kindly review and share your feedback.
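To make the "Not monotonic" failure mode concrete, here is a toy model in plain Java (not HotSpot code; the class and field names are invented for illustration): a meet that ignores speculative refinements can produce a result that compares unequal to the unstripped original type, yet equal once both sides are stripped.

```java
public class SpeculativeMeetDemo {
    // A toy "type": a base type plus an optional speculative refinement.
    static final class Type {
        final String base;
        final String speculative; // may be null
        Type(String base, String speculative) { this.base = base; this.speculative = speculative; }
        Type removeSpeculative() { return new Type(base, null); }
        boolean same(Type other) {
            return base.equals(other.base)
                && java.util.Objects.equals(speculative, other.speculative);
        }
    }

    // Toy meet that drops speculative parts (include_speculative == false).
    static Type meet(Type a, Type b) {
        String base = a.base.equals(b.base) ? a.base : "bottom"; // toy base meet
        return new Type(base, null);
    }

    public static void main(String[] args) {
        Type oldType = new Type("Object", "LinkedList"); // still carries speculation
        Type newType = new Type("Object", null);
        Type met = meet(oldType, newType);
        // Like the failing assert: compare against the unstripped original.
        System.out.println("naive=" + met.same(oldType));
        // The fix: strip speculation on both sides before comparing.
        System.out.println("fixed=" + met.same(oldType.removeSpeculative()));
    }
}
```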
Best Regards, Jatin ------------- Commit messages: - 8293531: C2: some vectorapi tests fail assert "Not monotonic" with flag -XX:TypeProfileLevel=222 Changes: https://git.openjdk.org/jdk/pull/10648/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10648&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8293531 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10648.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10648/head:pull/10648 PR: https://git.openjdk.org/jdk/pull/10648 From jbhateja at openjdk.org Tue Oct 11 12:30:18 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 11 Oct 2022 12:30:18 GMT Subject: RFR: 8288043: Optimize FP to word/sub-word integral type conversion on X86 AVX2 platforms [v9] In-Reply-To: References: Message-ID: > Hi All, > > This patch extends conversion optimizations added with [JDK-8287835](https://bugs.openjdk.org/browse/JDK-8287835) to optimize following floating point to integral conversions for X86 AVX2 targets:- > * D2I , D2S, D2B, F2I , F2S, F2B > > In addition, it also optimizes following wide vector (64 bytes) double to integer and sub-type conversions for AVX512 targets which do not support AVX512DQ feature. > * D2I, D2S, D2B > > Following are the JMH micro performance results with and without patch. 
> > System configuration: 40C 2S Icelake server (Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz) > > BENCHMARK | SIZE | BASELINE (ops/ms) | WITHOPT (ops/ms) | PERF GAIN FACTOR > -- | -- | -- | -- | -- > VectorFPtoIntCastOperations.microDouble128ToByte128 | 1024 | 90.603 | 92.797 | 1.024215534 > VectorFPtoIntCastOperations.microDouble128ToByte256 | 1024 | 81.909 | 82.3 | 1.00477359 > VectorFPtoIntCastOperations.microDouble128ToByte512 | 1024 | 26.181 | 26.244 | 1.002406325 > VectorFPtoIntCastOperations.microDouble128ToInteger128 | 1024 | 90.74 | 2537.958 | 27.96956138 > VectorFPtoIntCastOperations.microDouble128ToInteger256 | 1024 | 81.586 | 2429.599 | 29.7796068 > VectorFPtoIntCastOperations.microDouble128ToInteger512 | 1024 | 19.406 | 19.61 | 1.010512213 > VectorFPtoIntCastOperations.microDouble128ToLong128 | 1024 | 91.723 | 90.754 | 0.989435583 > VectorFPtoIntCastOperations.microDouble128ToShort128 | 1024 | 91.766 | 1984.577 | 21.62649565 > VectorFPtoIntCastOperations.microDouble128ToShort256 | 1024 | 81.949 | 1940.599 | 23.68056962 > VectorFPtoIntCastOperations.microDouble128ToShort512 | 1024 | 16.468 | 16.56 | 1.005586592 > VectorFPtoIntCastOperations.microDouble256ToByte128 | 1024 | 163.331 | 3018.351 | 18.479964 > VectorFPtoIntCastOperations.microDouble256ToByte256 | 1024 | 148.878 | 3082.034 | 20.70174237 > VectorFPtoIntCastOperations.microDouble256ToByte512 | 1024 | 50.108 | 51.629 | 1.030354434 > VectorFPtoIntCastOperations.microDouble256ToInteger128 | 1024 | 159.805 | 4619.421 | 28.90661118 > VectorFPtoIntCastOperations.microDouble256ToInteger256 | 1024 | 143.876 | 4649.642 | 32.31700909 > VectorFPtoIntCastOperations.microDouble256ToInteger512 | 1024 | 38.127 | 38.188 | 1.001599916 > VectorFPtoIntCastOperations.microDouble256ToLong128 | 1024 | 160.322 | 162.442 | 1.013223388 > VectorFPtoIntCastOperations.microDouble256ToLong256 | 1024 | 141.252 | 143.01 | 1.012445841 > VectorFPtoIntCastOperations.microDouble256ToShort128 | 1024 | 157.717 | 3757.471 | 
23.82413437 > VectorFPtoIntCastOperations.microDouble256ToShort256 | 1024 | 143.876 | 3830.971 | 26.62689399 > VectorFPtoIntCastOperations.microDouble256ToShort512 | 1024 | 32.061 | 32.911 | 1.026511962 > VectorFPtoIntCastOperations.microFloat128ToByte128 | 1024 | 146.599 | 4002.967 | 27.30555461 > VectorFPtoIntCastOperations.microFloat128ToByte256 | 1024 | 136.99 | 3938.799 | 28.75245638 > VectorFPtoIntCastOperations.microFloat128ToByte512 | 1024 | 51.561 | 50.284 | 0.975233219 > VectorFPtoIntCastOperations.microFloat128ToInteger128 | 1024 | 5933.565 | 5361.472 | 0.903583596 > VectorFPtoIntCastOperations.microFloat128ToInteger256 | 1024 | 5079.564 | 5062.046 | 0.996551279 > VectorFPtoIntCastOperations.microFloat128ToInteger512 | 1024 | 37.101 | 38.419 | 1.035524649 > VectorFPtoIntCastOperations.microFloat128ToLong128 | 1024 | 145.863 | 145.362 | 0.99656527 > VectorFPtoIntCastOperations.microFloat128ToLong256 | 1024 | 131.159 | 133.154 | 1.015210546 > VectorFPtoIntCastOperations.microFloat128ToShort128 | 1024 | 145.966 | 4150.039 | 28.4315457 > VectorFPtoIntCastOperations.microFloat128ToShort256 | 1024 | 134.703 | 4566.589 | 33.90116775 > VectorFPtoIntCastOperations.microFloat128ToShort512 | 1024 | 31.878 | 30.867 | 0.968285338 > VectorFPtoIntCastOperations.microFloat256ToByte128 | 1024 | 237.841 | 6292.051 | 26.4548627 > VectorFPtoIntCastOperations.microFloat256ToByte256 | 1024 | 222.041 | 6292.748 | 28.34047766 > VectorFPtoIntCastOperations.microFloat256ToByte512 | 1024 | 92.073 | 88.981 | 0.966417951 > VectorFPtoIntCastOperations.microFloat256ToInteger128 | 1024 | 11471.121 | 10269.636 | 0.895260019 > VectorFPtoIntCastOperations.microFloat256ToInteger256 | 1024 | 10729.816 | 10105.92 | 0.941853989 > VectorFPtoIntCastOperations.microFloat256ToInteger512 | 1024 | 68.328 | 70.005 | 1.024543379 > VectorFPtoIntCastOperations.microFloat256ToLong128 | 1024 | 247.101 | 248.571 | 1.005948984 > VectorFPtoIntCastOperations.microFloat256ToLong256 | 1024 | 225.74 | 223.987 | 
0.992234429 > VectorFPtoIntCastOperations.microFloat256ToLong512 | 1024 | 76.39 | 76.187 | 0.997342584 > VectorFPtoIntCastOperations.microFloat256ToShort128 | 1024 | 233.196 | 8202.179 | 35.17289748 > VectorFPtoIntCastOperations.microFloat256ToShort256 | 1024 | 220.75 | 7781.073 | 35.24834881 > VectorFPtoIntCastOperations.microFloat256ToShort512 | 1024 | 58.143 | 55.633 | 0.956830573 > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits: - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8288043 - 8288043: Review comments resolutions. - 8288043: Adding descriptive comments. - 8288043: cost adjustments for loop body size estimation. - 8288043: Extending exiting regressions with more cases. - 8288043: Code re-factoring. - 8288043: Some mainline merge realted cleanups. - 8288043: Adding a descriptive comment for removing explicit scratch registers needed to load stub constants. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8288043 - 8288043: Adding a descriptive comment. - ... 
and 3 more: https://git.openjdk.org/jdk/compare/9d116ec1...6fb1a5d9 ------------- Changes: https://git.openjdk.org/jdk/pull/9748/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9748&range=08 Stats: 995 lines in 10 files changed: 796 ins; 55 del; 144 mod Patch: https://git.openjdk.org/jdk/pull/9748.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9748/head:pull/9748 PR: https://git.openjdk.org/jdk/pull/9748 From jbhateja at openjdk.org Tue Oct 11 12:30:19 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 11 Oct 2022 12:30:19 GMT Subject: RFR: 8288043: Optimize FP to word/sub-word integral type conversion on X86 AVX2 platforms [v4] In-Reply-To: References: Message-ID: On Mon, 19 Sep 2022 14:15:19 GMT, Jatin Bhateja wrote: >> Could you please enable the compiler/vectorapi/VectorFPtoIntCastTest.java for AVX2 platforms? >> Currently they are only run for AVX512DQ platforms. > >> Could you please enable the compiler/vectorapi/VectorFPtoIntCastTest.java for AVX2 platforms? Currently they are only run for AVX512DQ platforms. > > I have added missing casting cases AVX/AVX2 and AVX512 targets in existing comprehensive test for [casting](test/hotspot/jtreg/compiler/vectorapi/reshape/utils/TestCastMethods.java.) > @jatin-bhateja, please merge latest JDK and I will start re-testing. Hi @kvn, kindly regress the changes. ------------- PR: https://git.openjdk.org/jdk/pull/9748 From shade at openjdk.org Tue Oct 11 13:09:14 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 11 Oct 2022 13:09:14 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src In-Reply-To: References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: On Tue, 11 Oct 2022 11:38:02 GMT, Magnus Ihse Bursie wrote: > Ok, we already have an exported value for `$host`, which is `$OPENJDK_TARGET_AUTOCONF_NAME`. Also, `$conf_openjdk_target` is used in the wrapper configure script. 
It is probably leaking into the main generated autoconf script, but it is definitely not supposed to be used there. Instead, it should only be used to setup the `--host=` option to autoconf. So looking for `$host` is fine I suppose, but we should do it using the OPENJDK_TARGET_AUTOCONF_NAME variable. Quite! Applying this patch over the PR: diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 index dddc1cf6a4d..72bd08c7108 100644 --- a/make/autoconf/lib-hsdis.m4 +++ b/make/autoconf/lib-hsdis.m4 @@ -175,10 +175,10 @@ AC_DEFUN([LIB_BUILD_BINUTILS], fi else binutils_cc="$CC $SYSROOT_CFLAGS" - if test "x$host" = "x$build"; then - binutils_target="" + if test "x$COMPILE_TYPE" = xcross; then + binutils_target="--host=$OPENJDK_TARGET_AUTOCONF_NAME" else - binutils_target="--host=$host" + binutils_target="" fi fi binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" ...successfully produces the hsdis binaries on all these platforms: server-release-aarch64-linux-gnu-10 server-release-arm-linux-gnueabihf-10 server-release-i686-linux-gnu-10 server-release-powerpc64le-linux-gnu-10 server-release-powerpc64-linux-gnu-10 server-release-riscv64-linux-gnu-10 server-release-s390x-linux-gnu-10 server-release-x86_64-linux-gnu-10 zero-release-aarch64-linux-gnu-10 zero-release-alpha-linux-gnu-10 zero-release-arm-linux-gnueabi-10 zero-release-arm-linux-gnueabihf-10 zero-release-i686-linux-gnu-10 zero-release-m68k-linux-gnu-10 zero-release-mips64el-linux-gnuabi64-10 zero-release-mipsel-linux-gnu-10 zero-release-powerpc64le-linux-gnu-10 zero-release-powerpc64-linux-gnu-10 zero-release-powerpc-linux-gnu-10 zero-release-riscv64-linux-gnu-10 zero-release-s390x-linux-gnu-10 zero-release-sh4-linux-gnu-10 zero-release-sparc64-linux-gnu-10 zero-release-x86_64-linux-gnu-10 Therefore, I believe this is what we should do and then call it a day. 
(Then I also need to start building all these hsdis-es at [https://builds.shipilev.net/hsdis/](https://builds.shipilev.net/hsdis/)) ------------- PR: https://git.openjdk.org/jdk/pull/10628 From vladimir.kozlov at oracle.com Tue Oct 11 15:44:00 2022 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 11 Oct 2022 08:44:00 -0700 Subject: [EXTERNAL][External] : Re: RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: <114af950-f6b6-7e4a-8ac0-3da99bd40297@amazon.com> References: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> <94A2737B-0FDE-4159-827F-15B6ACE1BC42@oracle.com> <114af950-f6b6-7e4a-8ac0-3da99bd40297@amazon.com> Message-ID: Hi Liu, To clarify, I think your proposal is reasonable and we would like you to continue to work on it. Can you share details of its current status? Igor and I commented about C2 specifics you need to take into account. I can suggest starting with simple things: only object instances. Arrays can be covered later. You can use the test from Cesar's PR [1] to test your implementation and extend it with additional cases. Thanks, Vladimir K [1] https://github.com/openjdk/jdk/pull/9073 On 10/7/22 3:26 PM, Liu, Xin wrote: > Hi, Igor and Vladimir, > > I am not inventing anything new. All I am thinking about is how to adapt > Stadler's algorithm to C2. All innovation belongs to the author. > > Figure-3 of my RFC is a copy of Listing-7 in his paper. Allow me to > repeat his data structure here. I drop "class Id" because I think I can > use the AllocationNode pointer or even the node idx instead. > > // this is per allocation, identified by 'Id'. > class VirtualState extends ObjectState { > int lockCount; > Node[] entries; > }; > > // this is per basic-block > class State { > Map state; > Map alias; > }; > > Within a basic block, PEA keeps tracking the allocation state of an object > using VirtualState. In his paper, Figure-4 (b) and (e) depict how the > algorithm tracks stores.
To get flow-sensitive information, Stadler > iterates the scheduled nodes in a basic block. I propose to iterate > bytecodes within a basic block. > >> when you rematerialize the object, it consumes the current updated > values to construct it. How do you intend to track those? >>> Yes, you either track stores in Parser or do what current C2 EA does > and create unique memory slices for VirtualObject. > > I plan to follow suit and track stores in the parser! I also need to create > a unique memory slice when I have to materialize a virtual object. This > is for the InitializeNode, and I need to initialize the object to the > cumulative state. > > thanks, > --lx > > > > > On 10/7/22 1:21 PM, Vladimir Kozlov wrote: >> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >> >> >> >> On 10/7/22 10:37 AM, Igor Veresov wrote: >>> The major difference between Graal and C2 is that Graal captures the state at side effects and C2 captures the state at deopt points. That allows Graal to deduce the state at any time, including when it needs to insert a rematerializing allocation during PEA. So, with C2 you have to either do everything in the parser as you are proposing or do the same thing as Graal and at least capture the state for stores. Having a state different from the original allocation point is ok. Both Graal and C2 would throw OOMs from a place that could be far from the original point because of the EA. >>> >>> I think you also have to track the values of all of the object components, right? So when you rematerialize the object, it consumes the current updated values to construct it. How do you intend to track those? >> >> Yes, you either track stores in Parser or do what current C2 EA does and create unique memory slices for VirtualObject.
>> >> Current C2 EA [1] looks for the latest stores (or initial values) to the object (which has a unique Allocation node id) >> starting from the Safepoint memory input when we replace Allocate with SafePointScalarObject. >> >> You would need to use the VirtualObject node id as the unique instance id. And you need to create separate memory slices for it >> as we do in EA for the Allocation node. >> >> Vladimir K >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/macro.cpp#L452 >> >>> >>> igor >>> >>>> On Oct 6, 2022, at 5:09 PM, Liu, Xin wrote: >>>> >>>> hi, Igor, >>>> >>>> You are right. Cloning the JVMState of the original Allocation Node isn't >>>> the correct behavior. I need the JVMState right at materialization. I >>>> think it is available because we are in the parser. There are 2 places of >>>> materialization: >>>> 1) we are handling the bytecode which causes the object to escape. It's >>>> probably putfield/return/invoke. The current JVMState it is. >>>> 2) we are in MergeProcessor. We need to materialize a virtual object in >>>> its predecessors. We can extract the exiting JVMState from the >>>> predecessor Block. >>>> >>>> I just realized maybe that's one of the reasons Graal saves >>>> 'FrameState' at store nodes. Graal needs to revisit the 'FrameState' >>>> when its PEA phase does materialization in high-tier. >>>> >>>> Apart from safepoint, there's one corner case bothering me. The JLS says >>>> that creation of a class instance may throw an >>>> OOME.(https://docs.oracle.com/javase/specs/jls/se19/html/jls-15.html#jls-15.9.4) >>>> >>>> " >>>> space is allocated for the new class instance. If there is insufficient >>>> space to allocate the object, evaluation of the class instance creation >>>> expression completes abruptly by throwing an OutOfMemoryError.
>>>> " >>>> and it's cross-referenced by the bytecode new in the JVMS >>>> https://docs.oracle.com/javase/specs/jvms/se19/html/jvms-6.html#jvms-6.5.new >>>> >>>> If we have moved the Allocation Node and the JVM happens to run out of >>>> memory, the first frame of the stacktrace will drift a little bit, right? >>>> The bci and source line number will be wrong. Does it matter? I can't >>>> imagine that users' programs rely on this information. >>>> >>>> I think it's possible to amend this bci/line number at the JVMState level. I >>>> will leave it as an open question and revisit it later. >>>> >>>> Do I understand your concern? If it makes sense to you, I will update >>>> the RFC doc. >>>> >>>> thanks, >>>> --lx >>>> >>>> >>>> >>>> >>>> On 10/6/22 3:00 PM, Igor Veresov wrote: >>>>> >>>>> Hi, >>>>> >>>>> You say that when you materialize the clone you plan to have the same jvm state as the original allocation. How is that possible in a general case? There can be arbitrary changes of state between the original allocation point and where the clone materializes. >>>>> >>>>> Igor >>>>> >>>>>> On Oct 6, 2022, at 10:42 AM, Liu, Xin wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> We would like to pursue PEA in HotSpot. I spent time thinking about how to >>>>>> adapt Stadler's Partial Escape Analysis[1] to C2. I think there are 3 >>>>>> elements in it: 1) flow-sensitive escape analysis 2) lazy code motion >>>>>> for the allocation and initialization 3) on-the-fly scalar replacement. >>>>>> The most complex part is 3), and it has already been done by C2. I'd like to leverage >>>>>> that, so I came up with an idea to focus only on escaped objects in the >>>>>> algorithm and delegate others to the existing C2 phases. Here is my RFC. >>>>>> May I get your precious time on this?
>>>>>> >>>>>> https://gist.github.com/navyxliu/62a510a5c6b0245164569745d758935b#rfc-partial-escape-analysis-in-hotspot-c2 >>>>>> >>>>>> The idea is based on the following two observations. >>>>>> >>>>>> 1. Stadler's PEA can cooperate with C2 EA/SR. >>>>>> >>>>>> If an object moves to the place it is about to escape, it won't impact >>>>>> C2 EA/SR later. That's because it will be marked as 'GlobalEscaped', and C2 EA >>>>>> won't do anything for it anyway. >>>>>> >>>>>> If PEA doesn't touch a non-escaped object, it won't change its >>>>>> escapability. It can punt it to C2 EA/SR and the result is still the same. >>>>>> >>>>>> >>>>>> 2. The original AllocationNode is either dead or scalar replaceable >>>>>> after Stadler's PEA. >>>>>> >>>>>> Stadler's algorithm virtualizes an allocation node and materializes it >>>>>> on demand. There are 2 places to materialize it: 1) the virtual object >>>>>> is about to escape; 2) MergeProcessor needs to merge an object and at >>>>>> least one of its predecessors has materialized. MergeProcessor has to >>>>>> materialize all virtual objects in the other predecessors ([1] 5.3, Merge nodes). >>>>>> >>>>>> We can prove observation 2 by proof by contradiction here. >>>>>> Assume the original Allocation node is neither dead nor scalar replaced >>>>>> after Stadler's PEA, and the program is still correct. >>>>>> >>>>>> The program must need the original allocation node somewhere. The algorithm >>>>>> has deleted the original allocation node in the virtualization step and >>>>>> never brings it back. That contradicts the program still being correct. QED. >>>>>> >>>>>> >>>>>> If you're convinced, then we can leverage it. In my design, I don't >>>>>> virtualize the original node but just leave it there. The C2 MacroExpand >>>>>> phase will take care of the original allocation node as long as it's >>>>>> either dead or scalar-replaceable. It never gets a chance to expand.
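As an aside, the virtualize/track-stores/materialize-on-escape life cycle described above can be sketched as a toy model (all names here are invented for illustration; nothing resembles real C2 data structures): an allocation starts virtual, stores only update tracked per-field state, and an actual allocation is emitted only at the point where the object escapes, with the cumulative field values.

```java
import java.util.HashMap;
import java.util.Map;

public class PeaToy {
    static class VirtualState {
        final Map<String, Object> fields = new HashMap<>();
        boolean materialized = false;
    }

    static final Map<Integer, VirtualState> states = new HashMap<>();

    static int allocate(int id) {
        states.put(id, new VirtualState()); // virtual: no allocation emitted yet
        return id;
    }

    static void store(int id, String field, Object value) {
        states.get(id).fields.put(field, value); // bookkeeping only
    }

    static void escape(int id) {
        VirtualState vs = states.get(id);
        if (!vs.materialized) {
            vs.materialized = true; // materialize with the cumulative state
            System.out.println("materialize #" + id + " with " + vs.fields);
        }
    }

    public static void main(String[] args) {
        int obj = allocate(1);
        store(obj, "x", 42); // still virtual
        store(obj, "x", 43); // overwrites the tracked value
        escape(obj);         // only now would an allocation be emitted
    }
}
```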
>>>>>> >>>>>> If we restrain on-the-fly scalar replacement in Stadler's PEA, we can >>>>>> delegate it to C2 EA/SR! There are 3 gains: >>>>>> >>>>>> 1) I don't think I can write bug-free Scalar Replacement... >>>>>> 2) This approach can automatically pick up C2 EA/SR improvements in the >>>>>> future, such as JDK-8289943. >>>>>> 3) If we focus only on 'escaped objects', we don't even need to deal >>>>>> with deoptimization. Only 'scalar replaceable' objects need to save >>>>>> Object states for deoptimization. Escaped objects disqualify that. >>>>>> >>>>>> [1]: Stadler, Lukas, Thomas Würthinger, and Hanspeter Mössenböck. >>>>>> "Partial escape analysis and scalar replacement for Java." Proceedings >>>>>> of Annual IEEE/ACM International Symposium on Code Generation and >>>>>> Optimization. 2014. >>>>>> >>>>>> thanks, >>>>>> --lx >>>>>> >>>> >> > From kvn at openjdk.org Tue Oct 11 15:56:29 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Oct 2022 15:56:29 GMT Subject: RFR: 8288043: Optimize FP to word/sub-word integral type conversion on X86 AVX2 platforms [v9] In-Reply-To: References: Message-ID: On Tue, 11 Oct 2022 12:30:18 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch extends conversion optimizations added with [JDK-8287835](https://bugs.openjdk.org/browse/JDK-8287835) to optimize following floating point to integral conversions for X86 AVX2 targets:- >> * D2I , D2S, D2B, F2I , F2S, F2B >> >> In addition, it also optimizes following wide vector (64 bytes) double to integer and sub-type conversions for AVX512 targets which do not support AVX512DQ feature. >> * D2I, D2S, D2B >> >> Following are the JMH micro performance results with and without patch.
>> >> System configuration: 40C 2S Icelake server (Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz) >> >> BENCHMARK | SIZE | BASELINE (ops/ms) | WITHOPT (ops/ms) | PERF GAIN FACTOR >> -- | -- | -- | -- | -- >> VectorFPtoIntCastOperations.microDouble128ToByte128 | 1024 | 90.603 | 92.797 | 1.024215534 >> VectorFPtoIntCastOperations.microDouble128ToByte256 | 1024 | 81.909 | 82.3 | 1.00477359 >> VectorFPtoIntCastOperations.microDouble128ToByte512 | 1024 | 26.181 | 26.244 | 1.002406325 >> VectorFPtoIntCastOperations.microDouble128ToInteger128 | 1024 | 90.74 | 2537.958 | 27.96956138 >> VectorFPtoIntCastOperations.microDouble128ToInteger256 | 1024 | 81.586 | 2429.599 | 29.7796068 >> VectorFPtoIntCastOperations.microDouble128ToInteger512 | 1024 | 19.406 | 19.61 | 1.010512213 >> VectorFPtoIntCastOperations.microDouble128ToLong128 | 1024 | 91.723 | 90.754 | 0.989435583 >> VectorFPtoIntCastOperations.microDouble128ToShort128 | 1024 | 91.766 | 1984.577 | 21.62649565 >> VectorFPtoIntCastOperations.microDouble128ToShort256 | 1024 | 81.949 | 1940.599 | 23.68056962 >> VectorFPtoIntCastOperations.microDouble128ToShort512 | 1024 | 16.468 | 16.56 | 1.005586592 >> VectorFPtoIntCastOperations.microDouble256ToByte128 | 1024 | 163.331 | 3018.351 | 18.479964 >> VectorFPtoIntCastOperations.microDouble256ToByte256 | 1024 | 148.878 | 3082.034 | 20.70174237 >> VectorFPtoIntCastOperations.microDouble256ToByte512 | 1024 | 50.108 | 51.629 | 1.030354434 >> VectorFPtoIntCastOperations.microDouble256ToInteger128 | 1024 | 159.805 | 4619.421 | 28.90661118 >> VectorFPtoIntCastOperations.microDouble256ToInteger256 | 1024 | 143.876 | 4649.642 | 32.31700909 >> VectorFPtoIntCastOperations.microDouble256ToInteger512 | 1024 | 38.127 | 38.188 | 1.001599916 >> VectorFPtoIntCastOperations.microDouble256ToLong128 | 1024 | 160.322 | 162.442 | 1.013223388 >> VectorFPtoIntCastOperations.microDouble256ToLong256 | 1024 | 141.252 | 143.01 | 1.012445841 >> VectorFPtoIntCastOperations.microDouble256ToShort128 | 1024 | 
157.717 | 3757.471 | 23.82413437 >> VectorFPtoIntCastOperations.microDouble256ToShort256 | 1024 | 143.876 | 3830.971 | 26.62689399 >> VectorFPtoIntCastOperations.microDouble256ToShort512 | 1024 | 32.061 | 32.911 | 1.026511962 >> VectorFPtoIntCastOperations.microFloat128ToByte128 | 1024 | 146.599 | 4002.967 | 27.30555461 >> VectorFPtoIntCastOperations.microFloat128ToByte256 | 1024 | 136.99 | 3938.799 | 28.75245638 >> VectorFPtoIntCastOperations.microFloat128ToByte512 | 1024 | 51.561 | 50.284 | 0.975233219 >> VectorFPtoIntCastOperations.microFloat128ToInteger128 | 1024 | 5933.565 | 5361.472 | 0.903583596 >> VectorFPtoIntCastOperations.microFloat128ToInteger256 | 1024 | 5079.564 | 5062.046 | 0.996551279 >> VectorFPtoIntCastOperations.microFloat128ToInteger512 | 1024 | 37.101 | 38.419 | 1.035524649 >> VectorFPtoIntCastOperations.microFloat128ToLong128 | 1024 | 145.863 | 145.362 | 0.99656527 >> VectorFPtoIntCastOperations.microFloat128ToLong256 | 1024 | 131.159 | 133.154 | 1.015210546 >> VectorFPtoIntCastOperations.microFloat128ToShort128 | 1024 | 145.966 | 4150.039 | 28.4315457 >> VectorFPtoIntCastOperations.microFloat128ToShort256 | 1024 | 134.703 | 4566.589 | 33.90116775 >> VectorFPtoIntCastOperations.microFloat128ToShort512 | 1024 | 31.878 | 30.867 | 0.968285338 >> VectorFPtoIntCastOperations.microFloat256ToByte128 | 1024 | 237.841 | 6292.051 | 26.4548627 >> VectorFPtoIntCastOperations.microFloat256ToByte256 | 1024 | 222.041 | 6292.748 | 28.34047766 >> VectorFPtoIntCastOperations.microFloat256ToByte512 | 1024 | 92.073 | 88.981 | 0.966417951 >> VectorFPtoIntCastOperations.microFloat256ToInteger128 | 1024 | 11471.121 | 10269.636 | 0.895260019 >> VectorFPtoIntCastOperations.microFloat256ToInteger256 | 1024 | 10729.816 | 10105.92 | 0.941853989 >> VectorFPtoIntCastOperations.microFloat256ToInteger512 | 1024 | 68.328 | 70.005 | 1.024543379 >> VectorFPtoIntCastOperations.microFloat256ToLong128 | 1024 | 247.101 | 248.571 | 1.005948984 >> 
VectorFPtoIntCastOperations.microFloat256ToLong256 | 1024 | 225.74 | 223.987 | 0.992234429 >> VectorFPtoIntCastOperations.microFloat256ToLong512 | 1024 | 76.39 | 76.187 | 0.997342584 >> VectorFPtoIntCastOperations.microFloat256ToShort128 | 1024 | 233.196 | 8202.179 | 35.17289748 >> VectorFPtoIntCastOperations.microFloat256ToShort256 | 1024 | 220.75 | 7781.073 | 35.24834881 >> VectorFPtoIntCastOperations.microFloat256ToShort512 | 1024 | 58.143 | 55.633 | 0.956830573 >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8288043 > - 8288043: Review comments resolutions. > - 8288043: Adding descriptive comments. > - 8288043: cost adjustments for loop body size estimation. > - 8288043: Extending exiting regressions with more cases. > - 8288043: Code re-factoring. > - 8288043: Some mainline merge realted cleanups. > - 8288043: Adding a descriptive comment for removing explicit scratch registers needed to load stub constants. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8288043 > - 8288043: Adding a descriptive comment. > - ... 
and 3 more: https://git.openjdk.org/jdk/compare/9d116ec1...6fb1a5d9 I started new testing ------------- PR: https://git.openjdk.org/jdk/pull/9748 From ihse at openjdk.org Tue Oct 11 16:00:27 2022 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Tue, 11 Oct 2022 16:00:27 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src In-Reply-To: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: On Mon, 10 Oct 2022 06:32:09 GMT, Dingli Zhang wrote: > I built hsdis with the following parameters from source code of binutils while cross-compiling: > > --with-hsdis=binutils \ > --with-binutils-src=/home/dingli/jdk-tools/binutils-2.38 > > > But configure will exit with the following error: > > checking whether we are cross compiling... configure: error: in `/home/dingli/jdk-tools/binutils-2.38-src': > configure: error: cannot run C compiled programs. > If you meant to cross compile, use `--host'. > See `config.log' for more details > configure: Automatic building of binutils failed on configure. Try building it manually > configure: error: Cannot continue > configure exiting with result code 1 > > > The reason for the error is that binutils wants to be configured with --host during cross-compilation. 
So we can determine if we are currently cross-compiling and add the --host parameter to binutils_target: > > diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 > index d72bbf6df32..dddc1cf6a4d 100644 > --- a/make/autoconf/lib-hsdis.m4 > +++ b/make/autoconf/lib-hsdis.m4 > @@ -175,7 +175,11 @@ AC_DEFUN([LIB_BUILD_BINUTILS], > fi > else > binutils_cc="$CC $SYSROOT_CFLAGS" > - binutils_target="" > + if test "x$host" = "x$build"; then > + binutils_target="" > + else > + binutils_target="--host=$host" > + fi > fi > binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" > > > > In the meantime, I removed some useless code about hsdis-demo because hsdis-demo.c was removed in [JDK-8275128](https://bugs.openjdk.org/browse/JDK-8275128) . > > ## Testing: > > - cross compile for RISC-V on x86_64 Also, I think we can skip the demo Makefile changes from this PR. In fact, I'm surprised `src/utils/hsdis/Makefile` is still there. I was sure I removed it when I rewrote the build of hsdis to utilize the normal build system with JDK-8188073. I'd rather open a separate issue and just remove the file. ------------- PR: https://git.openjdk.org/jdk/pull/10628 From ihse at openjdk.org Tue Oct 11 16:03:20 2022 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Tue, 11 Oct 2022 16:03:20 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src In-Reply-To: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: On Mon, 10 Oct 2022 06:32:09 GMT, Dingli Zhang wrote: > I built hsdis with the following parameters from source code of binutils while cross-compiling: > > --with-hsdis=binutils \ > --with-binutils-src=/home/dingli/jdk-tools/binutils-2.38 > > > But configure will exit with the following error: > > checking whether we are cross compiling... 
configure: error: in `/home/dingli/jdk-tools/binutils-2.38-src': > configure: error: cannot run C compiled programs. > If you meant to cross compile, use `--host'. > See `config.log' for more details > configure: Automatic building of binutils failed on configure. Try building it manually > configure: error: Cannot continue > configure exiting with result code 1 > > > The reason for the error is that binutils wants to be configured with --host during cross-compilation. So we can determine if we are currently cross-compiling and add the --host parameter to binutils_target: > > diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 > index d72bbf6df32..dddc1cf6a4d 100644 > --- a/make/autoconf/lib-hsdis.m4 > +++ b/make/autoconf/lib-hsdis.m4 > @@ -175,7 +175,11 @@ AC_DEFUN([LIB_BUILD_BINUTILS], > fi > else > binutils_cc="$CC $SYSROOT_CFLAGS" > - binutils_target="" > + if test "x$host" = "x$build"; then > + binutils_target="" > + else > + binutils_target="--host=$host" > + fi > fi > binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" > > > > In the meantime, I removed some useless code about hsdis-demo because hsdis-demo.c was removed in [JDK-8275128](https://bugs.openjdk.org/browse/JDK-8275128) . > > ## Testing: > > - cross compile for RISC-V on x86_64 See https://github.com/openjdk/jdk/pull/10660 ([JDK-8295163](https://bugs.openjdk.org/browse/JDK-8295163)) for the removal of the Makefile. ------------- PR: https://git.openjdk.org/jdk/pull/10628 From ihse at openjdk.org Tue Oct 11 16:07:14 2022 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Tue, 11 Oct 2022 16:07:14 GMT Subject: RFR: 8295163: Remove old hsdis Makefile Message-ID: For some reason the old Makefile for hsdis was not removed when the build was moved into the normal build system in [JDK-8188073](https://bugs.openjdk.org/browse/JDK-8188073). This should be fixed. 
------------- Commit messages: - 8295163: Remove old hsdis Makefile Changes: https://git.openjdk.org/jdk/pull/10660/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10660&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295163 Stats: 214 lines in 1 file changed: 0 ins; 214 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10660.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10660/head:pull/10660 PR: https://git.openjdk.org/jdk/pull/10660 From svkamath at openjdk.org Tue Oct 11 17:03:24 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Tue, 11 Oct 2022 17:03:24 GMT Subject: RFR: 8289552: Make intrinsic conversions between bit representations of half precision values and floats [v13] In-Reply-To: <66be8SJdxPOqmqsQ1YIwS4zM4GwPerypGIf8IbfxhRs=.1d03c94a-f3e5-40ae-999e-bdd5f328170d@github.com> References: <66be8SJdxPOqmqsQ1YIwS4zM4GwPerypGIf8IbfxhRs=.1d03c94a-f3e5-40ae-999e-bdd5f328170d@github.com> Message-ID: On Mon, 10 Oct 2022 21:05:58 GMT, Vladimir Kozlov wrote: >> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: >> >> Updated instruct to use kmovw > > I started new testing. @vnkozlov Thank you for reviewing the patch. ------------- PR: https://git.openjdk.org/jdk/pull/9781 From svkamath at openjdk.org Tue Oct 11 17:08:51 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Tue, 11 Oct 2022 17:08:51 GMT Subject: Integrated: 8289552: Make intrinsic conversions between bit representations of half precision values and floats In-Reply-To: References: Message-ID: On Fri, 5 Aug 2022 16:36:23 GMT, Smita Kamath wrote: > 8289552: Make intrinsic conversions between bit representations of half precision values and floats This pull request has now been integrated. 
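For reference, the conversions intrinsified by this change correspond to the `Float.floatToFloat16`/`Float.float16ToFloat` API pair added for JDK 20. A minimal round-trip (my own sketch, not code from the PR) looks like this:

```java
// Round-trip a float through its IEEE 754 binary16 bit representation.
// Requires a JDK where Float.floatToFloat16/float16ToFloat exist (JDK 20+);
// the integrated change above makes these methods intrinsic on supported x86.
public class HalfRoundTrip {
    public static void main(String[] args) {
        short bits = Float.floatToFloat16(1.5f); // 0x3E00: sign 0, exp 15, frac 0x200
        float back = Float.float16ToFloat(bits);
        System.out.printf("bits=0x%04X back=%s%n", bits, back); // bits=0x3E00 back=1.5
    }
}
```

Values exactly representable in binary16, such as 1.5f, survive the round trip unchanged; others are rounded to the nearest half-precision value on the way in.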
Changeset: 07946aa4 Author: Smita Kamath Committer: Sandhya Viswanathan URL: https://git.openjdk.org/jdk/commit/07946aa49c97c93bd11675a9b0b90d07c83f2a94 Stats: 350 lines in 19 files changed: 339 ins; 5 del; 6 mod 8289552: Make intrinsic conversions between bit representations of half precision values and floats Reviewed-by: kvn, sviswanathan, jbhateja ------------- PR: https://git.openjdk.org/jdk/pull/9781
From eastigeevich at openjdk.org Tue Oct 11 17:36:20 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 11 Oct 2022 17:36:20 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v4] In-Reply-To: <3kOvEAlksouNjqXDcn3XNuJj97kx3uhj8UzlmZIYq_o=.517b466d-4577-4c11-b5d9-7709176136cf@github.com> References: <3kOvEAlksouNjqXDcn3XNuJj97kx3uhj8UzlmZIYq_o=.517b466d-4577-4c11-b5d9-7709176136cf@github.com> Message-ID: On Tue, 4 Oct 2022 20:31:44 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > cleanup src/hotspot/share/code/compressedStream.cpp line 198: > 196: if (nsize < min_expansion*2) { > 197: nsize = min_expansion*2; > 198: } We will not need this code if we initialise `_size` to `max2(initial_size, UNSIGNED5::MAX_LENGTH)` in the constructor. I don't think an `initial_size` smaller than `UNSIGNED5::MAX_LENGTH` makes sense. `grow()` is invoked when `_position >= _size`, so there are two cases: 1. `_position == _size` 2. `_position > _size` `_position < 2 * _size` will be satisfied for case 1, but how do you guarantee `_position < 2 * _size` for case 2?
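One hedged way to answer the question above — a sketch only, not the HotSpot implementation, with field names borrowed from the quoted snippet — is to loop the doubling so that case 2 (`_position > _size`) is handled as well:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Sketch of a growth policy covering both cases from the review comment:
// keep doubling until the buffer is strictly larger than _position, so a
// caller that moved _position well past _size is still safe.
struct SketchStream {
  unsigned char* _buffer;
  int _position;
  int _size;

  explicit SketchStream(int initial_size) : _position(0), _size(initial_size) {
    _buffer = static_cast<unsigned char*>(calloc(_size, 1));
  }
  ~SketchStream() { free(_buffer); }

  void grow() {
    int nsize = _size * 2;
    while (nsize <= _position) {  // the loop handles _position > _size
      nsize *= 2;
    }
    unsigned char* nbuf = static_cast<unsigned char*>(calloc(nsize, 1));
    memcpy(nbuf, _buffer, _size);
    free(_buffer);
    _buffer = nbuf;
    _size = nsize;
  }

  void set_position(int pos) {  // may jump _position far past _size
    _position = pos;
    if (_position >= _size) grow();
  }
};
```

The constructor-side fix suggested in the comment (clamping with `max2(initial_size, UNSIGNED5::MAX_LENGTH)`) is the simpler route; the loop above is just one way to make `grow()` itself defensive against case 2.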
src/hotspot/share/code/compressedStream.hpp line 119: > 117: u_char* _buffer; > 118: int _position; // current byte offset > 119: size_t _byte_pos {0}; // current bit offset Is it a bit offset within the byte at `_position`? The name `_byte_pos` is not clear. ------------- PR: https://git.openjdk.org/jdk/pull/10025
From kvn at openjdk.org Tue Oct 11 17:41:21 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Oct 2022 17:41:21 GMT Subject: RFR: 8293531: C2: some vectorapi tests fail assert "Not monotonic" with flag -XX:TypeProfileLevel=222 In-Reply-To: References: Message-ID: On Tue, 11 Oct 2022 12:19:05 GMT, Jatin Bhateja wrote: > The problem occurs in the iterative dataflow analysis during the CCP optimization: the meet operation drops speculative types before the participating lattice values converge, since the [include_speculative argument it receives is always set to false](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/type.hpp#L231), whereas the [equality check](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/phaseX.cpp#L1751) in the failing assertion is performed against the original type, which still carries the speculative type. > > To fix this, the type comparison in the assertion should also be done after stripping the speculative type. With this change, the intermittent assertion failures in several Vector API tests reported in the bug are no longer seen. > > Kindly review and share your feedback. > > Best Regards, > Jatin Looks reasonable. Do you know why only the Vector API tests were affected? I will test it.
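To see why comparing against the unstripped type trips the assert, here is a toy model (entirely invented types — the real HotSpot `Type::meet` and `remove_speculative` live in `type.hpp` and are far richer):

```cpp
#include <algorithm>
#include <cassert>

// Toy lattice element: a base component plus an optional speculative
// refinement (0 == none). meet() mimics CCP's behaviour of never
// including speculative parts in its result.
struct ToyType {
  int base;
  int speculative;

  ToyType remove_speculative() const { return ToyType{base, 0}; }
  bool operator==(const ToyType& o) const {
    return base == o.base && speculative == o.speculative;
  }
};

ToyType meet(const ToyType& a, const ToyType& b) {
  // include_speculative is effectively false: the result carries none.
  return ToyType{std::min(a.base, b.base), 0};
}
```

This is only an analogy for the fix described above: a meet of a type with itself looks "non-monotonic" if the comparison keeps the speculative part, so both sides must be stripped before asserting equality.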
------------- PR: https://git.openjdk.org/jdk/pull/10648
From xxinliu at amazon.com Tue Oct 11 19:12:46 2022 From: xxinliu at amazon.com (Liu, Xin) Date: Tue, 11 Oct 2022 12:12:46 -0700 Subject: [External] : Re: RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: References: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> <94A2737B-0FDE-4159-827F-15B6ACE1BC42@oracle.com> <114af950-f6b6-7e4a-8ac0-3da99bd40297@amazon.com> Message-ID: <2f29160c-7368-7c11-924e-a626e42c3aa2@amazon.com> Hi, Vladimir and Igor, Thank you for your comments. I have just started. My first target is 'Figure-1' in the RFC. There are only 3 blocks and no merge and no alias. Class Object is so trivial that we even don't need to initialize it (no field to initialize). I expect to get code like this after the parser runs with a new flag. private Object _cache; public void foo(boolean cond) { Object x = new Object(); if (cond) { Object x1 = new Object(); // clone the object right before it escapes _cache = x1; } } I know it's over-simplified. I am going to test whether it is possible to implement the algorithm in the parser and how C2 EA/SR interacts with the obsolete object x. Like you said, I am going to focus on ordinary objects first. Either good or bad, I will update my progress and results. I will see how to leverage Cesar's test and microbenchmark too. That's my intention too. thanks, --lx On 10/11/22 8:44 AM, Vladimir Kozlov wrote: > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > Hi Liu, > > To clarify, I think your proposal is reasonable and we would like you to continue to work on it. > Can you share details of its current status? > > Igor and I commented about C2 specifics you need to take into account. > > I can suggest to start with simple things: only objects instances. Arrays can be covered later.
> > You can use the test from Cesar's PR [1] to test your implementation and extend it with additional cases. > > Thanks, > Vladimir K > > [1] https://github.com/openjdk/jdk/pull/9073 > > On 10/7/22 3:26 PM, Liu, Xin wrote: >> Hi, Igor and Vladimir, >> >> I am not inventing anything new. All I am thinking is how to adapt >> Stadler's algorithm to C2. All innovation belong to the author. >> >> Figure-3 of my RFC is a copy of Listing-7 in his paper. Allow me to >> repeat his data structure here. I drop "class Id" because I think I can >> use AllocationNode pointer or even node idx instead. >> >> // this is per allocation, identified by 'Id'. >> class VirtualState: extends ObjectState { >> int lockCount; >> Node[] entries; >> }; >> >> // this is per basic-block >> class State { >> Map state; >> Map alias; >> }; >> >> In basic block, PEA keeps tracking the allocation state of an object >> using VirtualState. In his paper, Figure-4 (b) and (e) depict how the >> algorithm tracks stores. To get flow-sensitive information, Stadler >> iterates the scheduled nodes in a basic block. I propose to iterate >> bytecodes within a basic block. >> >>> when you rematerialize the object, it consumes the current updated >> values to construct it. How to you intend to track those? >>>> Yes, you either track stores in Parser or do what current C2 EA does >> and create unique memory slices for VirtualObject. >> >> I plan to follow suit and track stores in parser! I also need to create >> a unique memory slice when I have to materialize a virtual object. This >> is for InitializeNode and I need to initialize the object to the >> cumulative state. >> >> thanks, >> --lx >> >> >> >> >> On 10/7/22 1:21 PM, Vladimir Kozlov wrote: >>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. 
>>> >>> >>> >>> On 10/7/22 10:37 AM, Igor Veresov wrote: >>>> The major difference between Graal and C2 is that graal captures the state at side effects and C2 captures the state at deopt points. That allows Graal to deduce state at any time, including when it needs to insert a rematerializing allocation during PEA. So, with C2 you have to either do everything in the parser as you are proposing or do the same thing as Graal and at least capture the state for stores. Having a state different from the original allocation point is ok. Both Graal and C2 would throw OOMs from place that could be far from the original point because of the EA. >>>> >>>> I think you also have to track the values of all of the object components, right? So when you rematerialize the object, it consumes the current updated values to construct it. How to you intend to track those? >>> >>> Yes, you either track stores in Parser or do what current C2 EA does and create unique memory slices for VirtualObject. >>> >>> Current C2 EA [1] looks for latest stores (or initial values) to the object (which has unique Aloccation node id) >>> staring from Safepoint memory input when we replace Allocate with SafePointScalarObject. >>> >>> You would need to use VirtualObject node id as unique instance id. And you need to create separate memory slices for it >>> as we do in EA for Allocation node. >>> >>> Vladimir K >>> >>> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/macro.cpp#L452 >>> >>>> >>>> igor >>>> >>>>> On Oct 6, 2022, at 5:09 PM, Liu, Xin wrote: >>>>> >>>>> hi, Ignor, >>>>> >>>>> You are right. Cloning the JVMState of original Allocation Node isn't >>>>> the correct behavior. I need the JVMState right at materialization. I >>>>> think it is available because we are in parser. For 2 places of >>>>> materialization: >>>>> 1) we are handling the bytecode which causes the object to escape. It's >>>>> probably putfield/return/invoke. Current JVMState it is. 
>>>>> 2) we are in MergeProcessor. We need to materialize a virtual object in >>>>> its predecessors. We can extract the exiting JVMState from the >>>>> predecessor Block. >>>>> >>>>> I just realize maybe that's the one of the reasons Graal saves >>>>> 'FrameState' at store nodes. Graal needs to revisit the 'FrameState' >>>>> when its PEA phase does materialization in high-tier. >>>>> >>>>> Apart from safepoint, there's one corner case bothering me. JLS says >>>>> that creation of a class instance may throw an >>>>> OOME.(https://docs.oracle.com/javase/specs/jls/se19/html/jls-15.html#jls-15.9.4) >>>>> >>>>> " >>>>> space is allocated for the new class instance. If there is insufficient >>>>> space to allocate the object, evaluation of the class instance creation >>>>> expression completes abruptly by throwing an OutOfMemoryError. >>>>> " >>>>> >>>>> and it's cross-referenced by bytecode new in JVMS >>>>> https://docs.oracle.com/javase/specs/jvms/se19/html/jvms-6.html#jvms-6.5.new >>>>> >>>>> If we have moved the Allocation Node and JVM happens to run out of >>>>> memory, the first frame of stacktrace will drift a little bit, right? >>>>> The bci and source linenum will be wrong. Does it matter? I can't >>>>> imagine that user's programs rely on this information. >>>>> >>>>> I think it's possible to amend this bci/line number in JVMState level. I >>>>> will leave it as an open question and revisit it later. >>>>> >>>>> Do I understand your concern? if it makes sense to you, I will update >>>>> the RFC doc. >>>>> >>>>> thanks, >>>>> --lx >>>>> >>>>> >>>>> >>>>> >>>>> On 10/6/22 3:00 PM, Igor Veresov wrote: >>>>>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >>>>>> >>>>>> >>>>>> >>>>>> Hi, >>>>>> >>>>>> You say that when you materialize the clone you plan to have the same jvm state as the original allocation. 
How is that possible in a general case? There can be arbitrary changes of state between the original allocation point and where the clone materializes. >>>>>> >>>>>> Igor >>>>>> >>>>>>> On Oct 6, 2022, at 10:42 AM, Liu, Xin wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> We would like to pursuit PEA in HotSpot. I spent time thinking how to >>>>>>> adapt Stadler's Partial Escape Analysis[1] to C2. I think there are 3 >>>>>>> elements in it. 1) flow-sensitive escape analysis 2) lazy code motion >>>>>>> for the allocation and initialization 3) on-the-fly scalar replacement. >>>>>>> The most complex part is 3) and it has done by C2. I'd like to leverage >>>>>>> that, so I come up an idea to focus only on escaped objects in the >>>>>>> algorithm and delegate others to the existing C2 phases. Here is my RFC. >>>>>>> May I get your precious time on this? >>>>>>> >>>>>>> https://gist.github.com/navyxliu/62a510a5c6b0245164569745d758935b#rfc-partial-escape-analysis-in-hotspot-c2 >>>>>>> >>>>>>> The idea is based on the following two observations. >>>>>>> >>>>>>> 1. Stadler's PEA can cooperate with C2 EA/SR. >>>>>>> >>>>>>> If an object moves to the place it is about to escape, it won't impact >>>>>>> C2 EA/SR later. It's because it will be marked as 'GlobalEscaped'. C2 EA >>>>>>> won't do anything for it anyway. >>>>>>> >>>>>>> If PEA don't touch a non-escaped object, it won't change its >>>>>>> escapability. It can punt it to C2 EA/SR and the result is still same. >>>>>>> >>>>>>> >>>>>>> 2. The original AllocationNode is either dead or scalar replaceable >>>>>>> after Stadler's PEA. >>>>>>> >>>>>>> Stadler's algorithm virtualizes an allocation Node and materializes it >>>>>>> on demand. There are 2 places to materialize it. 1) the virtual object >>>>>>> is about to escape 2) MergeProcessor needs to merge an object and at >>>>>>> least one of its predecessor has materialized. 
MergeProcessor has to >>>>>>> materialize all virtual objects in other predecessors([1] 5.3, Merge nodes). >>>>>>> >>>>>>> We can prove the observation 2 using 'proof of contradiction' here. >>>>>>> Assume the original Allocation node is neither dead nor Scalar Replaced >>>>>>> after Stadler's PEA, and program is still correct. >>>>>>> >>>>>>> Program must need the original allocation node somewhere. The algorithm >>>>>>> has deleted the original allocation node in virtualization step and >>>>>>> never bring it back. It contradicts that the program is still correct. QED. >>>>>>> >>>>>>> >>>>>>> If you're convinced, then we can leverage it. In my design, I don't >>>>>>> virtualize the original node but just leave it there. C2 MacroExpand >>>>>>> phase will take care of the original allocation node as long as it's >>>>>>> either dead or scalar-replaceable. It never get a chance to expand. >>>>>>> >>>>>>> If we restrain on-the-fly scalar replacement in Stadler's PEA, we can >>>>>>> delegate it to C2 EA/SR! There are 3 gains: >>>>>>> >>>>>>> 1) I don't think I can write bug-free Scalar Replacement... >>>>>>> 2) This approach can automatically pick up C2 EA/SR improvements in the >>>>>>> future, such as JDK-8289943. >>>>>>> 3) If we focus only on 'escaped objects', we even don't need to deal >>>>>>> with deoptimization. Only 'scalar replaceable' objects need to save >>>>>>> Object states for deoptimization. Escaped objects disqualify that. >>>>>>> >>>>>>> [1]: Stadler, Lukas, Thomas Würthinger, and Hanspeter Mössenböck. >>>>>>> "Partial escape analysis and scalar replacement for Java." Proceedings >>>>>>> of Annual IEEE/ACM International Symposium on Code Generation and >>>>>>> Optimization. 2014. >>>>>>> >>>>>>> thanks, >>>>>>> --lx >>>>>>> >>>>> >>>> -------------- next part -------------- A non-text attachment was scrubbed...
From vladimir.x.ivanov at oracle.com Tue Oct 11 20:00:26 2022 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Tue, 11 Oct 2022 13:00:26 -0700 Subject: [EXTERNAL][EXTERNAL][External] : Re: RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: <2f29160c-7368-7c11-924e-a626e42c3aa2@amazon.com> References: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> <94A2737B-0FDE-4159-827F-15B6ACE1BC42@oracle.com> <114af950-f6b6-7e4a-8ac0-3da99bd40297@amazon.com> <2f29160c-7368-7c11-924e-a626e42c3aa2@amazon.com> Message-ID: <6d5c2aa5-c684-bc42-765d-ed116d3ef43c@oracle.com> I'd suggest an even simpler case to start with: void test(boolean unlikely_condition) { MyClass obj = new MyClass(); if (unlikely_condition) { doCall(obj); // escape point; not inlined } } and try to turn it into: void test(boolean unlikely_condition) { if (unlikely_condition) { doCall(new MyClass()); // escape point; not inlined } } It allows you to not bother about JVM state at all, because there's already a valid one captured by the call. Best regards, Vladimir Ivanov On 10/11/22 12:12, Liu, Xin wrote: > Hi, Vladimir and Igor, > > Thanks you for your comments. > > I just start it. My first target is 'Figure-1' in the RFC. There are > only 3 blocks and no merge and no alias. Class Object is so trivial that > we even don't need to initialize it (no field to initialize). > > I expect to get code like this after parser with a new flag. > > private Object _cache; > public void foo(boolean cond) { > Object x = new Object(); > > if (cond) { > Object x1 = new Object(); (clone object right before escapement) > _cache = x1; > } > } > > I know it's over-simplified.
I am going to test whether it is possible > to implement the algorithm in parser and how C2 EA/SR interacts with the > obsolete object x. > > Like you said, I am going to focus on ordinary objects first. Either > good or bad, I will update my progress and results. I will see how to > leverage Cesar's test and microbenchmark too. That's my intention too. > > thanks, > --lx > > On 10/11/22 8:44 AM, Vladimir Kozlov wrote: >> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >> >> >> >> Hi Liu, >> >> To clarify, I think your proposal is reasonable and we would like you to continue to work on it. >> Can you share details of its current status? >> >> Igor and I commented about C2 specifics you need to take into account. >> >> I can suggest to start with simple things: only objects instances. Arrays can be covered later. >> >> You can use the test from Cesar's PR [1] to test your implementation and extend it with additional cases. >> >> Thanks, >> Vladimir K >> >> [1] https://github.com/openjdk/jdk/pull/9073 >> >> On 10/7/22 3:26 PM, Liu, Xin wrote: >>> Hi, Igor and Vladimir, >>> >>> I am not inventing anything new. All I am thinking is how to adapt >>> Stadler's algorithm to C2. All innovation belong to the author. >>> >>> Figure-3 of my RFC is a copy of Listing-7 in his paper. Allow me to >>> repeat his data structure here. I drop "class Id" because I think I can >>> use AllocationNode pointer or even node idx instead. >>> >>> // this is per allocation, identified by 'Id'. >>> class VirtualState: extends ObjectState { >>> int lockCount; >>> Node[] entries; >>> }; >>> >>> // this is per basic-block >>> class State { >>> Map state; >>> Map alias; >>> }; >>> >>> In basic block, PEA keeps tracking the allocation state of an object >>> using VirtualState. In his paper, Figure-4 (b) and (e) depict how the >>> algorithm tracks stores. 
To get flow-sensitive information, Stadler >>> iterates the scheduled nodes in a basic block. I propose to iterate >>> bytecodes within a basic block. >>> >>>> when you rematerialize the object, it consumes the current updated >>> values to construct it. How to you intend to track those? >>>>> Yes, you either track stores in Parser or do what current C2 EA does >>> and create unique memory slices for VirtualObject. >>> >>> I plan to follow suit and track stores in parser! I also need to create >>> a unique memory slice when I have to materialize a virtual object. This >>> is for InitializeNode and I need to initialize the object to the >>> cumulative state. >>> >>> thanks, >>> --lx >>> >>> >>> >>> >>> On 10/7/22 1:21 PM, Vladimir Kozlov wrote: >>>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >>>> >>>> >>>> >>>> On 10/7/22 10:37 AM, Igor Veresov wrote: >>>>> The major difference between Graal and C2 is that graal captures the state at side effects and C2 captures the state at deopt points. That allows Graal to deduce state at any time, including when it needs to insert a rematerializing allocation during PEA. So, with C2 you have to either do everything in the parser as you are proposing or do the same thing as Graal and at least capture the state for stores. Having a state different from the original allocation point is ok. Both Graal and C2 would throw OOMs from place that could be far from the original point because of the EA. >>>>> >>>>> I think you also have to track the values of all of the object components, right? So when you rematerialize the object, it consumes the current updated values to construct it. How to you intend to track those? >>>> >>>> Yes, you either track stores in Parser or do what current C2 EA does and create unique memory slices for VirtualObject. 
>>>> >>>> Current C2 EA [1] looks for latest stores (or initial values) to the object (which has unique Aloccation node id) >>>> staring from Safepoint memory input when we replace Allocate with SafePointScalarObject. >>>> >>>> You would need to use VirtualObject node id as unique instance id. And you need to create separate memory slices for it >>>> as we do in EA for Allocation node. >>>> >>>> Vladimir K >>>> >>>> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/macro.cpp#L452 >>>> >>>>> >>>>> igor >>>>> >>>>>> On Oct 6, 2022, at 5:09 PM, Liu, Xin wrote: >>>>>> >>>>>> hi, Ignor, >>>>>> >>>>>> You are right. Cloning the JVMState of original Allocation Node isn't >>>>>> the correct behavior. I need the JVMState right at materialization. I >>>>>> think it is available because we are in parser. For 2 places of >>>>>> materialization: >>>>>> 1) we are handling the bytecode which causes the object to escape. It's >>>>>> probably putfield/return/invoke. Current JVMState it is. >>>>>> 2) we are in MergeProcessor. We need to materialize a virtual object in >>>>>> its predecessors. We can extract the exiting JVMState from the >>>>>> predecessor Block. >>>>>> >>>>>> I just realize maybe that's the one of the reasons Graal saves >>>>>> 'FrameState' at store nodes. Graal needs to revisit the 'FrameState' >>>>>> when its PEA phase does materialization in high-tier. >>>>>> >>>>>> Apart from safepoint, there's one corner case bothering me. JLS says >>>>>> that creation of a class instance may throw an >>>>>> OOME.(https://docs.oracle.com/javase/specs/jls/se19/html/jls-15.html#jls-15.9.4) >>>>>> >>>>>> " >>>>>> space is allocated for the new class instance. If there is insufficient >>>>>> space to allocate the object, evaluation of the class instance creation >>>>>> expression completes abruptly by throwing an OutOfMemoryError. 
>>>>>> " >>>>>> >>>>>> and it's cross-referenced by bytecode new in JVMS >>>>>> https://docs.oracle.com/javase/specs/jvms/se19/html/jvms-6.html#jvms-6.5.new >>>>>> >>>>>> If we have moved the Allocation Node and JVM happens to run out of >>>>>> memory, the first frame of stacktrace will drift a little bit, right? >>>>>> The bci and source linenum will be wrong. Does it matter? I can't >>>>>> imagine that user's programs rely on this information. >>>>>> >>>>>> I think it's possible to amend this bci/line number in JVMState level. I >>>>>> will leave it as an open question and revisit it later. >>>>>> >>>>>> Do I understand your concern? if it makes sense to you, I will update >>>>>> the RFC doc. >>>>>> >>>>>> thanks, >>>>>> --lx >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On 10/6/22 3:00 PM, Igor Veresov wrote: >>>>>>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> You say that when you materialize the clone you plan to have the same jvm state as the original allocation. How is that possible in a general case? There can be arbitrary changes of state between the original allocation point and where the clone materializes. >>>>>>> >>>>>>> Igor >>>>>>> >>>>>>>> On Oct 6, 2022, at 10:42 AM, Liu, Xin wrote: >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> We would like to pursuit PEA in HotSpot. I spent time thinking how to >>>>>>>> adapt Stadler's Partial Escape Analysis[1] to C2. I think there are 3 >>>>>>>> elements in it. 1) flow-sensitive escape analysis 2) lazy code motion >>>>>>>> for the allocation and initialization 3) on-the-fly scalar replacement. >>>>>>>> The most complex part is 3) and it has done by C2. I'd like to leverage >>>>>>>> that, so I come up an idea to focus only on escaped objects in the >>>>>>>> algorithm and delegate others to the existing C2 phases. Here is my RFC. 
>>>>>>>> May I get your precious time on this? >>>>>>>> >>>>>>>> https://gist.github.com/navyxliu/62a510a5c6b0245164569745d758935b#rfc-partial-escape-analysis-in-hotspot-c2 >>>>>>>> >>>>>>>> The idea is based on the following two observations. >>>>>>>> >>>>>>>> 1. Stadler's PEA can cooperate with C2 EA/SR. >>>>>>>> >>>>>>>> If an object moves to the place it is about to escape, it won't impact >>>>>>>> C2 EA/SR later. It's because it will be marked as 'GlobalEscaped'. C2 EA >>>>>>>> won't do anything for it anyway. >>>>>>>> >>>>>>>> If PEA don't touch a non-escaped object, it won't change its >>>>>>>> escapability. It can punt it to C2 EA/SR and the result is still same. >>>>>>>> >>>>>>>> >>>>>>>> 2. The original AllocationNode is either dead or scalar replaceable >>>>>>>> after Stadler's PEA. >>>>>>>> >>>>>>>> Stadler's algorithm virtualizes an allocation Node and materializes it >>>>>>>> on demand. There are 2 places to materialize it. 1) the virtual object >>>>>>>> is about to escape 2) MergeProcessor needs to merge an object and at >>>>>>>> least one of its predecessor has materialized. MergeProcessor has to >>>>>>>> materialize all virtual objects in other predecessors([1] 5.3, Merge nodes). >>>>>>>> >>>>>>>> We can prove the observation 2 using 'proof of contradiction' here. >>>>>>>> Assume the original Allocation node is neither dead nor Scalar Replaced >>>>>>>> after Stadler's PEA, and program is still correct. >>>>>>>> >>>>>>>> Program must need the original allocation node somewhere. The algorithm >>>>>>>> has deleted the original allocation node in virtualization step and >>>>>>>> never bring it back. It contradicts that the program is still correct. QED. >>>>>>>> >>>>>>>> >>>>>>>> If you're convinced, then we can leverage it. In my design, I don't >>>>>>>> virtualize the original node but just leave it there. C2 MacroExpand >>>>>>>> phase will take care of the original allocation node as long as it's >>>>>>>> either dead or scalar-replaceable. 
It never get a chance to expand. >>>>>>>> >>>>>>>> If we restrain on-the-fly scalar replacement in Stadler's PEA, we can >>>>>>>> delegate it to C2 EA/SR! There are 3 gains: >>>>>>>> >>>>>>>> 1) I don't think I can write bug-free Scalar Replacement... >>>>>>>> 2) This approach can automatically pick up C2 EA/SR improvements in the >>>>>>>> future, such as JDK-8289943. >>>>>>>> 3) If we focus only on 'escaped objects', we even don't need to deal >>>>>>>> with deoptimization. Only 'scalar replaceable' objects need to save >>>>>>>> Object states for deoptimization. Escaped objects disqualify that. >>>>>>>> >>>>>>>> [1]: Stadler, Lukas, Thomas Würthinger, and Hanspeter Mössenböck. >>>>>>>> "Partial escape analysis and scalar replacement for Java." Proceedings >>>>>>>> of Annual IEEE/ACM International Symposium on Code Generation and >>>>>>>> Optimization. 2014. >>>>>>>> >>>>>>>> thanks, >>>>>>>> --lx >>>>>>>> >>>>>> >>>> >
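As a rough illustration of the per-allocation and per-block bookkeeping quoted from the RFC in this thread, the data structures might look as follows (the map key and value types are my assumptions — the quoted pseudo-code elides its generic parameters, and the real design is defined by the paper and the RFC):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of Stadler-style PEA state tracking, following the thread above.
// Per-allocation state, identified by the allocation node's id.
class VirtualState {
    int lockCount;       // how many times the virtual object is locked
    Object[] entries;    // current values of the object's fields
}

// Per-basic-block state: which allocations are still virtual, and
// which locals/nodes alias them.
class BlockState {
    // allocation node id -> state of the still-virtual object
    final Map<Integer, VirtualState> state = new HashMap<>();
    // local variable / node id -> allocation node id it aliases
    final Map<Integer, Integer> alias = new HashMap<>();
}
```

In this shape, materialization would remove the entry from `state` while the `alias` map keeps pointing locals at the now-concrete allocation — mirroring the "materialize on escape or at merges" rule discussed above.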
------------- PR: https://git.openjdk.org/jdk/pull/8025 From dlong at openjdk.org Tue Oct 11 20:44:19 2022 From: dlong at openjdk.org (Dean Long) Date: Tue, 11 Oct 2022 20:44:19 GMT Subject: RFR: 8293531: C2: some vectorapi tests fail assert "Not monotonic" with flag -XX:TypeProfileLevel=222 In-Reply-To: References: Message-ID: On Tue, 11 Oct 2022 12:19:05 GMT, Jatin Bhateja wrote: > Problem occurs in iterative DF analysis during CCP optimization, meet operations drops the speculative types before converging participating lattice values since [include_speculative argument it receives is always set to false ](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/type.hpp#L231)where as [equality check ](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/phaseX.cpp#L1751) in the failing assertion is performed against original type still carrying the speculative type. > > To fix this, type comparison in the assertion should also be done after stripping the speculative type, with this change intermittent assertion failures in several vector API tests reported in the bug report are no longer seen. > > Kindly review and share your feedback. > > Best Regards, > Jatin Could this bug explain JDK-8295028? ------------- PR: https://git.openjdk.org/jdk/pull/10648 From kvn at openjdk.org Tue Oct 11 21:36:31 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Oct 2022 21:36:31 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v7] In-Reply-To: <1RH0_k_aEHU61kz0Razfek_Kj4OLUMDN5P3j_pmyBYI=.95fe56d7-2253-468e-a85e-16286f5d1c74@github.com> References: <1RH0_k_aEHU61kz0Razfek_Kj4OLUMDN5P3j_pmyBYI=.95fe56d7-2253-468e-a85e-16286f5d1c74@github.com> Message-ID: On Tue, 11 Oct 2022 01:19:45 GMT, Quan Anh Mai wrote: >> Hi, >> >> The current peephole mechanism has several drawbacks: >> - Can only match and remove adjacent instructions. >> - Cannot match machine ideal nodes (e.g MachSpillCopyNode). 
>> - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. >> - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. >> >> The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grain manner. >> >> The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: >> >> mov r1, r2 -> lea r1, [r2 + r3/i] >> add r1, r3/i >> >> and >> >> mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 >> shl r1, i >> >> On the added benchmarks, the transformations show positive results: >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 1200.490 ? 104.662 ns/op >> LeaPeephole.B_D_long avgt 5 1211.439 ? 30.196 ns/op >> LeaPeephole.B_I_int avgt 5 1118.831 ? 7.995 ns/op >> LeaPeephole.B_I_long avgt 5 1112.389 ? 15.838 ns/op >> LeaPeephole.I_S_int avgt 5 1262.528 ? 7.293 ns/op >> LeaPeephole.I_S_long avgt 5 1223.820 ? 17.777 ns/op >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 860.889 ? 6.089 ns/op >> LeaPeephole.B_D_long avgt 5 945.455 ? 21.603 ns/op >> LeaPeephole.B_I_int avgt 5 849.109 ? 9.809 ns/op >> LeaPeephole.B_I_long avgt 5 851.283 ? 16.921 ns/op >> LeaPeephole.I_S_int avgt 5 976.594 ? 23.004 ns/op >> LeaPeephole.I_S_long avgt 5 936.984 ? 9.601 ns/op >> >> A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently. >> >> Thank you very much. 
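For a sense of the Java shapes involved, address-style expressions of the form `base + index` or `index << scale` are the kind that can compile down to the `mov`+`add`/`shl` pairs shown above (a hypothetical illustration — the actual `LeaPeephole` benchmark bodies may differ, and whether C2 emits the pair depends on register allocation):

```java
// Hypothetical source shapes whose compiled form may contain the
// mov+add / mov+shl pairs that the peephole folds into a single lea.
// This is an illustration, not the JMH benchmark from the patch.
public class LeaShapes {
    static int baseIndex(int base, int index) {
        return base + index;   // candidate for lea r, [base + index]
    }
    static long indexScale(long index) {
        return index << 3;     // candidate for lea r, [index * 8]
    }
    public static void main(String[] args) {
        System.out.println(baseIndex(3, 4));  // 7
        System.out.println(indexScale(2L));   // 16
    }
}
```

The win comes from replacing two dependent instructions (a copy plus an arithmetic op) with one `lea` that computes the sum or shift directly into the destination register.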
> > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > rename macro guard Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/8025 From kvn at openjdk.org Tue Oct 11 21:41:12 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Oct 2022 21:41:12 GMT Subject: RFR: 8288043: Optimize FP to word/sub-word integral type conversion on X86 AVX2 platforms [v9] In-Reply-To: References: Message-ID: <1M5PNOs3QIuE-gxRT1jcgMsqxKyUeOq1UV55JP_tC2o=.a2afc3b5-0557-49cd-b492-694429e02692@github.com> On Tue, 11 Oct 2022 12:30:18 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch extends conversion optimizations added with [JDK-8287835](https://bugs.openjdk.org/browse/JDK-8287835) to optimize following floating point to integral conversions for X86 AVX2 targets:- >> * D2I , D2S, D2B, F2I , F2S, F2B >> >> In addition, it also optimizes following wide vector (64 bytes) double to integer and sub-type conversions for AVX512 targets which do not support AVX512DQ feature. >> * D2I, D2S, D2B >> >> Following are the JMH micro performance results with and without patch. 
>> >> System configuration: 40C 2S Icelake server (Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz) >> >> BENCHMARK | SIZE | BASELINE (ops/ms) | WITHOPT (ops/ms) | PERF GAIN FACTOR >> -- | -- | -- | -- | -- >> VectorFPtoIntCastOperations.microDouble128ToByte128 | 1024 | 90.603 | 92.797 | 1.024215534 >> VectorFPtoIntCastOperations.microDouble128ToByte256 | 1024 | 81.909 | 82.3 | 1.00477359 >> VectorFPtoIntCastOperations.microDouble128ToByte512 | 1024 | 26.181 | 26.244 | 1.002406325 >> VectorFPtoIntCastOperations.microDouble128ToInteger128 | 1024 | 90.74 | 2537.958 | 27.96956138 >> VectorFPtoIntCastOperations.microDouble128ToInteger256 | 1024 | 81.586 | 2429.599 | 29.7796068 >> VectorFPtoIntCastOperations.microDouble128ToInteger512 | 1024 | 19.406 | 19.61 | 1.010512213 >> VectorFPtoIntCastOperations.microDouble128ToLong128 | 1024 | 91.723 | 90.754 | 0.989435583 >> VectorFPtoIntCastOperations.microDouble128ToShort128 | 1024 | 91.766 | 1984.577 | 21.62649565 >> VectorFPtoIntCastOperations.microDouble128ToShort256 | 1024 | 81.949 | 1940.599 | 23.68056962 >> VectorFPtoIntCastOperations.microDouble128ToShort512 | 1024 | 16.468 | 16.56 | 1.005586592 >> VectorFPtoIntCastOperations.microDouble256ToByte128 | 1024 | 163.331 | 3018.351 | 18.479964 >> VectorFPtoIntCastOperations.microDouble256ToByte256 | 1024 | 148.878 | 3082.034 | 20.70174237 >> VectorFPtoIntCastOperations.microDouble256ToByte512 | 1024 | 50.108 | 51.629 | 1.030354434 >> VectorFPtoIntCastOperations.microDouble256ToInteger128 | 1024 | 159.805 | 4619.421 | 28.90661118 >> VectorFPtoIntCastOperations.microDouble256ToInteger256 | 1024 | 143.876 | 4649.642 | 32.31700909 >> VectorFPtoIntCastOperations.microDouble256ToInteger512 | 1024 | 38.127 | 38.188 | 1.001599916 >> VectorFPtoIntCastOperations.microDouble256ToLong128 | 1024 | 160.322 | 162.442 | 1.013223388 >> VectorFPtoIntCastOperations.microDouble256ToLong256 | 1024 | 141.252 | 143.01 | 1.012445841 >> VectorFPtoIntCastOperations.microDouble256ToShort128 | 1024 | 
157.717 | 3757.471 | 23.82413437 >> VectorFPtoIntCastOperations.microDouble256ToShort256 | 1024 | 143.876 | 3830.971 | 26.62689399 >> VectorFPtoIntCastOperations.microDouble256ToShort512 | 1024 | 32.061 | 32.911 | 1.026511962 >> VectorFPtoIntCastOperations.microFloat128ToByte128 | 1024 | 146.599 | 4002.967 | 27.30555461 >> VectorFPtoIntCastOperations.microFloat128ToByte256 | 1024 | 136.99 | 3938.799 | 28.75245638 >> VectorFPtoIntCastOperations.microFloat128ToByte512 | 1024 | 51.561 | 50.284 | 0.975233219 >> VectorFPtoIntCastOperations.microFloat128ToInteger128 | 1024 | 5933.565 | 5361.472 | 0.903583596 >> VectorFPtoIntCastOperations.microFloat128ToInteger256 | 1024 | 5079.564 | 5062.046 | 0.996551279 >> VectorFPtoIntCastOperations.microFloat128ToInteger512 | 1024 | 37.101 | 38.419 | 1.035524649 >> VectorFPtoIntCastOperations.microFloat128ToLong128 | 1024 | 145.863 | 145.362 | 0.99656527 >> VectorFPtoIntCastOperations.microFloat128ToLong256 | 1024 | 131.159 | 133.154 | 1.015210546 >> VectorFPtoIntCastOperations.microFloat128ToShort128 | 1024 | 145.966 | 4150.039 | 28.4315457 >> VectorFPtoIntCastOperations.microFloat128ToShort256 | 1024 | 134.703 | 4566.589 | 33.90116775 >> VectorFPtoIntCastOperations.microFloat128ToShort512 | 1024 | 31.878 | 30.867 | 0.968285338 >> VectorFPtoIntCastOperations.microFloat256ToByte128 | 1024 | 237.841 | 6292.051 | 26.4548627 >> VectorFPtoIntCastOperations.microFloat256ToByte256 | 1024 | 222.041 | 6292.748 | 28.34047766 >> VectorFPtoIntCastOperations.microFloat256ToByte512 | 1024 | 92.073 | 88.981 | 0.966417951 >> VectorFPtoIntCastOperations.microFloat256ToInteger128 | 1024 | 11471.121 | 10269.636 | 0.895260019 >> VectorFPtoIntCastOperations.microFloat256ToInteger256 | 1024 | 10729.816 | 10105.92 | 0.941853989 >> VectorFPtoIntCastOperations.microFloat256ToInteger512 | 1024 | 68.328 | 70.005 | 1.024543379 >> VectorFPtoIntCastOperations.microFloat256ToLong128 | 1024 | 247.101 | 248.571 | 1.005948984 >> 
VectorFPtoIntCastOperations.microFloat256ToLong256 | 1024 | 225.74 | 223.987 | 0.992234429 >> VectorFPtoIntCastOperations.microFloat256ToLong512 | 1024 | 76.39 | 76.187 | 0.997342584 >> VectorFPtoIntCastOperations.microFloat256ToShort128 | 1024 | 233.196 | 8202.179 | 35.17289748 >> VectorFPtoIntCastOperations.microFloat256ToShort256 | 1024 | 220.75 | 7781.073 | 35.24834881 >> VectorFPtoIntCastOperations.microFloat256ToShort512 | 1024 | 58.143 | 55.633 | 0.956830573 >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8288043 > - 8288043: Review comments resolutions. > - 8288043: Adding descriptive comments. > - 8288043: cost adjustments for loop body size estimation. > - 8288043: Extending exiting regressions with more cases. > - 8288043: Code re-factoring. > - 8288043: Some mainline merge realted cleanups. > - 8288043: Adding a descriptive comment for removing explicit scratch registers needed to load stub constants. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8288043 > - 8288043: Adding a descriptive comment. > - ... and 3 more: https://git.openjdk.org/jdk/compare/9d116ec1...6fb1a5d9 My testing passed. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9748 From xxinliu at amazon.com Tue Oct 11 22:00:22 2022 From: xxinliu at amazon.com (Liu, Xin) Date: Tue, 11 Oct 2022 15:00:22 -0700 Subject: [External] : Re: RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: <6d5c2aa5-c684-bc42-765d-ed116d3ef43c@oracle.com> References: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> <94A2737B-0FDE-4159-827F-15B6ACE1BC42@oracle.com> <114af950-f6b6-7e4a-8ac0-3da99bd40297@amazon.com> <2f29160c-7368-7c11-924e-a626e42c3aa2@amazon.com> <6d5c2aa5-c684-bc42-765d-ed116d3ef43c@oracle.com> Message-ID: hi, Vladimir Ivanov, Thanks. I will include your example too. I would like to verify the idea that we can delegate the obsolete object to C2 EA/SR. Here is even more general form of your example. I add 2 safepoints here. they are before and after the idiom respectively. void test(boolean unlikely_condition) { MyClass obj = new MyClass(); safepoint1(); if (unlikely_condition) { doCall(obj); // escape point; not inlinined } safepoint2(); } Here is what I expect to see after PEA. Materialization will take place at 2 places. I use obj1 and obj2 to highlight them. please note that I intentionally clone objects in PEA materialization. They eclipse the live range of the original obj. I refer to it as 'obsolete'. void test(boolean unlikely_condition) { MyClass obj = new MyClass(); safepoint1(); if (unlikely_condition) { obj1 = new MyClass(); doCall(obj1); // escape point; not inlinined } obj = merge(obj2=new MyClass(), obj1); safepoint2(); } I expect C2 EA/SR to pick up the obsolete 'obj'. At SafePoint1, it's not dead. C2 will convert it to SafePointScalarObjectNode. If I can prove this idea, it means that we can delegate Scalar Replacement to C2 EA/SR. We may get away with scalar replacement part in PEA implementation! thanks, --lx On 10/11/22 1:00 PM, Vladimir Ivanov wrote: > CAUTION: This email originated from outside of the organization. 
Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > I'd suggest an even simpler case to start with: > > void test(boolean unlikely_condition) { > MyClass obj = new MyClass(); > if (unlikely_condition) { > doCall(obj); // escape point; not inlinined > } > } > > and try to turn it into: > > void test(boolean unlikely_condition) { > if (unlikely_condition) { > doCall(new MyClass()); // escape point; not inlinined > } > } > > It allows you to not bother about JVM state at all, because there's > already a valid one captured by the call. > > Best regards, > Vladimir Ivanov > > On 10/11/22 12:12, Liu, Xin wrote: >> Hi, Vladimir and Igor, >> >> Thanks you for your comments. >> >> I just start it. My first target is 'Figure-1' in the RFC. There are >> only 3 blocks and no merge and no alias. Class Object is so trivial that >> we even don't need to initialize it (no field to initialize). >> >> I expect to get code like this after parser with a new flag. >> >> private Object _cache; >> public void foo(boolean cond) { >> Object x = new Object(); >> >> if (cond) { >> Object x1 = new Object(); (clone object right before escapement) >> _cache = x1; >> } >> } >> >> I know it's over-simplified. I am going to test whether it is possible >> to implement the algorithm in parser and how C2 EA/SR interacts with the >> obsolete object x. >> >> Like you said, I am going to focus on ordinary objects first. Either >> good or bad, I will update my progress and results. I will see how to >> leverage Cesar's test and microbenchmark too. That's my intention too. >> >> thanks, >> --lx >> >> On 10/11/22 8:44 AM, Vladimir Kozlov wrote: >>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >>> >>> >>> >>> Hi Liu, >>> >>> To clarify, I think your proposal is reasonable and we would like you to continue to work on it. 
>>> Can you share details of its current status? >>> >>> Igor and I commented about C2 specifics you need to take into account. >>> >>> I can suggest to start with simple things: only objects instances. Arrays can be covered later. >>> >>> You can use the test from Cesar's PR [1] to test your implementation and extend it with additional cases. >>> >>> Thanks, >>> Vladimir K >>> >>> [1] https://github.com/openjdk/jdk/pull/9073 >>> >>> On 10/7/22 3:26 PM, Liu, Xin wrote: >>>> Hi, Igor and Vladimir, >>>> >>>> I am not inventing anything new. All I am thinking is how to adapt >>>> Stadler's algorithm to C2. All innovation belong to the author. >>>> >>>> Figure-3 of my RFC is a copy of Listing-7 in his paper. Allow me to >>>> repeat his data structure here. I drop "class Id" because I think I can >>>> use AllocationNode pointer or even node idx instead. >>>> >>>> // this is per allocation, identified by 'Id'. >>>> class VirtualState: extends ObjectState { >>>> int lockCount; >>>> Node[] entries; >>>> }; >>>> >>>> // this is per basic-block >>>> class State { >>>> Map state; >>>> Map alias; >>>> }; >>>> >>>> In basic block, PEA keeps tracking the allocation state of an object >>>> using VirtualState. In his paper, Figure-4 (b) and (e) depict how the >>>> algorithm tracks stores. To get flow-sensitive information, Stadler >>>> iterates the scheduled nodes in a basic block. I propose to iterate >>>> bytecodes within a basic block. >>>> >>>>> when you rematerialize the object, it consumes the current updated >>>> values to construct it. How to you intend to track those? >>>>>> Yes, you either track stores in Parser or do what current C2 EA does >>>> and create unique memory slices for VirtualObject. >>>> >>>> I plan to follow suit and track stores in parser! I also need to create >>>> a unique memory slice when I have to materialize a virtual object. This >>>> is for InitializeNode and I need to initialize the object to the >>>> cumulative state. 
>>>> >>>> thanks, >>>> --lx >>>> >>>> >>>> >>>> >>>> On 10/7/22 1:21 PM, Vladimir Kozlov wrote: >>>>> >>>>> On 10/7/22 10:37 AM, Igor Veresov wrote: >>>>>> The major difference between Graal and C2 is that Graal captures the state at side effects and C2 captures the state at deopt points. That allows Graal to deduce state at any time, including when it needs to insert a rematerializing allocation during PEA. So, with C2 you have to either do everything in the parser as you are proposing or do the same thing as Graal and at least capture the state for stores. Having a state different from the original allocation point is ok. Both Graal and C2 would throw OOMs from places that could be far from the original point because of the EA. >>>>>> >>>>>> I think you also have to track the values of all of the object components, right? So when you rematerialize the object, it consumes the current updated values to construct it. How do you intend to track those? >>>>> >>>>> Yes, you either track stores in the Parser or do what current C2 EA does and create unique memory slices for the VirtualObject. >>>>> >>>>> Current C2 EA [1] looks for the latest stores (or initial values) to the object (which has a unique Allocation node id) >>>>> starting from the Safepoint memory input when we replace Allocate with SafePointScalarObject. >>>>> >>>>> You would need to use the VirtualObject node id as a unique instance id. And you need to create separate memory slices for it >>>>> as we do in EA for the Allocation node. >>>>> >>>>> Vladimir K >>>>> >>>>> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/macro.cpp#L452 >>>>> >>>>>> >>>>>> igor >>>>>> >>>>>>> On Oct 6, 2022, at 5:09 PM, Liu, Xin wrote: >>>>>>> >>>>>>> hi, Igor, >>>>>>> >>>>>>> You are right.
Cloning the JVMState of original Allocation Node isn't >>>>>>> the correct behavior. I need the JVMState right at materialization. I >>>>>>> think it is available because we are in parser. For 2 places of >>>>>>> materialization: >>>>>>> 1) we are handling the bytecode which causes the object to escape. It's >>>>>>> probably putfield/return/invoke. Current JVMState it is. >>>>>>> 2) we are in MergeProcessor. We need to materialize a virtual object in >>>>>>> its predecessors. We can extract the exiting JVMState from the >>>>>>> predecessor Block. >>>>>>> >>>>>>> I just realize maybe that's the one of the reasons Graal saves >>>>>>> 'FrameState' at store nodes. Graal needs to revisit the 'FrameState' >>>>>>> when its PEA phase does materialization in high-tier. >>>>>>> >>>>>>> Apart from safepoint, there's one corner case bothering me. JLS says >>>>>>> that creation of a class instance may throw an >>>>>>> OOME.(https://docs.oracle.com/javase/specs/jls/se19/html/jls-15.html#jls-15.9.4) >>>>>>> >>>>>>> " >>>>>>> space is allocated for the new class instance. If there is insufficient >>>>>>> space to allocate the object, evaluation of the class instance creation >>>>>>> expression completes abruptly by throwing an OutOfMemoryError. >>>>>>> " >>>>>>> >>>>>>> and it's cross-referenced by bytecode new in JVMS >>>>>>> https://docs.oracle.com/javase/specs/jvms/se19/html/jvms-6.html#jvms-6.5.new >>>>>>> >>>>>>> If we have moved the Allocation Node and JVM happens to run out of >>>>>>> memory, the first frame of stacktrace will drift a little bit, right? >>>>>>> The bci and source linenum will be wrong. Does it matter? I can't >>>>>>> imagine that user's programs rely on this information. >>>>>>> >>>>>>> I think it's possible to amend this bci/line number in JVMState level. I >>>>>>> will leave it as an open question and revisit it later. >>>>>>> >>>>>>> Do I understand your concern? if it makes sense to you, I will update >>>>>>> the RFC doc. 
>>>>>>> >>>>>>> thanks, >>>>>>> --lx >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 10/6/22 3:00 PM, Igor Veresov wrote: >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> You say that when you materialize the clone you plan to have the same JVM state as the original allocation. How is that possible in a general case? There can be arbitrary changes of state between the original allocation point and where the clone materializes. >>>>>>>> >>>>>>>> Igor >>>>>>>> >>>>>>>>> On Oct 6, 2022, at 10:42 AM, Liu, Xin wrote: >>>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> We would like to pursue PEA in HotSpot. I spent time thinking how to >>>>>>>>> adapt Stadler's Partial Escape Analysis[1] to C2. I think there are 3 >>>>>>>>> elements in it. 1) flow-sensitive escape analysis 2) lazy code motion >>>>>>>>> for the allocation and initialization 3) on-the-fly scalar replacement. >>>>>>>>> The most complex part is 3) and it has been done by C2. I'd like to leverage >>>>>>>>> that, so I came up with an idea to focus only on escaped objects in the >>>>>>>>> algorithm and delegate others to the existing C2 phases. Here is my RFC. >>>>>>>>> May I get your precious time on this? >>>>>>>>> >>>>>>>>> https://gist.github.com/navyxliu/62a510a5c6b0245164569745d758935b#rfc-partial-escape-analysis-in-hotspot-c2 >>>>>>>>> >>>>>>>>> The idea is based on the following two observations. >>>>>>>>> >>>>>>>>> 1. Stadler's PEA can cooperate with C2 EA/SR. >>>>>>>>> >>>>>>>>> If an object moves to the place it is about to escape, it won't impact >>>>>>>>> C2 EA/SR later. It's because it will be marked as 'GlobalEscaped'. C2 EA >>>>>>>>> won't do anything for it anyway. >>>>>>>>> >>>>>>>>> If PEA doesn't touch a non-escaped object, it won't change its >>>>>>>>> escapability. It can punt it to C2 EA/SR and the result is still the same.
>>>>>>>>> >>>>>>>>> >>>>>>>>> 2. The original AllocationNode is either dead or scalar replaceable >>>>>>>>> after Stadler's PEA. >>>>>>>>> >>>>>>>>> Stadler's algorithm virtualizes an allocation Node and materializes it >>>>>>>>> on demand. There are 2 places to materialize it. 1) the virtual object >>>>>>>>> is about to escape 2) MergeProcessor needs to merge an object and at >>>>>>>>> least one of its predecessor has materialized. MergeProcessor has to >>>>>>>>> materialize all virtual objects in other predecessors([1] 5.3, Merge nodes). >>>>>>>>> >>>>>>>>> We can prove the observation 2 using 'proof of contradiction' here. >>>>>>>>> Assume the original Allocation node is neither dead nor Scalar Replaced >>>>>>>>> after Stadler's PEA, and program is still correct. >>>>>>>>> >>>>>>>>> Program must need the original allocation node somewhere. The algorithm >>>>>>>>> has deleted the original allocation node in virtualization step and >>>>>>>>> never bring it back. It contradicts that the program is still correct. QED. >>>>>>>>> >>>>>>>>> >>>>>>>>> If you're convinced, then we can leverage it. In my design, I don't >>>>>>>>> virtualize the original node but just leave it there. C2 MacroExpand >>>>>>>>> phase will take care of the original allocation node as long as it's >>>>>>>>> either dead or scalar-replaceable. It never get a chance to expand. >>>>>>>>> >>>>>>>>> If we restrain on-the-fly scalar replacement in Stadler's PEA, we can >>>>>>>>> delegate it to C2 EA/SR! There are 3 gains: >>>>>>>>> >>>>>>>>> 1) I don't think I can write bug-free Scalar Replacement... >>>>>>>>> 2) This approach can automatically pick up C2 EA/SR improvements in the >>>>>>>>> future, such as JDK-8289943. >>>>>>>>> 3) If we focus only on 'escaped objects', we even don't need to deal >>>>>>>>> with deoptimization. Only 'scalar replaceable' objects need to save >>>>>>>>> Object states for deoptimization. Escaped objects disqualify that. 
>>>>>>>>> [1]: Stadler, Lukas, Thomas Würthinger, and Hanspeter Mössenböck. >>>>>>>>> "Partial escape analysis and scalar replacement for Java." Proceedings >>>>>>>>> of Annual IEEE/ACM International Symposium on Code Generation and >>>>>>>>> Optimization. 2014. >>>>>>>>> >>>>>>>>> thanks, >>>>>>>>> --lx >>>>>>>>> >>>>>>> >>>>>> From vladimir.kozlov at oracle.com Tue Oct 11 22:12:49 2022 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 11 Oct 2022 15:12:49 -0700 Subject: RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: References: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> <94A2737B-0FDE-4159-827F-15B6ACE1BC42@oracle.com> <114af950-f6b6-7e4a-8ac0-3da99bd40297@amazon.com> <2f29160c-7368-7c11-924e-a626e42c3aa2@amazon.com> <6d5c2aa5-c684-bc42-765d-ed116d3ef43c@oracle.com> Message-ID: Yes, you should delegate Scalar Replacement to the existing EA in C2. As you wrote in your proposal, PEA should be used for escaping cases. Your test with safepoints should be your next step after you have resolved Vladimir's case without safepoints. You should show that your implementation can rematerialize an object at any escape site. Don't worry about rematerialization during deoptimization for now. Also, in your test there should be no merge at safepoint2 because `obj` is "not alive" (not referenced) anymore. Thanks, Vladimir K On 10/11/22 3:00 PM, Liu, Xin wrote: > hi, Vladimir Ivanov, > > Thanks. I will include your example too. > > I would like to verify the idea that we can delegate the obsolete object > to C2 EA/SR. > > Here is even more general form of your example.
I add 2 safepoints here. > they are before and after the idiom respectively. > > void test(boolean unlikely_condition) { > MyClass obj = new MyClass(); > safepoint1(); > if (unlikely_condition) { > doCall(obj); // escape point; not inlinined > } > safepoint2(); > } > > Here is what I expect to see after PEA. Materialization will take place > at 2 places. I use obj1 and obj2 to highlight them. please note that I > intentionally clone objects in PEA materialization. They eclipse the > live range of the original obj. I refer to it as 'obsolete'. > > void test(boolean unlikely_condition) { > MyClass obj = new MyClass(); > safepoint1(); > if (unlikely_condition) { > obj1 = new MyClass(); > doCall(obj1); // escape point; not inlinined > } > obj = merge(obj2=new MyClass(), obj1); > safepoint2(); > } > > I expect C2 EA/SR to pick up the obsolete 'obj'. > At SafePoint1, it's not dead. C2 will convert it to > SafePointScalarObjectNode. If I can prove this idea, it means that we > can delegate Scalar Replacement to C2 EA/SR. We may get away with scalar > replacement part in PEA implementation! > > thanks, > --lx > > > > On 10/11/22 1:00 PM, Vladimir Ivanov wrote: >> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >> >> >> >> I'd suggest an even simpler case to start with: >> >> void test(boolean unlikely_condition) { >> MyClass obj = new MyClass(); >> if (unlikely_condition) { >> doCall(obj); // escape point; not inlinined >> } >> } >> >> and try to turn it into: >> >> void test(boolean unlikely_condition) { >> if (unlikely_condition) { >> doCall(new MyClass()); // escape point; not inlinined >> } >> } >> >> It allows you to not bother about JVM state at all, because there's >> already a valid one captured by the call. 
>> >> Best regards, >> Vladimir Ivanov >> >> On 10/11/22 12:12, Liu, Xin wrote: >>> Hi, Vladimir and Igor, >>> >>> Thanks you for your comments. >>> >>> I just start it. My first target is 'Figure-1' in the RFC. There are >>> only 3 blocks and no merge and no alias. Class Object is so trivial that >>> we even don't need to initialize it (no field to initialize). >>> >>> I expect to get code like this after parser with a new flag. >>> >>> private Object _cache; >>> public void foo(boolean cond) { >>> Object x = new Object(); >>> >>> if (cond) { >>> Object x1 = new Object(); (clone object right before escapement) >>> _cache = x1; >>> } >>> } >>> >>> I know it's over-simplified. I am going to test whether it is possible >>> to implement the algorithm in parser and how C2 EA/SR interacts with the >>> obsolete object x. >>> >>> Like you said, I am going to focus on ordinary objects first. Either >>> good or bad, I will update my progress and results. I will see how to >>> leverage Cesar's test and microbenchmark too. That's my intention too. >>> >>> thanks, >>> --lx >>> >>> On 10/11/22 8:44 AM, Vladimir Kozlov wrote: >>>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >>>> >>>> >>>> >>>> Hi Liu, >>>> >>>> To clarify, I think your proposal is reasonable and we would like you to continue to work on it. >>>> Can you share details of its current status? >>>> >>>> Igor and I commented about C2 specifics you need to take into account. >>>> >>>> I can suggest to start with simple things: only objects instances. Arrays can be covered later. >>>> >>>> You can use the test from Cesar's PR [1] to test your implementation and extend it with additional cases. >>>> >>>> Thanks, >>>> Vladimir K >>>> >>>> [1] https://github.com/openjdk/jdk/pull/9073 >>>> >>>> On 10/7/22 3:26 PM, Liu, Xin wrote: >>>>> Hi, Igor and Vladimir, >>>>> >>>>> I am not inventing anything new. 
All I am thinking is how to adapt >>>>> Stadler's algorithm to C2. All innovation belong to the author. >>>>> >>>>> Figure-3 of my RFC is a copy of Listing-7 in his paper. Allow me to >>>>> repeat his data structure here. I drop "class Id" because I think I can >>>>> use AllocationNode pointer or even node idx instead. >>>>> >>>>> // this is per allocation, identified by 'Id'. >>>>> class VirtualState: extends ObjectState { >>>>> int lockCount; >>>>> Node[] entries; >>>>> }; >>>>> >>>>> // this is per basic-block >>>>> class State { >>>>> Map state; >>>>> Map alias; >>>>> }; >>>>> >>>>> In basic block, PEA keeps tracking the allocation state of an object >>>>> using VirtualState. In his paper, Figure-4 (b) and (e) depict how the >>>>> algorithm tracks stores. To get flow-sensitive information, Stadler >>>>> iterates the scheduled nodes in a basic block. I propose to iterate >>>>> bytecodes within a basic block. >>>>> >>>>>> when you rematerialize the object, it consumes the current updated >>>>> values to construct it. How to you intend to track those? >>>>>>> Yes, you either track stores in Parser or do what current C2 EA does >>>>> and create unique memory slices for VirtualObject. >>>>> >>>>> I plan to follow suit and track stores in parser! I also need to create >>>>> a unique memory slice when I have to materialize a virtual object. This >>>>> is for InitializeNode and I need to initialize the object to the >>>>> cumulative state. >>>>> >>>>> thanks, >>>>> --lx >>>>> >>>>> >>>>> >>>>> >>>>> On 10/7/22 1:21 PM, Vladimir Kozlov wrote: >>>>>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >>>>>> >>>>>> >>>>>> >>>>>> On 10/7/22 10:37 AM, Igor Veresov wrote: >>>>>>> The major difference between Graal and C2 is that graal captures the state at side effects and C2 captures the state at deopt points. 
That allows Graal to deduce state at any time, including when it needs to insert a rematerializing allocation during PEA. So, with C2 you have to either do everything in the parser as you are proposing or do the same thing as Graal and at least capture the state for stores. Having a state different from the original allocation point is ok. Both Graal and C2 would throw OOMs from place that could be far from the original point because of the EA. >>>>>>> >>>>>>> I think you also have to track the values of all of the object components, right? So when you rematerialize the object, it consumes the current updated values to construct it. How to you intend to track those? >>>>>> >>>>>> Yes, you either track stores in Parser or do what current C2 EA does and create unique memory slices for VirtualObject. >>>>>> >>>>>> Current C2 EA [1] looks for latest stores (or initial values) to the object (which has unique Aloccation node id) >>>>>> staring from Safepoint memory input when we replace Allocate with SafePointScalarObject. >>>>>> >>>>>> You would need to use VirtualObject node id as unique instance id. And you need to create separate memory slices for it >>>>>> as we do in EA for Allocation node. >>>>>> >>>>>> Vladimir K >>>>>> >>>>>> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/macro.cpp#L452 >>>>>> >>>>>>> >>>>>>> igor >>>>>>> >>>>>>>> On Oct 6, 2022, at 5:09 PM, Liu, Xin wrote: >>>>>>>> >>>>>>>> hi, Ignor, >>>>>>>> >>>>>>>> You are right. Cloning the JVMState of original Allocation Node isn't >>>>>>>> the correct behavior. I need the JVMState right at materialization. I >>>>>>>> think it is available because we are in parser. For 2 places of >>>>>>>> materialization: >>>>>>>> 1) we are handling the bytecode which causes the object to escape. It's >>>>>>>> probably putfield/return/invoke. Current JVMState it is. >>>>>>>> 2) we are in MergeProcessor. We need to materialize a virtual object in >>>>>>>> its predecessors. 
We can extract the exiting JVMState from the >>>>>>>> predecessor Block. >>>>>>>> >>>>>>>> I just realize maybe that's the one of the reasons Graal saves >>>>>>>> 'FrameState' at store nodes. Graal needs to revisit the 'FrameState' >>>>>>>> when its PEA phase does materialization in high-tier. >>>>>>>> >>>>>>>> Apart from safepoint, there's one corner case bothering me. JLS says >>>>>>>> that creation of a class instance may throw an >>>>>>>> OOME.(https://docs.oracle.com/javase/specs/jls/se19/html/jls-15.html#jls-15.9.4) >>>>>>>> >>>>>>>> " >>>>>>>> space is allocated for the new class instance. If there is insufficient >>>>>>>> space to allocate the object, evaluation of the class instance creation >>>>>>>> expression completes abruptly by throwing an OutOfMemoryError. >>>>>>>> " >>>>>>>> >>>>>>>> and it's cross-referenced by bytecode new in JVMS >>>>>>>> https://docs.oracle.com/javase/specs/jvms/se19/html/jvms-6.html#jvms-6.5.new >>>>>>>> >>>>>>>> If we have moved the Allocation Node and JVM happens to run out of >>>>>>>> memory, the first frame of stacktrace will drift a little bit, right? >>>>>>>> The bci and source linenum will be wrong. Does it matter? I can't >>>>>>>> imagine that user's programs rely on this information. >>>>>>>> >>>>>>>> I think it's possible to amend this bci/line number in JVMState level. I >>>>>>>> will leave it as an open question and revisit it later. >>>>>>>> >>>>>>>> Do I understand your concern? if it makes sense to you, I will update >>>>>>>> the RFC doc. >>>>>>>> >>>>>>>> thanks, >>>>>>>> --lx >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 10/6/22 3:00 PM, Igor Veresov wrote: >>>>>>>>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> You say that when you materialize the clone you plan to have the same jvm state as the original allocation. 
How is that possible in a general case? There can be arbitrary changes of state between the original allocation point and where the clone materializes. >>>>>>>>> >>>>>>>>> Igor >>>>>>>>> >>>>>>>>>> On Oct 6, 2022, at 10:42 AM, Liu, Xin wrote: >>>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> We would like to pursue PEA in HotSpot. I spent time thinking about how to >>>>>>>>>> adapt Stadler's Partial Escape Analysis[1] to C2. I think there are 3 >>>>>>>>>> elements in it. 1) flow-sensitive escape analysis 2) lazy code motion >>>>>>>>>> for the allocation and initialization 3) on-the-fly scalar replacement. >>>>>>>>>> The most complex part is 3) and it has been done by C2. I'd like to leverage >>>>>>>>>> that, so I came up with an idea to focus only on escaped objects in the >>>>>>>>>> algorithm and delegate others to the existing C2 phases. Here is my RFC. >>>>>>>>>> May I get your precious time on this? >>>>>>>>>> >>>>>>>>>> https://gist.github.com/navyxliu/62a510a5c6b0245164569745d758935b#rfc-partial-escape-analysis-in-hotspot-c2 >>>>>>>>>> >>>>>>>>>> The idea is based on the following two observations. >>>>>>>>>> >>>>>>>>>> 1. Stadler's PEA can cooperate with C2 EA/SR. >>>>>>>>>> >>>>>>>>>> If an object moves to the place it is about to escape, it won't impact >>>>>>>>>> C2 EA/SR later. It's because it will be marked as 'GlobalEscaped'. C2 EA >>>>>>>>>> won't do anything for it anyway. >>>>>>>>>> >>>>>>>>>> If PEA doesn't touch a non-escaped object, it won't change its >>>>>>>>>> escapability. It can punt it to C2 EA/SR and the result is still the same. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> 2. The original AllocationNode is either dead or scalar replaceable >>>>>>>>>> after Stadler's PEA. >>>>>>>>>> >>>>>>>>>> Stadler's algorithm virtualizes an allocation Node and materializes it >>>>>>>>>> on demand. There are 2 places to materialize it. 
1) the virtual object >>>>>>>>>> is about to escape 2) MergeProcessor needs to merge an object and at >>>>>>>>>> least one of its predecessors has materialized. MergeProcessor has to >>>>>>>>>> materialize all virtual objects in other predecessors([1] 5.3, Merge nodes). >>>>>>>>>> >>>>>>>>>> We can prove observation 2 using 'proof by contradiction' here. >>>>>>>>>> Assume the original Allocation node is neither dead nor Scalar Replaced >>>>>>>>>> after Stadler's PEA, and the program is still correct. >>>>>>>>>> >>>>>>>>>> The program must need the original allocation node somewhere. The algorithm >>>>>>>>>> has deleted the original allocation node in the virtualization step and >>>>>>>>>> never brings it back. It contradicts that the program is still correct. QED. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> If you're convinced, then we can leverage it. In my design, I don't >>>>>>>>>> virtualize the original node but just leave it there. The C2 MacroExpand >>>>>>>>>> phase will take care of the original allocation node as long as it's >>>>>>>>>> either dead or scalar-replaceable. It never gets a chance to expand. >>>>>>>>>> >>>>>>>>>> If we restrain on-the-fly scalar replacement in Stadler's PEA, we can >>>>>>>>>> delegate it to C2 EA/SR! There are 3 gains: >>>>>>>>>> >>>>>>>>>> 1) I don't think I can write bug-free Scalar Replacement... >>>>>>>>>> 2) This approach can automatically pick up C2 EA/SR improvements in the >>>>>>>>>> future, such as JDK-8289943. >>>>>>>>>> 3) If we focus only on 'escaped objects', we don't even need to deal >>>>>>>>>> with deoptimization. Only 'scalar replaceable' objects need to save >>>>>>>>>> Object states for deoptimization. Escaped objects disqualify that. >>>>>>>>>> >>>>>>>>>> [1]: Stadler, Lukas, Thomas Würthinger, and Hanspeter Mössenböck. >>>>>>>>>> "Partial escape analysis and scalar replacement for Java." Proceedings >>>>>>>>>> of the Annual IEEE/ACM International Symposium on Code Generation and >>>>>>>>>> Optimization. 2014. 
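[Archive editor's note: to make the kind of allocation discussed in this thread concrete, here is a minimal, hypothetical Java sketch of a partially escaping object. The class and names are illustrative only, not taken from the RFC: the allocation escapes on exactly one path, so Stadler-style PEA can virtualize it and materialize it only at the escape point, leaving the common path allocation-free.]

```java
public class PartialEscape {
    static Object sink;                  // globally reachable: a store here is an escape

    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static int compute(boolean rare) {
        Point p = new Point(1, 2);       // candidate: PEA virtualizes this allocation
        if (rare) {
            sink = p;                    // escape point: PEA materializes p here,
                                         // using the JVM state at THIS bytecode
            return -1;
        }
        return p.x + p.y;                // non-escaping path: fields are scalar
                                         // replaced; no allocation happens here
    }

    public static void main(String[] args) {
        System.out.println(compute(false)); // prints 3
    }
}
```

[On the common `compute(false)` path the object never exists at runtime; only the rare path pays for the allocation, which is the whole point of moving it to the escape site.]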
>>>>>>>>>> >>>>>>>>>> thanks, >>>>>>>>>> --lx >>>>>>>>>> >>>>>>>> >>>>>> > From qamai at openjdk.org Tue Oct 11 23:07:05 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 11 Oct 2022 23:07:05 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v7] In-Reply-To: <1RH0_k_aEHU61kz0Razfek_Kj4OLUMDN5P3j_pmyBYI=.95fe56d7-2253-468e-a85e-16286f5d1c74@github.com> References: <1RH0_k_aEHU61kz0Razfek_Kj4OLUMDN5P3j_pmyBYI=.95fe56d7-2253-468e-a85e-16286f5d1c74@github.com> Message-ID: <_V17jLLXsqUNAMDg00qqvq0QVDU7gyiBrP3lyGVQggA=.5beb25ec-4216-42c6-85a8-4d1dc1825abd@github.com> On Tue, 11 Oct 2022 01:19:45 GMT, Quan Anh Mai wrote: >> Hi, >> >> The current peephole mechanism has several drawbacks: >> - Can only match and remove adjacent instructions. >> - Cannot match machine ideal nodes (e.g MachSpillCopyNode). >> - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. >> - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. >> >> The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grain manner. >> >> The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: >> >> mov r1, r2 -> lea r1, [r2 + r3/i] >> add r1, r3/i >> >> and >> >> mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 >> shl r1, i >> >> On the added benchmarks, the transformations show positive results: >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 1200.490 ? 104.662 ns/op >> LeaPeephole.B_D_long avgt 5 1211.439 ? 30.196 ns/op >> LeaPeephole.B_I_int avgt 5 1118.831 ? 7.995 ns/op >> LeaPeephole.B_I_long avgt 5 1112.389 ? 
15.838 ns/op >> LeaPeephole.I_S_int avgt 5 1262.528 ? 7.293 ns/op >> LeaPeephole.I_S_long avgt 5 1223.820 ? 17.777 ns/op >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 860.889 ? 6.089 ns/op >> LeaPeephole.B_D_long avgt 5 945.455 ? 21.603 ns/op >> LeaPeephole.B_I_int avgt 5 849.109 ? 9.809 ns/op >> LeaPeephole.B_I_long avgt 5 851.283 ? 16.921 ns/op >> LeaPeephole.I_S_int avgt 5 976.594 ? 23.004 ns/op >> LeaPeephole.I_S_long avgt 5 936.984 ? 9.601 ns/op >> >> A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > rename macro guard Thanks a lot ------------- PR: https://git.openjdk.org/jdk/pull/8025 From vlivanov at openjdk.org Tue Oct 11 23:16:16 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 11 Oct 2022 23:16:16 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v13] In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 16:50:28 GMT, Cesar Soares Lucas wrote: >> Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? >> >> The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: >> 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). >> 2) Scalar Replace the incoming allocations to the RAM node. >> 3) Scalar Replace the RAM node itself. 
>> >> There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: >> >> - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. >> >> These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: >> >> - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. >> - The way I check if there is an incoming Allocate node to the original Phi node. >> - The way I check if there is no store to the merged objects after they are merged. >> >> Testing: >> - Windows/Linux/MAC fastdebug/release >> - hotspot_all >> - tier1 >> - Renaissance >> - dacapo >> - new IR-based tests > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Fix x86 tests. Additional high-level comments. (1) Overall, I don't see (yet) enough motivation to justify the addition of `ReducedAllocationMerge`. I'd prefer to see the new node go away. `ReducedAllocationMerge` nodes are short-lived. As I can infer from the code, they are used solely for the purpose of caching field information (in addition to marking that the original phi satisfied some requirements). Have you considered using the existing `ConnectionGraph` instance for bookkeeping purposes? It's available during IGVN and macro expansion. Also, I believe you face some ideal graph inconsistencies because you capture information too early (before `split_unique_types` and the following IGVN pass; and previous allocation eliminations during `eliminate_macro_nodes()` may contribute to that). (2) Following up on my earlier question about interactions with `split_unique_types()`, I'm worried that you remove the corresponding `LocalVar`s from the ConnectionGraph and introduce unique memory slices. 
I'd feel much more confident in the correctness if you split slices for unions of interacting allocations instead. ------------- PR: https://git.openjdk.org/jdk/pull/9073 From vlivanov at openjdk.org Tue Oct 11 23:25:18 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 11 Oct 2022 23:25:18 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v13] In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 16:50:28 GMT, Cesar Soares Lucas wrote: >> Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? >> >> The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: >> 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). >> 2) Scalar Replace the incoming allocations to the RAM node. >> 3) Scalar Replace the RAM node itself. >> >> There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: >> >> - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. >> >> These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: >> >> - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. >> - The way I check if there is an incoming Allocate node to the original Phi node. >> - The way I check if there is no store to the merged objects after they are merged. 
>> >> Testing: >> - Windows/Linux/MAC fastdebug/release >> - hotspot_all >> - tier1 >> - Renaissance >> - dacapo >> - new IR-based tests > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Fix x86 tests. Minor comments. No need to act until we agree on high-level aspects of the patch. src/hotspot/share/opto/escape.cpp line 403: > 401: int ConnectionGraph::reduce_allocation_merges() { > 402: Unique_Node_List ideal_nodes; > 403: ideal_nodes.map(_compile->live_nodes(), NULL); You do a full pass over the whole graph to enumerate `Phi`s. Don't you achieve the same when enumerating `LocalVar`s from the CG? src/hotspot/share/opto/escape.cpp line 4075: > 4073: > 4074: // Update the memory inputs of ReducedAllocationMerge nodes > 4075: Unique_Node_List ram_nodes; Same here: pass over the whole graph to enumerate `ReducedAllocationMerge` nodes. For such cases, we keep node lists on Compile (see `GrowableArray` declared there). ------------- PR: https://git.openjdk.org/jdk/pull/9073 From kvn at openjdk.org Wed Oct 12 01:06:08 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Oct 2022 01:06:08 GMT Subject: RFR: 8293531: C2: some vectorapi tests fail assert "Not monotonic" with flag -XX:TypeProfileLevel=222 In-Reply-To: References: Message-ID: <0rQt0ingCCXwjojtQaF6U91kUhcU2zXBvZDlnDSZfQU=.cf7076fd-fac8-4532-bcd9-3667dd84e1e1@github.com> On Tue, 11 Oct 2022 12:19:05 GMT, Jatin Bhateja wrote: > The problem occurs in the iterative DF analysis during CCP optimization: meet operations drop the speculative types before the participating lattice values converge, since the [include_speculative argument it receives is always set to false](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/type.hpp#L231), whereas the [equality check](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/phaseX.cpp#L1751) in the failing assertion is performed against the original type still carrying the speculative
type. > > To fix this, type comparison in the assertion should also be done after stripping the speculative type, with this change intermittent assertion failures in several vector API tests reported in the bug report are no longer seen. > > Kindly review and share your feedback. > > Best Regards, > Jatin Normal testing passed. Also repeat testing with `-XX:TypeProfileLevel=222` and don't see this issue anymore. `compiler/vectorapi/reshape/TestVectorReinterpret.java` test failed IR verification. But it is not related to these changes. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10648 From jbhateja at openjdk.org Wed Oct 12 01:07:54 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 12 Oct 2022 01:07:54 GMT Subject: RFR: 8288043: Optimize FP to word/sub-word integral type conversion on X86 AVX2 platforms [v4] In-Reply-To: References: Message-ID: On Fri, 7 Oct 2022 00:06:20 GMT, Sandhya Viswanathan wrote: >>> Could you please enable the compiler/vectorapi/VectorFPtoIntCastTest.java for AVX2 platforms? Currently they are only run for AVX512DQ platforms. >> >> I have added missing casting cases AVX/AVX2 and AVX512 targets in existing comprehensive test for [casting](test/hotspot/jtreg/compiler/vectorapi/reshape/utils/TestCastMethods.java.) > > @jatin-bhateja Rest of the changes look good to me. Mainly the vector_op_pre_select_sz_estimate() needs to be corrected. Thanks @sviswa7 and @vnkozlov. 
------------- PR: https://git.openjdk.org/jdk/pull/9748 From jbhateja at openjdk.org Wed Oct 12 01:09:23 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 12 Oct 2022 01:09:23 GMT Subject: Integrated: 8288043: Optimize FP to word/sub-word integral type conversion on X86 AVX2 platforms In-Reply-To: References: Message-ID: On Thu, 4 Aug 2022 16:20:10 GMT, Jatin Bhateja wrote: > Hi All, > > This patch extends conversion optimizations added with [JDK-8287835](https://bugs.openjdk.org/browse/JDK-8287835) to optimize following floating point to integral conversions for X86 AVX2 targets:- > * D2I , D2S, D2B, F2I , F2S, F2B > > In addition, it also optimizes following wide vector (64 bytes) double to integer and sub-type conversions for AVX512 targets which do not support AVX512DQ feature. > * D2I, D2S, D2B > > Following are the JMH micro performance results with and without patch. > > System configuration: 40C 2S Icelake server (Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz) > > BENCHMARK | SIZE | BASELINE (ops/ms) | WITHOPT (ops/ms) | PERF GAIN FACTOR > -- | -- | -- | -- | -- > VectorFPtoIntCastOperations.microDouble128ToByte128 | 1024 | 90.603 | 92.797 | 1.024215534 > VectorFPtoIntCastOperations.microDouble128ToByte256 | 1024 | 81.909 | 82.3 | 1.00477359 > VectorFPtoIntCastOperations.microDouble128ToByte512 | 1024 | 26.181 | 26.244 | 1.002406325 > VectorFPtoIntCastOperations.microDouble128ToInteger128 | 1024 | 90.74 | 2537.958 | 27.96956138 > VectorFPtoIntCastOperations.microDouble128ToInteger256 | 1024 | 81.586 | 2429.599 | 29.7796068 > VectorFPtoIntCastOperations.microDouble128ToInteger512 | 1024 | 19.406 | 19.61 | 1.010512213 > VectorFPtoIntCastOperations.microDouble128ToLong128 | 1024 | 91.723 | 90.754 | 0.989435583 > VectorFPtoIntCastOperations.microDouble128ToShort128 | 1024 | 91.766 | 1984.577 | 21.62649565 > VectorFPtoIntCastOperations.microDouble128ToShort256 | 1024 | 81.949 | 1940.599 | 23.68056962 > 
VectorFPtoIntCastOperations.microDouble128ToShort512 | 1024 | 16.468 | 16.56 | 1.005586592 > VectorFPtoIntCastOperations.microDouble256ToByte128 | 1024 | 163.331 | 3018.351 | 18.479964 > VectorFPtoIntCastOperations.microDouble256ToByte256 | 1024 | 148.878 | 3082.034 | 20.70174237 > VectorFPtoIntCastOperations.microDouble256ToByte512 | 1024 | 50.108 | 51.629 | 1.030354434 > VectorFPtoIntCastOperations.microDouble256ToInteger128 | 1024 | 159.805 | 4619.421 | 28.90661118 > VectorFPtoIntCastOperations.microDouble256ToInteger256 | 1024 | 143.876 | 4649.642 | 32.31700909 > VectorFPtoIntCastOperations.microDouble256ToInteger512 | 1024 | 38.127 | 38.188 | 1.001599916 > VectorFPtoIntCastOperations.microDouble256ToLong128 | 1024 | 160.322 | 162.442 | 1.013223388 > VectorFPtoIntCastOperations.microDouble256ToLong256 | 1024 | 141.252 | 143.01 | 1.012445841 > VectorFPtoIntCastOperations.microDouble256ToShort128 | 1024 | 157.717 | 3757.471 | 23.82413437 > VectorFPtoIntCastOperations.microDouble256ToShort256 | 1024 | 143.876 | 3830.971 | 26.62689399 > VectorFPtoIntCastOperations.microDouble256ToShort512 | 1024 | 32.061 | 32.911 | 1.026511962 > VectorFPtoIntCastOperations.microFloat128ToByte128 | 1024 | 146.599 | 4002.967 | 27.30555461 > VectorFPtoIntCastOperations.microFloat128ToByte256 | 1024 | 136.99 | 3938.799 | 28.75245638 > VectorFPtoIntCastOperations.microFloat128ToByte512 | 1024 | 51.561 | 50.284 | 0.975233219 > VectorFPtoIntCastOperations.microFloat128ToInteger128 | 1024 | 5933.565 | 5361.472 | 0.903583596 > VectorFPtoIntCastOperations.microFloat128ToInteger256 | 1024 | 5079.564 | 5062.046 | 0.996551279 > VectorFPtoIntCastOperations.microFloat128ToInteger512 | 1024 | 37.101 | 38.419 | 1.035524649 > VectorFPtoIntCastOperations.microFloat128ToLong128 | 1024 | 145.863 | 145.362 | 0.99656527 > VectorFPtoIntCastOperations.microFloat128ToLong256 | 1024 | 131.159 | 133.154 | 1.015210546 > VectorFPtoIntCastOperations.microFloat128ToShort128 | 1024 | 145.966 | 4150.039 | 
28.4315457 > VectorFPtoIntCastOperations.microFloat128ToShort256 | 1024 | 134.703 | 4566.589 | 33.90116775 > VectorFPtoIntCastOperations.microFloat128ToShort512 | 1024 | 31.878 | 30.867 | 0.968285338 > VectorFPtoIntCastOperations.microFloat256ToByte128 | 1024 | 237.841 | 6292.051 | 26.4548627 > VectorFPtoIntCastOperations.microFloat256ToByte256 | 1024 | 222.041 | 6292.748 | 28.34047766 > VectorFPtoIntCastOperations.microFloat256ToByte512 | 1024 | 92.073 | 88.981 | 0.966417951 > VectorFPtoIntCastOperations.microFloat256ToInteger128 | 1024 | 11471.121 | 10269.636 | 0.895260019 > VectorFPtoIntCastOperations.microFloat256ToInteger256 | 1024 | 10729.816 | 10105.92 | 0.941853989 > VectorFPtoIntCastOperations.microFloat256ToInteger512 | 1024 | 68.328 | 70.005 | 1.024543379 > VectorFPtoIntCastOperations.microFloat256ToLong128 | 1024 | 247.101 | 248.571 | 1.005948984 > VectorFPtoIntCastOperations.microFloat256ToLong256 | 1024 | 225.74 | 223.987 | 0.992234429 > VectorFPtoIntCastOperations.microFloat256ToLong512 | 1024 | 76.39 | 76.187 | 0.997342584 > VectorFPtoIntCastOperations.microFloat256ToShort128 | 1024 | 233.196 | 8202.179 | 35.17289748 > VectorFPtoIntCastOperations.microFloat256ToShort256 | 1024 | 220.75 | 7781.073 | 35.24834881 > VectorFPtoIntCastOperations.microFloat256ToShort512 | 1024 | 58.143 | 55.633 | 0.956830573 > > Kindly review and share your feedback. > > Best Regards, > Jatin This pull request has now been integrated. 
Changeset: 2ceb80c6 Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/2ceb80c60f2c1a479e5d79aac7d983e0bf29b253 Stats: 995 lines in 10 files changed: 796 ins; 55 del; 144 mod 8288043: Optimize FP to word/sub-word integral type conversion on X86 AVX2 platforms Reviewed-by: kvn, sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/9748 From dzhang at openjdk.org Wed Oct 12 01:20:38 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 12 Oct 2022 01:20:38 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src [v2] In-Reply-To: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: > I built hsdis with the following parameters from source code of binutils while cross-compiling: > > --with-hsdis=binutils \ > --with-binutils-src=/home/dingli/jdk-tools/binutils-2.38 > > > But configure will exit with the following error: > > checking whether we are cross compiling... configure: error: in `/home/dingli/jdk-tools/binutils-2.38-src': > configure: error: cannot run C compiled programs. > If you meant to cross compile, use `--host'. > See `config.log' for more details > configure: Automatic building of binutils failed on configure. Try building it manually > configure: error: Cannot continue > configure exiting with result code 1 > > > The reason for the error is that binutils wants to be configured with --host during cross-compilation. 
So we can determine if we are currently cross-compiling and add the --host parameter to binutils_target: > > diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 > index d72bbf6df32..dddc1cf6a4d 100644 > --- a/make/autoconf/lib-hsdis.m4 > +++ b/make/autoconf/lib-hsdis.m4 > @@ -175,7 +175,11 @@ AC_DEFUN([LIB_BUILD_BINUTILS], > fi > else > binutils_cc="$CC $SYSROOT_CFLAGS" > - binutils_target="" > + if test "x$host" = "x$build"; then > + binutils_target="" > + else > + binutils_target="--host=$host" > + fi > fi > binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" > > > > In the meantime, I removed some useless code about hsdis-demo because hsdis-demo.c was removed in [JDK-8275128](https://bugs.openjdk.org/browse/JDK-8275128) . > > ## Testing: > > - cross compile for RISC-V on x86_64 Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Remove old hsdis Makefile ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10628/files - new: https://git.openjdk.org/jdk/pull/10628/files/21cc4d06..e937b657 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10628&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10628&range=00-01 Stats: 206 lines in 1 file changed: 0 ins; 206 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10628.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10628/head:pull/10628 PR: https://git.openjdk.org/jdk/pull/10628 From dzhang at openjdk.org Wed Oct 12 01:23:25 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 12 Oct 2022 01:23:25 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src [v3] In-Reply-To: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: > I built hsdis with the following parameters from source code of 
binutils while cross-compiling: > > --with-hsdis=binutils \ > --with-binutils-src=/home/dingli/jdk-tools/binutils-2.38 > > > But configure will exit with the following error: > > checking whether we are cross compiling... configure: error: in `/home/dingli/jdk-tools/binutils-2.38-src': > configure: error: cannot run C compiled programs. > If you meant to cross compile, use `--host'. > See `config.log' for more details > configure: Automatic building of binutils failed on configure. Try building it manually > configure: error: Cannot continue > configure exiting with result code 1 > > > The reason for the error is that binutils wants to be configured with --host during cross-compilation. So we can determine if we are currently cross-compiling and add the --host parameter to binutils_target: > > diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 > index d72bbf6df32..dddc1cf6a4d 100644 > --- a/make/autoconf/lib-hsdis.m4 > +++ b/make/autoconf/lib-hsdis.m4 > @@ -175,7 +175,11 @@ AC_DEFUN([LIB_BUILD_BINUTILS], > fi > else > binutils_cc="$CC $SYSROOT_CFLAGS" > - binutils_target="" > + if test "x$host" = "x$build"; then > + binutils_target="" > + else > + binutils_target="--host=$host" > + fi > fi > binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" > > > > In the meantime, I removed some useless code about hsdis-demo because hsdis-demo.c was removed in [JDK-8275128](https://bugs.openjdk.org/browse/JDK-8275128) . 
> > ## Testing: > > - cross compile for RISC-V on x86_64 Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Use COMPILE_TYPE and OPENJDK_TARGET_AUTOCONF_NAME instead ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10628/files - new: https://git.openjdk.org/jdk/pull/10628/files/e937b657..f18f31d9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10628&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10628&range=01-02 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/10628.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10628/head:pull/10628 PR: https://git.openjdk.org/jdk/pull/10628 From dzhang at openjdk.org Wed Oct 12 01:25:22 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 12 Oct 2022 01:25:22 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src [v4] In-Reply-To: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: <-d_kjF_i0HhCPvZZVFFMzJSj36MHbfxfoxXpNoJFvaA=.b4199717-eac0-4eae-ae61-f3ce96c86892@github.com> > I built hsdis with the following parameters from source code of binutils while cross-compiling: > > --with-hsdis=binutils \ > --with-binutils-src=/home/dingli/jdk-tools/binutils-2.38 > > > But configure will exit with the following error: > > checking whether we are cross compiling... configure: error: in `/home/dingli/jdk-tools/binutils-2.38-src': > configure: error: cannot run C compiled programs. > If you meant to cross compile, use `--host'. > See `config.log' for more details > configure: Automatic building of binutils failed on configure. 
Try building it manually > configure: error: Cannot continue > configure exiting with result code 1 > > > The reason for the error is that binutils wants to be configured with --host during cross-compilation. So we can determine if we are currently cross-compiling and add the --host parameter to binutils_target: > > diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 > index d72bbf6df32..dddc1cf6a4d 100644 > --- a/make/autoconf/lib-hsdis.m4 > +++ b/make/autoconf/lib-hsdis.m4 > @@ -175,7 +175,11 @@ AC_DEFUN([LIB_BUILD_BINUTILS], > fi > else > binutils_cc="$CC $SYSROOT_CFLAGS" > - binutils_target="" > + if test "x$host" = "x$build"; then > + binutils_target="" > + else > + binutils_target="--host=$host" > + fi > fi > binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" > > > > In the meantime, I removed some useless code about hsdis-demo because hsdis-demo.c was removed in [JDK-8275128](https://bugs.openjdk.org/browse/JDK-8275128) . > > ## Testing: > > - cross compile for RISC-V on x86_64 Dingli Zhang has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains two commits: - Use COMPILE_TYPE and OPENJDK_TARGET_AUTOCONF_NAME instead - Add hsdis-src support for cross-compile ------------- Changes: https://git.openjdk.org/jdk/pull/10628/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10628&range=03 Stats: 5 lines in 1 file changed: 4 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10628.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10628/head:pull/10628 PR: https://git.openjdk.org/jdk/pull/10628 From dzhang at openjdk.org Wed Oct 12 01:29:36 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 12 Oct 2022 01:29:36 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src In-Reply-To: References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: On Tue, 11 Oct 2022 12:57:28 GMT, Aleksey Shipilev wrote: >> Ok, we already have an exported value for `$host`, which is `$OPENJDK_TARGET_AUTOCONF_NAME`. Also, `$conf_openjdk_target` is used in the wrapper configure script. It is probably leaking into the main generated autoconf script, but it is definitely not supposed to be used there. Instead, it should only be used to setup the `--host=` option to autoconf. So looking for `$host` is fine I suppose, but we should do it using the OPENJDK_TARGET_AUTOCONF_NAME variable. > >> Ok, we already have an exported value for `$host`, which is `$OPENJDK_TARGET_AUTOCONF_NAME`. Also, `$conf_openjdk_target` is used in the wrapper configure script. It is probably leaking into the main generated autoconf script, but it is definitely not supposed to be used there. Instead, it should only be used to setup the `--host=` option to autoconf. So looking for `$host` is fine I suppose, but we should do it using the OPENJDK_TARGET_AUTOCONF_NAME variable. > > Quite! 
> > Applying this patch over the PR: > > > diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 > index dddc1cf6a4d..72bd08c7108 100644 > --- a/make/autoconf/lib-hsdis.m4 > +++ b/make/autoconf/lib-hsdis.m4 > @@ -175,10 +175,10 @@ AC_DEFUN([LIB_BUILD_BINUTILS], > fi > else > binutils_cc="$CC $SYSROOT_CFLAGS" > - if test "x$host" = "x$build"; then > - binutils_target="" > + if test "x$COMPILE_TYPE" = xcross; then > + binutils_target="--host=$OPENJDK_TARGET_AUTOCONF_NAME" > else > - binutils_target="--host=$host" > + binutils_target="" > fi > fi > binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" > > > ...successfully produces the hsdis binaries on all these platforms: > > > server-release-aarch64-linux-gnu-10 > server-release-arm-linux-gnueabihf-10 > server-release-i686-linux-gnu-10 > server-release-powerpc64le-linux-gnu-10 > server-release-powerpc64-linux-gnu-10 > server-release-riscv64-linux-gnu-10 > server-release-s390x-linux-gnu-10 > server-release-x86_64-linux-gnu-10 > zero-release-aarch64-linux-gnu-10 > zero-release-alpha-linux-gnu-10 > zero-release-arm-linux-gnueabi-10 > zero-release-arm-linux-gnueabihf-10 > zero-release-i686-linux-gnu-10 > zero-release-m68k-linux-gnu-10 > zero-release-mips64el-linux-gnuabi64-10 > zero-release-mipsel-linux-gnu-10 > zero-release-powerpc64le-linux-gnu-10 > zero-release-powerpc64-linux-gnu-10 > zero-release-powerpc-linux-gnu-10 > zero-release-riscv64-linux-gnu-10 > zero-release-s390x-linux-gnu-10 > zero-release-sh4-linux-gnu-10 > zero-release-sparc64-linux-gnu-10 > zero-release-x86_64-linux-gnu-10 > > > Therefore, I believe this is what we should do and then call it a day. (Then I also need to start building all these hsdis-es at [https://builds.shipilev.net/hsdis/](https://builds.shipilev.net/hsdis/)) @shipilev @magicus Thanks for the review! I have skipped the demo Makefile changes from this PR and applied the patch from @shipilev. 
------------- PR: https://git.openjdk.org/jdk/pull/10628 From duke at openjdk.org Wed Oct 12 01:38:59 2022 From: duke at openjdk.org (Joshua Cao) Date: Wed, 12 Oct 2022 01:38:59 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level Message-ID: Example: [~/jdk/jdk]$ build/linux-x86_64-server-fastdebug/jdk/bin/java -XX:CompileCommand=PrintCompilation,java.lang.String::length Hello CompileCommand: PrintCompilation java/lang/String.length bool PrintCompilation = true 223 12 3 java.lang.String::length (11 bytes) 405 307 4 java.lang.String::length (11 bytes) hello world Running `java -XX:+PrintCompilation` still prints every method. This change also moves the declaration of `elapsedTimer`, but it should have insignificant impact on actual elapsed time. --- Additionally, I make a change to `test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java` so that it does not depend on PrintCompilation. The test was failing because it updates global `PrintCompilation` during the middle of the run, but this does not change the value of `PrintCompilationOption` for individual CompileTask directives. I modified the test so that it is similar to other [WhiteBox vm_flag test](https://github.com/openjdk/jdk/tree/master/test/lib-test/jdk/test/whitebox/vm_flags). It still tests `VmFlagTest.WHITE_BOX::get/setBooleanVMFlag`, without having to depend on the behavior on the specific flag. 
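For reference, a minimal `Hello` class along these lines reproduces the run above (the class body is a guess; any workload that calls `String::length` often enough to get it compiled would do):

```java
public class Hello {
    public static void main(String[] args) {
        String s = "hello world";
        int n = 0;
        // Call String::length enough times that the method is likely to get
        // hot and be JIT-compiled, so the CompileCommand has something to print.
        for (int i = 0; i < 100_000; i++) {
            n += s.length();
        }
        System.out.println(s);
    }
}
```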
------------- Commit messages: - BooleanTest does not rely on PrintCompilation - 8255746: Make PrintCompilation available on a per method level Changes: https://git.openjdk.org/jdk/pull/10668/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10668&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8255746 Stats: 53 lines in 4 files changed: 8 ins; 41 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/10668.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10668/head:pull/10668 PR: https://git.openjdk.org/jdk/pull/10668 From xgong at openjdk.org Wed Oct 12 01:44:15 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 12 Oct 2022 01:44:15 GMT Subject: RFR: 8292898: [vectorapi] Unify vector mask cast operation [v9] In-Reply-To: References: Message-ID: On Tue, 11 Oct 2022 07:25:56 GMT, Xiaohong Gong wrote: >> The current implementation of the vector mask cast operation is >> complex that the compiler generates different patterns for different >> scenarios. For architectures that do not support the predicate >> feature, vector mask is represented the same as the normal vector. >> So the vector mask cast is implemented by `VectorCast `node. But this >> is not always needed. When two masks have the same element size (e.g. >> int vs. float), their bits layout are the same. So casting between >> them does not need to emit any instructions. >> >> Currently the compiler generates different patterns based on the >> vector type of the input/output and the platforms. Normally the >> "`VectorMaskCast`" op is only used for cases that doesn't emit any >> instructions, and "`VectorCast`" op is used to implement the necessary >> expand/narrow operations. This can avoid adding some duplicate rules >> in the backend. However, this also has the drawbacks: >> >> 1) The codes are complex, especially when the compiler needs to >> check whether the hardware supports the necessary IRs for the >> vector mask cast. It needs to check different patterns for >> different cases. 
>> 2) The vector mask cast operation could be implemented with cheaper >> instructions than the vector casting on some architectures. >> >> Instead of generating `VectorCast `or `VectorMaskCast `nodes for different >> cases of vector mask cast operations, this patch unifies the vector >> mask cast implementation with "`VectorMaskCast`" node for all vector types >> and platforms. The missing backend rules are also added for it. >> >> This patch also simplifies the vector mask conversion that happens in >> "`VectorUnbox::Ideal()`". Normally "`VectorUnbox (VectorBox vmask)`" can >> be optimized to "`vmask`" if the unboxing type matches the boxed >> "`vmask`" type. Otherwise, it needs the type conversion. Currently the >> "`VectorUnbox`" will be transformed to two different patterns to implement >> the conversion: >> >> 1) If the element size is not changed, it is transformed to: >> >> "VectorMaskCast vmask" >> >> 2) Otherwise, it is transformed to: >> >> "VectorLoadMask (VectorStoreMask vmask)" >> >> It first converts the "`vmask`" to a boolean vector with "`VectorStoreMask`", >> and then uses "`VectorLoadMask`" to convert the boolean vector to the >> dst mask vector. Since this patch makes the "`VectorMaskCast`" op supported >> for all types on all platforms, it doesn't need the "`VectorLoadMask`" and >> "`VectorStoreMask`" to do the conversion. The existing transformation: >> >> VectorUnbox (VectorBox vmask) => VectorLoadMask (VectorStoreMask vmask) >> >> can be simplified to: >> >> VectorUnbox (VectorBox vmask) => VectorMaskCast vmask > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 18 commits: > > - Merge latest 'jdk:master' > - Use "setDefaultWarmup" instead of adding the annotation for each test > - Merge branch 'jdk:master' into JDK-8292898 > - Change to use "avx512vl" cpu feature for some IR tests > - Add the IR test and fix review comments on x86 backend > - Remove untaken code paths on x86 match rules > - Add assertion to the elem num for mast cast > - Merge branch 'jdk:master' into JDK-8292898 > - 8292898: [vectorapi] Unify vector mask cast operation > - Merge branch 'jdk:master' into JDK-8291600 > - ... and 8 more: https://git.openjdk.org/jdk/compare/9d116ec1...5aab47d5 The GHA test pass all here https://github.com/XiaohongGong/jdk/actions/runs/3224880587 ------------- PR: https://git.openjdk.org/jdk/pull/10192 From xgong at openjdk.org Wed Oct 12 01:44:15 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 12 Oct 2022 01:44:15 GMT Subject: Integrated: 8292898: [vectorapi] Unify vector mask cast operation In-Reply-To: References: Message-ID: On Wed, 7 Sep 2022 06:03:51 GMT, Xiaohong Gong wrote: > The current implementation of the vector mask cast operation is > complex that the compiler generates different patterns for different > scenarios. For architectures that do not support the predicate > feature, vector mask is represented the same as the normal vector. > So the vector mask cast is implemented by `VectorCast `node. But this > is not always needed. When two masks have the same element size (e.g. > int vs. float), their bits layout are the same. So casting between > them does not need to emit any instructions. > > Currently the compiler generates different patterns based on the > vector type of the input/output and the platforms. Normally the > "`VectorMaskCast`" op is only used for cases that doesn't emit any > instructions, and "`VectorCast`" op is used to implement the necessary > expand/narrow operations. This can avoid adding some duplicate rules > in the backend. 
However, this also has the drawbacks: > > 1) The code is complex, especially when the compiler needs to > check whether the hardware supports the necessary IRs for the > vector mask cast. It needs to check different patterns for > different cases. > 2) The vector mask cast operation could be implemented with cheaper > instructions than the vector casting on some architectures. > > Instead of generating `VectorCast `or `VectorMaskCast `nodes for different > cases of vector mask cast operations, this patch unifies the vector > mask cast implementation with "`VectorMaskCast`" node for all vector types > and platforms. The missing backend rules are also added for it. > > This patch also simplifies the vector mask conversion that happens in > "`VectorUnbox::Ideal()`". Normally "`VectorUnbox (VectorBox vmask)`" can > be optimized to "`vmask`" if the unboxing type matches the boxed > "`vmask`" type. Otherwise, it needs the type conversion. Currently the > "`VectorUnbox`" will be transformed to two different patterns to implement > the conversion: > > 1) If the element size is not changed, it is transformed to: > > "VectorMaskCast vmask" > > 2) Otherwise, it is transformed to: > > "VectorLoadMask (VectorStoreMask vmask)" > > It first converts the "`vmask`" to a boolean vector with "`VectorStoreMask`", > and then uses "`VectorLoadMask`" to convert the boolean vector to the > dst mask vector. Since this patch makes the "`VectorMaskCast`" op supported > for all types on all platforms, it doesn't need the "`VectorLoadMask`" and > "`VectorStoreMask`" to do the conversion. The existing transformation: > > VectorUnbox (VectorBox vmask) => VectorLoadMask (VectorStoreMask vmask) > > can be simplified to: > > VectorUnbox (VectorBox vmask) => VectorMaskCast vmask This pull request has now been integrated.
Changeset: ab8c1361 Author: Xiaohong Gong URL: https://git.openjdk.org/jdk/commit/ab8c1361bc03a8afe016c82f1ad3da9204626d72 Stats: 596 lines in 10 files changed: 289 ins; 141 del; 166 mod 8292898: [vectorapi] Unify vector mask cast operation Co-authored-by: Quan Anh Mai Reviewed-by: jbhateja, eliu ------------- PR: https://git.openjdk.org/jdk/pull/10192 From fgao at openjdk.org Wed Oct 12 01:54:15 2022 From: fgao at openjdk.org (Fei Gao) Date: Wed, 12 Oct 2022 01:54:15 GMT Subject: RFR: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov [v2] In-Reply-To: References: Message-ID: > After JDK-8139340, JDK-8192846 and JDK-8289422, we can vectorize > the case below by enabling -XX:+UseCMoveUnconditionally and > -XX:+UseVectorCmov: > > // double[] a, double[] b, double[] c; > for (int i = 0; i < a.length; i++) { > c[i] = (a[i] > b[i]) ? a[i] : b[i]; > } > > > But we don't support the case like: > > // double[] a; > // int seed; > for (int i = 0; i < a.length; i++) { > a[i] = (i % 2 == 0) ? seed + i : seed - i; > } > > because the IR nodes for the CMoveD in the loop is: > > AddI AndI AddD SubD > \ / / / > CmpI / / > \ / / > Bool / / > \ / / > CMoveD > > > and it is not our target pattern, which requires that the inputs > of Cmp node must be the same as the inputs of CMove node > as commented in CMoveKit::make_cmovevd_pack(). Because > we can't vectorize the CMoveD pack, we shouldn't vectorize > its inputs, AddD and SubD. But the current function > CMoveKit::make_cmovevd_pack() doesn't clear the unqualified > CMoveD pack from the packset. In this way, superword wrongly > vectorizes AddD and SubD. Finally, we get a scalar CMoveD > node with two vector inputs, AddVD and SubVD, which has > wrong mixing types, then the assertion fails. > > To fix it, we need to remove the unvectorized CMoveD pack > from the packset and clear related map info. Fei Gao has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Refine the function and clean up the code style - Merge branch 'master' into fg8293833 - 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov After JDK-8139340, JDK-8192846 and JDK-8289422, we can vectorize the case below by enabling -XX:+UseCMoveUnconditionally and -XX:+UseVectorCmov: ``` // double[] a, double[] b, double[] c; for (int i = 0; i < a.length; i++) { c[i] = (a[i] > b[i]) ? a[i] : b[i]; } ``` But we don't support a case like: ``` // double[] a; // int seed; for (int i = 0; i < a.length; i++) { a[i] = (i % 2 == 0) ? seed + i : seed - i; } ``` because the IR nodes for the CMoveD in the loop are: ``` AddI AndI AddD SubD \ / / / CmpI / / \ / / Bool / / \ / / CMoveD ``` and it is not our target pattern, which requires that the inputs of the Cmp node must be the same as the inputs of the CMove node, as commented in CMoveKit::make_cmovevd_pack(). Because we can't vectorize the CMoveD pack, we shouldn't vectorize its inputs, AddD and SubD. But the current function CMoveKit::make_cmovevd_pack() doesn't clear the unqualified CMoveD pack from the packset. In this way, superword wrongly vectorizes AddD and SubD. Finally, we get a scalar CMoveD node with two vector inputs, AddVD and SubVD, which wrongly mixes types, and the assertion fails. To fix it, we need to remove the unvectorized CMoveD pack from the packset and clear related map info.
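As a runnable scalar sketch of the supported shape, where the Cmp inputs match the CMove inputs so SuperWord can pack both (array contents are made up):

```java
import java.util.Arrays;

public class CMovePattern {
    public static void main(String[] args) {
        double[] a = {1.0, 4.0, 2.5, 8.0};
        double[] b = {3.0, 2.0, 2.5, 9.0};
        double[] c = new double[a.length];
        // Vectorizable form: CmpD(a[i], b[i]) feeds a CMoveD whose value
        // inputs are the same a[i] and b[i], the pattern that
        // CMoveKit::make_cmovevd_pack() accepts.
        for (int i = 0; i < a.length; i++) {
            c[i] = (a[i] > b[i]) ? a[i] : b[i];
        }
        System.out.println(Arrays.toString(c)); // [3.0, 4.0, 2.5, 9.0]
    }
}
```

In the unsupported loop from the description, the CMoveD's value inputs (seed + i, seed - i) differ from the Cmp inputs (i % 2, 0), so the pack must be rejected.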
------------- Changes: - all: https://git.openjdk.org/jdk/pull/10627/files - new: https://git.openjdk.org/jdk/pull/10627/files/1b615da3..a113675d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10627&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10627&range=00-01 Stats: 1024 lines in 55 files changed: 486 ins; 230 del; 308 mod Patch: https://git.openjdk.org/jdk/pull/10627.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10627/head:pull/10627 PR: https://git.openjdk.org/jdk/pull/10627 From fgao at openjdk.org Wed Oct 12 01:54:16 2022 From: fgao at openjdk.org (Fei Gao) Date: Wed, 12 Oct 2022 01:54:16 GMT Subject: RFR: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov [v2] In-Reply-To: References: Message-ID: <406U9D4XP64C-C4W-oArQAZODId-l6w8CAaFD2jgyQ0=.ac99bd0f-db21-46b8-85cf-2cca6bf5ab8c@github.com> On Mon, 10 Oct 2022 08:49:14 GMT, Christian Hagedorn wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Refine the function and clean up the code style >> - Merge branch 'master' into fg8293833 >> - 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov >> >> After JDK-8139340, JDK-8192846 and JDK-8289422, we can vectorize >> the case below by enabling -XX:+UseCMoveUnconditionally and >> -XX:+UseVectorCmov: >> ``` >> // double[] a, double[] b, double[] c; >> for (int i = 0; i < a.length; i++) { >> c[i] = (a[i] > b[i]) ? a[i] : b[i]; >> } >> ``` >> >> But we don't support the case like: >> ``` >> // double[] a; >> // int seed; >> for (int i = 0; i < a.length; i++) >> a[i] = (i % 2 == 0) ? 
seed + i : seed - i; >> } >> ``` >> because the IR nodes for the CMoveD in the loop is: >> ``` >> AddI AndI AddD SubD >> \ / / / >> CmpI / / >> \ / / >> Bool / / >> \ / / >> CMoveD >> ``` >> >> and it is not our target pattern, which requires that the inputs >> of Cmp node must be the same as the inputs of CMove node as >> commented in CMoveKit::make_cmovevd_pack(). Because we can't >> vectorize the CMoveD pack, we shouldn't vectorize its inputs, >> AddD and SubD. But the current function >> CMoveKit::make_cmovevd_pack() doesn't clear the unqualified >> CMoveD pack from the packset. In this way, superword wrongly >> vectorizes AddD and SubD. Finally, we get a scalar CMoveD node >> with two vector inputs, AddVD and SubVD, which has wrong mixing >> types, then the assertion fails. >> >> To fix it, we need to remove the unvectorized CMoveD pack from >> the packset and clear related map info. > > src/hotspot/share/opto/superword.cpp line 1981: > >> 1979: } >> 1980: >> 1981: Node_List* new_cmpd_pk = new Node_List(); > > The following suggestion is just an idea as I was a little bit confused by how you use the return value of `make_cmovevd_pack` to remove the cmove pack and its related packs. Intuitively, I would have expected that this "make method" returns the newly created pack instead. > > Maybe it's cleaner if you split this method into a "should merge" method with the check > > if ((cmovd->Opcode() != Op_CMoveF && cmovd->Opcode() != Op_CMoveD) || > pack(cmovd) != NULL /* already in the cmov pack */) { > return NULL; > } > > a "can merge" method that checks all the other constraints and an actual "make pack" method with the code starting at this line. 
Then you could use these methods in `merge_packs_to_cmovd` like that in pseudo-code: > > void SuperWord::merge_packs_to_cmovd() { > for (int i = _packset.length() - 1; i >= 0; i--) { > Node_List* pack = _packset.at(i); > if (_cmovev_kit.should_merge(pack)) { > if (_cmovev_kit.can_merge(pack)) { > _cmovev_kit.make_cmovevd_pack(pack) > } else { > remove_cmove_and_related_packs(pack); > } > } > } > ... @chhagedorn thanks for your great suggestion. It did make the code much easier to understand. I've done the refactoring in the latest commit. Please help review. Thanks for your time! > test/hotspot/jtreg/compiler/c2/TestCondAddDeadBranch.java line 32: > >> 30: * @run main/othervm -Xcomp -XX:-TieredCompilation -XX:CompileOnly=TestCondAddDeadBranch TestCondAddDeadBranch >> 31: * @run main/othervm -Xcomp -XX:-TieredCompilation -XX:CompileOnly=TestCondAddDeadBranch >> 32: * -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov -XX:MaxVectorSize=32 TestCondAddDeadBranch > > As the cmove flags are C2 specific, we should also add a `@requires vm.compiler2.enabled`. Same for the other test. Done. Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/10627 From fgao at openjdk.org Wed Oct 12 02:41:03 2022 From: fgao at openjdk.org (Fei Gao) Date: Wed, 12 Oct 2022 02:41:03 GMT Subject: RFR: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov In-Reply-To: <3L3jBCCbTRD4b1tmu1m3L6kxmZJYWs6S7x9frmO87-k=.0a86f24e-7f81-41b5-9ac8-9e039b3cf240@github.com> References: <3L3jBCCbTRD4b1tmu1m3L6kxmZJYWs6S7x9frmO87-k=.0a86f24e-7f81-41b5-9ac8-9e039b3cf240@github.com> Message-ID: <43OMu9M8XuTJnnmaPWlI6diDiHJfFhpIBU73b7otjJ8=.cb7bc181-32ca-4c0c-925f-5c4c1b0065a3@github.com> On Mon, 10 Oct 2022 09:32:56 GMT, Quan Anh Mai wrote: > May I ask if we can vectorise `Bool -> Cmp` into `VectorMaskCmp` and `CMove` into `VectorBlend`, this would help vectorise the pattern you mention in the description instead of bailing out? Thanks. 
@merykitty Thanks for your kind review and question. That's really an interesting idea. IMO, vectorizing `Bool -> Cmp` and `CMove` separately, to support more cases, deserves a deep investigation. I'm not sure if it's feasible. But for the case in the description, even trying the idea, we still can't vectorize the case because we can't vectorize `i % 2` currently. In this way, we can't vectorize any chain involving `i % 2`. ------------- PR: https://git.openjdk.org/jdk/pull/10627 From eliu at openjdk.org Wed Oct 12 05:51:04 2022 From: eliu at openjdk.org (Eric Liu) Date: Wed, 12 Oct 2022 05:51:04 GMT Subject: RFR: 8293409: [vectorapi] Intrinsify VectorSupport.indexVector In-Reply-To: References: Message-ID: On Mon, 19 Sep 2022 08:51:24 GMT, Xiaohong Gong wrote: > "`VectorSupport.indexVector()`" is used to compute a vector that contains the index values based on a given vector and a scale value (`i.e. index = vec + iota * scale`). This function is widely used in other APIs like "`VectorMask.indexInRange`" which is useful to the tail loop vectorization. And it can be easily implemented with the vector instructions. > > This patch adds the vector intrinsic implementation of it. The steps are: > > 1) Load the const "iota" vector. > > We extend the "`vector_iota_indices`" stubs from byte to other integral types. For floating point vectors, it needs an additional vector cast to get the right iota values. > > 2) Compute indexes with "`vec + iota * scale`" > > Here is the performance result to the new added micro benchmark on ARM NEON: > > Benchmark Gain > IndexVectorBenchmark.byteIndexVector 1.477 > IndexVectorBenchmark.doubleIndexVector 5.031 > IndexVectorBenchmark.floatIndexVector 5.342 > IndexVectorBenchmark.intIndexVector 5.529 > IndexVectorBenchmark.longIndexVector 3.177 > IndexVectorBenchmark.shortIndexVector 5.841 > > > Please help to review and share the feedback! Thanks in advance! AArch64 part looks good to me. 
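A scalar model of the `index = vec + iota * scale` computation described above (the lane count, base and scale are made-up values):

```java
import java.util.Arrays;

public class IndexVectorSketch {
    public static void main(String[] args) {
        int lanes = 4;                 // hypothetical 4-lane int vector
        int scale = 3;
        int[] vec = {10, 10, 10, 10};  // broadcast base value
        int[] index = new int[lanes];
        // iota is the constant {0, 1, ..., lanes-1} vector loaded from the stub;
        // each lane then computes vec + iota * scale.
        for (int i = 0; i < lanes; i++) {
            index[i] = vec[i] + i * scale;
        }
        System.out.println(Arrays.toString(index)); // [10, 13, 16, 19]
    }
}
```

With the intrinsic, the loop body collapses to one vector add and one vector multiply against the pre-built iota constant.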
------------- Marked as reviewed by eliu (Committer). PR: https://git.openjdk.org/jdk/pull/10332 From ihse at openjdk.org Wed Oct 12 06:23:18 2022 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Wed, 12 Oct 2022 06:23:18 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src [v4] In-Reply-To: <-d_kjF_i0HhCPvZZVFFMzJSj36MHbfxfoxXpNoJFvaA=.b4199717-eac0-4eae-ae61-f3ce96c86892@github.com> References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> <-d_kjF_i0HhCPvZZVFFMzJSj36MHbfxfoxXpNoJFvaA=.b4199717-eac0-4eae-ae61-f3ce96c86892@github.com> Message-ID: On Wed, 12 Oct 2022 01:25:22 GMT, Dingli Zhang wrote: >> I built hsdis with the following parameters from source code of binutils while cross-compiling: >> >> --with-hsdis=binutils \ >> --with-binutils-src=/home/dingli/jdk-tools/binutils-2.38 >> >> >> But configure will exit with the following error: >> >> checking whether we are cross compiling... configure: error: in `/home/dingli/jdk-tools/binutils-2.38-src': >> configure: error: cannot run C compiled programs. >> If you meant to cross compile, use `--host'. >> See `config.log' for more details >> configure: Automatic building of binutils failed on configure. Try building it manually >> configure: error: Cannot continue >> configure exiting with result code 1 >> >> >> The reason for the error is that binutils wants to be configured with --host during cross-compilation. 
So we can determine if we are currently cross-compiling and add the --host parameter to binutils_target: >> >> diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 >> index d72bbf6df32..dddc1cf6a4d 100644 >> --- a/make/autoconf/lib-hsdis.m4 >> +++ b/make/autoconf/lib-hsdis.m4 >> @@ -175,7 +175,11 @@ AC_DEFUN([LIB_BUILD_BINUTILS], >> fi >> else >> binutils_cc="$CC $SYSROOT_CFLAGS" >> - binutils_target="" >> + if test "x$host" = "x$build"; then >> + binutils_target="" >> + else >> + binutils_target="--host=$host" >> + fi >> fi >> binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" >> >> >> >> In the meantime, I removed some useless code about hsdis-demo because hsdis-demo.c was removed in [JDK-8275128](https://bugs.openjdk.org/browse/JDK-8275128) . >> >> ## Testing: >> >> - cross compile for RISC-V on x86_64 > > Dingli Zhang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: > > - Use COMPILE_TYPE and OPENJDK_TARGET_AUTOCONF_NAME instead > - Add hsdis-src support for cross-compile Thank you. This looks good now. ------------- Marked as reviewed by ihse (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10628 From shade at openjdk.org Wed Oct 12 06:27:37 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Oct 2022 06:27:37 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src [v4] In-Reply-To: <-d_kjF_i0HhCPvZZVFFMzJSj36MHbfxfoxXpNoJFvaA=.b4199717-eac0-4eae-ae61-f3ce96c86892@github.com> References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> <-d_kjF_i0HhCPvZZVFFMzJSj36MHbfxfoxXpNoJFvaA=.b4199717-eac0-4eae-ae61-f3ce96c86892@github.com> Message-ID: On Wed, 12 Oct 2022 01:25:22 GMT, Dingli Zhang wrote: >> I built hsdis with the following parameters from source code of binutils while cross-compiling: >> >> --with-hsdis=binutils \ >> --with-binutils-src=/home/dingli/jdk-tools/binutils-2.38 >> >> >> But configure will exit with the following error: >> >> checking whether we are cross compiling... configure: error: in `/home/dingli/jdk-tools/binutils-2.38-src': >> configure: error: cannot run C compiled programs. >> If you meant to cross compile, use `--host'. >> See `config.log' for more details >> configure: Automatic building of binutils failed on configure. Try building it manually >> configure: error: Cannot continue >> configure exiting with result code 1 >> >> >> The reason for the error is that binutils wants to be configured with --host during cross-compilation. 
So we can determine if we are currently cross-compiling and add the --host parameter to binutils_target: >> >> diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 >> index d72bbf6df32..dddc1cf6a4d 100644 >> --- a/make/autoconf/lib-hsdis.m4 >> +++ b/make/autoconf/lib-hsdis.m4 >> @@ -175,7 +175,11 @@ AC_DEFUN([LIB_BUILD_BINUTILS], >> fi >> else >> binutils_cc="$CC $SYSROOT_CFLAGS" >> - binutils_target="" >> + if test "x$host" = "x$build"; then >> + binutils_target="" >> + else >> + binutils_target="--host=$host" >> + fi >> fi >> binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" >> >> >> >> In the meantime, I removed some useless code about hsdis-demo because hsdis-demo.c was removed in [JDK-8275128](https://bugs.openjdk.org/browse/JDK-8275128) . >> >> ## Testing: >> >> - cross compile for RISC-V on x86_64 > > Dingli Zhang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: > > - Use COMPILE_TYPE and OPENJDK_TARGET_AUTOCONF_NAME instead > - Add hsdis-src support for cross-compile Testing... 
------------- PR: https://git.openjdk.org/jdk/pull/10628 From shade at openjdk.org Wed Oct 12 07:24:13 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Oct 2022 07:24:13 GMT Subject: RFR: 8295033: hsdis configure error when cross-compiling with --with-binutils-src [v4] In-Reply-To: <-d_kjF_i0HhCPvZZVFFMzJSj36MHbfxfoxXpNoJFvaA=.b4199717-eac0-4eae-ae61-f3ce96c86892@github.com> References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> <-d_kjF_i0HhCPvZZVFFMzJSj36MHbfxfoxXpNoJFvaA=.b4199717-eac0-4eae-ae61-f3ce96c86892@github.com> Message-ID: On Wed, 12 Oct 2022 01:25:22 GMT, Dingli Zhang wrote: >> I built hsdis with the following parameters from source code of binutils while cross-compiling: >> >> --with-hsdis=binutils \ >> --with-binutils-src=/home/dingli/jdk-tools/binutils-2.38 >> >> >> But configure will exit with the following error: >> >> checking whether we are cross compiling... configure: error: in `/home/dingli/jdk-tools/binutils-2.38-src': >> configure: error: cannot run C compiled programs. >> If you meant to cross compile, use `--host'. >> See `config.log' for more details >> configure: Automatic building of binutils failed on configure. Try building it manually >> configure: error: Cannot continue >> configure exiting with result code 1 >> >> >> The reason for the error is that binutils wants to be configured with --host during cross-compilation. 
So we can determine if we are currently cross-compiling and add the --host parameter to binutils_target: >> >> diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 >> index d72bbf6df32..dddc1cf6a4d 100644 >> --- a/make/autoconf/lib-hsdis.m4 >> +++ b/make/autoconf/lib-hsdis.m4 >> @@ -175,7 +175,11 @@ AC_DEFUN([LIB_BUILD_BINUTILS], >> fi >> else >> binutils_cc="$CC $SYSROOT_CFLAGS" >> - binutils_target="" >> + if test "x$host" = "x$build"; then >> + binutils_target="" >> + else >> + binutils_target="--host=$host" >> + fi >> fi >> binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" >> >> >> >> In the meantime, I removed some useless code about hsdis-demo because hsdis-demo.c was removed in [JDK-8275128](https://bugs.openjdk.org/browse/JDK-8275128) . >> >> ## Testing: >> >> - cross compile for RISC-V on x86_64 > > Dingli Zhang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: > > - Use COMPILE_TYPE and OPENJDK_TARGET_AUTOCONF_NAME instead > - Add hsdis-src support for cross-compile Marked as reviewed by shade (Reviewer). 
------------- PR: https://git.openjdk.org/jdk/pull/10628 From dzhang at openjdk.org Wed Oct 12 07:27:30 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 12 Oct 2022 07:27:30 GMT Subject: Integrated: 8295033: hsdis configure error when cross-compiling with --with-binutils-src In-Reply-To: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> References: <7Wdr-gXbChQY_IO3AugyoiHBW9EMfY8QV1mkqf4rO9s=.8a19002f-247b-47a3-bd6a-baf60db38f3a@github.com> Message-ID: On Mon, 10 Oct 2022 06:32:09 GMT, Dingli Zhang wrote: > I built hsdis with the following parameters from source code of binutils while cross-compiling: > > --with-hsdis=binutils \ > --with-binutils-src=/home/dingli/jdk-tools/binutils-2.38 > > > But configure will exit with the following error: > > checking whether we are cross compiling... configure: error: in `/home/dingli/jdk-tools/binutils-2.38-src': > configure: error: cannot run C compiled programs. > If you meant to cross compile, use `--host'. > See `config.log' for more details > configure: Automatic building of binutils failed on configure. Try building it manually > configure: error: Cannot continue > configure exiting with result code 1 > > > The reason for the error is that binutils wants to be configured with --host during cross-compilation. 
So we can determine if we are currently cross-compiling and add the --host parameter to binutils_target: > > diff --git a/make/autoconf/lib-hsdis.m4 b/make/autoconf/lib-hsdis.m4 > index d72bbf6df32..dddc1cf6a4d 100644 > --- a/make/autoconf/lib-hsdis.m4 > +++ b/make/autoconf/lib-hsdis.m4 > @@ -175,7 +175,11 @@ AC_DEFUN([LIB_BUILD_BINUTILS], > fi > else > binutils_cc="$CC $SYSROOT_CFLAGS" > - binutils_target="" > + if test "x$host" = "x$build"; then > + binutils_target="" > + else > + binutils_target="--host=$host" > + fi > fi > binutils_cflags="$binutils_cflags $MACHINE_FLAG $JVM_PICFLAG $C_O_FLAG_NORM" > > > > In the meantime, I removed some useless code about hsdis-demo because hsdis-demo.c was removed in [JDK-8275128](https://bugs.openjdk.org/browse/JDK-8275128) . > > ## Testing: > > - cross compile for RISC-V on x86_64 This pull request has now been integrated. Changeset: 392f35df Author: Dingli Zhang Committer: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/392f35df4be1a9a8d7a67a25ae01230c7dd060ac Stats: 5 lines in 1 file changed: 4 ins; 0 del; 1 mod 8295033: hsdis configure error when cross-compiling with --with-binutils-src Reviewed-by: erikj, ihse, shade ------------- PR: https://git.openjdk.org/jdk/pull/10628 From chagedorn at openjdk.org Wed Oct 12 07:30:47 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 12 Oct 2022 07:30:47 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level In-Reply-To: References: Message-ID: On Wed, 12 Oct 2022 01:31:45 GMT, Joshua Cao wrote: > Example: > > > [~/jdk/jdk]$ build/linux-x86_64-server-fastdebug/jdk/bin/java -XX:CompileCommand=PrintCompilation,java.lang.String::length Hello > CompileCommand: PrintCompilation java/lang/String.length bool PrintCompilation = true > 223 12 3 java.lang.String::length (11 bytes) > 405 307 4 java.lang.String::length (11 bytes) > hello world > > > Running `java -XX:+PrintCompilation` still prints every method. 
This change also moves the declaration of `elapsedTimer`, but it should have insignificant impact on actual elapsed time. > > --- > > Additionally, I made a change to `test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java` so that it does not depend on PrintCompilation. The test was failing because it updates the global `PrintCompilation` in the middle of the run, but this does not change the value of `PrintCompilationOption` for individual CompileTask directives. > > I modified the test so that it is similar to the other [WhiteBox vm_flag tests](https://github.com/openjdk/jdk/tree/master/test/lib-test/jdk/test/whitebox/vm_flags). It still tests `VmFlagTest.WHITE_BOX::get/setBooleanVMFlag`, without having to depend on the behavior of the specific flag. That's a nice improvement which will also be beneficial for the IR framework. At the moment, the IR framework just prints all methods with `PrintCompilation`. Fine-tuning the VM to only print the methods we IR match on would improve performance. src/hotspot/share/compiler/compileBroker.cpp line 2138: > 2136: task->print_tty(); > 2137: } > 2138: elapsedTimer time; Shouldn't this timer start at the very top of the method? Printing needs time but it kinda adds to the compilation time, so it's probably not wrong to count that time as well when explicitly using print flags. ------------- PR: https://git.openjdk.org/jdk/pull/10668 From haosun at openjdk.org Wed Oct 12 10:50:10 2022 From: haosun at openjdk.org (Hao Sun) Date: Wed, 12 Oct 2022 10:50:10 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension In-Reply-To: References: Message-ID: On Fri, 23 Sep 2022 11:13:40 GMT, Bhavana Kilambi wrote: > Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors.
This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - > > eor a, a, b > eor a, a, c > > can be optimized to single instruction - `eor3 a, b, c` > > This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - > > > Benchmark gain > TestEor3.test1Int 10.87% > TestEor3.test1Long 8.84% > TestEor3.test2Int 21.68% > TestEor3.test2Long 21.04% > > > The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. test/hotspot/jtreg/compiler/vectorization/TestEor3AArch64.java line 38: > 36: * @summary Test EOR3 Neon/SVE2 instruction for aarch64 SHA3 extension > 37: * @library /test/lib / > 38: * @requires os.arch == "aarch64" & vm.cpu.features ~=".*sha3.*" Suggestion: * @requires os.arch == "aarch64" & vm.cpu.features ~= ".*sha3.*" nit: [style] it's better to have one extra space. 
------------- PR: https://git.openjdk.org/jdk/pull/10407 From qamai at openjdk.org Wed Oct 12 11:58:47 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 12 Oct 2022 11:58:47 GMT Subject: RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v12] In-Reply-To: References: Message-ID: <1qzngp8Z8spVxoU3C8PxQgqkCJFw3anZqp8_mn8qI2s=.2db33f71-30cf-4365-9ba6-d05146fc8771@github.com> > This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and do a `CmpI` to get back the flags. As a result, `VectorMask::alltrue` is compiled into machine codes: > > vptest xmm0, xmm1 > jb if_true > if_false: > > instead of: > > vptest xmm0, xmm1 > setb r10 > movzbl r10 > testl r10 > jne if_true > if_false: > > The results of `jdk.incubator.vector.ArrayMismatchBenchmark` shows noticeable improvements: > > Before After > Benchmark Prefix Size Mode Cnt Score Error Score Error Units Change > ArrayMismatchBenchmark.mismatchVectorByte 0.5 9 thrpt 10 217345.383 ? 8316.444 222279.381 ? 2660.983 ops/ms +2.3% > ArrayMismatchBenchmark.mismatchVectorByte 0.5 257 thrpt 10 113918.406 ? 1618.836 116268.691 ? 1291.899 ops/ms +2.1% > ArrayMismatchBenchmark.mismatchVectorByte 0.5 100000 thrpt 10 702.066 ? 72.862 797.806 ? 16.429 ops/ms +13.6% > ArrayMismatchBenchmark.mismatchVectorByte 1.0 9 thrpt 10 146096.564 ? 2401.258 145338.910 ? 687.453 ops/ms -0.5% > ArrayMismatchBenchmark.mismatchVectorByte 1.0 257 thrpt 10 60598.181 ? 1259.397 69041.519 ? 1073.156 ops/ms +13.9% > ArrayMismatchBenchmark.mismatchVectorByte 1.0 100000 thrpt 10 316.814 ? 10.975 408.770 ? 5.281 ops/ms +29.0% > ArrayMismatchBenchmark.mismatchVectorDouble 0.5 9 thrpt 10 195674.549 ? 1200.166 188482.433 ? 
1872.076 ops/ms -3.7% > ArrayMismatchBenchmark.mismatchVectorDouble 0.5 257 thrpt 10 44357.169 ? 473.013 42293.411 ? 2838.255 ops/ms -4.7% > ArrayMismatchBenchmark.mismatchVectorDouble 0.5 100000 thrpt 10 68.199 ? 5.410 67.628 ? 3.241 ops/ms -0.8% > ArrayMismatchBenchmark.mismatchVectorDouble 1.0 9 thrpt 10 107722.450 ? 1677.607 111060.400 ? 982.230 ops/ms +3.1% > ArrayMismatchBenchmark.mismatchVectorDouble 1.0 257 thrpt 10 16692.645 ? 1002.599 21440.506 ? 1618.266 ops/ms +28.4% > ArrayMismatchBenchmark.mismatchVectorDouble 1.0 100000 thrpt 10 32.984 ? 0.548 33.202 ? 2.365 ops/ms +0.7% > ArrayMismatchBenchmark.mismatchVectorInt 0.5 9 thrpt 10 335458.217 ? 3154.842 379944.254 ? 5703.134 ops/ms +13.3% > ArrayMismatchBenchmark.mismatchVectorInt 0.5 257 thrpt 10 58505.302 ? 786.312 56721.368 ? 2497.052 ops/ms -3.0% > ArrayMismatchBenchmark.mismatchVectorInt 0.5 100000 thrpt 10 133.037 ? 11.415 139.537 ? 4.667 ops/ms +4.9% > ArrayMismatchBenchmark.mismatchVectorInt 1.0 9 thrpt 10 117943.802 ? 2281.349 112409.365 ? 2110.055 ops/ms -4.7% > ArrayMismatchBenchmark.mismatchVectorInt 1.0 257 thrpt 10 27060.015 ? 795.619 33756.613 ? 826.533 ops/ms +24.7% > ArrayMismatchBenchmark.mismatchVectorInt 1.0 100000 thrpt 10 57.558 ? 8.927 66.951 ? 4.381 ops/ms +16.3% > ArrayMismatchBenchmark.mismatchVectorLong 0.5 9 thrpt 10 182963.715 ? 1042.497 182438.405 ? 2120.832 ops/ms -0.3% > ArrayMismatchBenchmark.mismatchVectorLong 0.5 257 thrpt 10 36672.215 ? 614.821 35397.398 ? 1609.235 ops/ms -3.5% > ArrayMismatchBenchmark.mismatchVectorLong 0.5 100000 thrpt 10 66.438 ? 2.142 65.427 ? 2.270 ops/ms -1.5% > ArrayMismatchBenchmark.mismatchVectorLong 1.0 9 thrpt 10 110393.047 ? 497.853 115165.845 ? 5381.674 ops/ms +4.3% > ArrayMismatchBenchmark.mismatchVectorLong 1.0 257 thrpt 10 14720.765 ? 661.350 19871.096 ? 201.464 ops/ms +35.0% > ArrayMismatchBenchmark.mismatchVectorLong 1.0 100000 thrpt 10 30.760 ? 0.821 31.933 ? 
1.352 ops/ms +3.8% > > I have not been able to conduct throughout testing on AVX512 and Aarch64 so any help would be invaluable. Thank you very much. Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 29 commits: - Merge branch 'master' into improveVTest - redundant casts - remove untaken code paths - Merge branch 'master' into improveVTest - Merge branch 'master' into improveVTest - Merge branch 'master' into improveVTest - fix merge problems - Merge branch 'master' into improveVTest - refactor x86 - revert renaming temp - ... and 19 more: https://git.openjdk.org/jdk/compare/86ec158d...05c1b9f5 ------------- Changes: https://git.openjdk.org/jdk/pull/9855/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9855&range=11 Stats: 492 lines in 23 files changed: 212 ins; 171 del; 109 mod Patch: https://git.openjdk.org/jdk/pull/9855.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9855/head:pull/9855 PR: https://git.openjdk.org/jdk/pull/9855 From duke at openjdk.org Wed Oct 12 13:25:10 2022 From: duke at openjdk.org (Joshua Cao) Date: Wed, 12 Oct 2022 13:25:10 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level In-Reply-To: References: Message-ID: On Wed, 12 Oct 2022 07:18:04 GMT, Christian Hagedorn wrote: >> Example: >> >> >> [~/jdk/jdk]$ build/linux-x86_64-server-fastdebug/jdk/bin/java -XX:CompileCommand=PrintCompilation,java.lang.String::length Hello >> CompileCommand: PrintCompilation java/lang/String.length bool PrintCompilation = true >> 223 12 3 java.lang.String::length (11 bytes) >> 405 307 4 java.lang.String::length (11 bytes) >> hello world >> >> >> Running `java -XX:+PrintCompilation` still prints every method. This change also moves the declaration of `elapsedTimer`, but it should have insignificant impact on actual elapsed time. 
>> >> --- >> >> Additionally, I make a change to `test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java` so that it does not depend on PrintCompilation. The test was failing because it updates global `PrintCompilation` during the middle of the run, but this does not change the value of `PrintCompilationOption` for individual CompileTask directives. >> >> I modified the test so that it is similar to other [WhiteBox vm_flag test](https://github.com/openjdk/jdk/tree/master/test/lib-test/jdk/test/whitebox/vm_flags). It still tests `VmFlagTest.WHITE_BOX::get/setBooleanVMFlag`, without having to depend on the behavior on the specific flag. > > src/hotspot/share/compiler/compileBroker.cpp line 2138: > >> 2136: task->print_tty(); >> 2137: } >> 2138: elapsedTimer time; > > Shouldn't this timer start at the very top of the method? Printing needs time but it kinda adds to the compilation time, so it's probably not wrong to count that time as well when explicitly using print flags. I've kept it this way because the timer was started after printing prior to this change as well. If a dev is benchmarking compilation time, it is better to not include the timer. Printing adds noise that is not there during real-world runs. ------------- PR: https://git.openjdk.org/jdk/pull/10668 From chagedorn at openjdk.org Wed Oct 12 13:36:09 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 12 Oct 2022 13:36:09 GMT Subject: RFR: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov [v2] In-Reply-To: References: Message-ID: <4tYsZFZ4og659mb3lTR-63DgWfw856o_Q5y9LoJp90k=.b9b4fc64-22e7-476e-b5da-69827986fe0a@github.com> On Wed, 12 Oct 2022 01:54:15 GMT, Fei Gao wrote: >> After JDK-8139340, JDK-8192846 and JDK-8289422, we can vectorize >> the case below by enabling -XX:+UseCMoveUnconditionally and >> -XX:+UseVectorCmov: >> >> // double[] a, double[] b, double[] c; >> for (int i = 0; i < a.length; i++) { >> c[i] = (a[i] > b[i]) ? 
a[i] : b[i]; >> } >> >> >> But we don't support the case like: >> >> // double[] a; >> // int seed; >> for (int i = 0; i < a.length; i++) { >> a[i] = (i % 2 == 0) ? seed + i : seed - i; >> } >> >> because the IR nodes for the CMoveD in the loop is: >> >> AddI AndI AddD SubD >> \ / / / >> CmpI / / >> \ / / >> Bool / / >> \ / / >> CMoveD >> >> >> and it is not our target pattern, which requires that the inputs >> of Cmp node must be the same as the inputs of CMove node >> as commented in CMoveKit::make_cmovevd_pack(). Because >> we can't vectorize the CMoveD pack, we shouldn't vectorize >> its inputs, AddD and SubD. But the current function >> CMoveKit::make_cmovevd_pack() doesn't clear the unqualified >> CMoveD pack from the packset. In this way, superword wrongly >> vectorizes AddD and SubD. Finally, we get a scalar CMoveD >> node with two vector inputs, AddVD and SubVD, which has >> wrong mixing types, then the assertion fails. >> >> To fix it, we need to remove the unvectorized CMoveD pack >> from the packset and clear related map info. > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Refine the function and clean up the code style > - Merge branch 'master' into fg8293833 > - 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov > > After JDK-8139340, JDK-8192846 and JDK-8289422, we can vectorize > the case below by enabling -XX:+UseCMoveUnconditionally and > -XX:+UseVectorCmov: > ``` > // double[] a, double[] b, double[] c; > for (int i = 0; i < a.length; i++) { > c[i] = (a[i] > b[i]) ? a[i] : b[i]; > } > ``` > > But we don't support the case like: > ``` > // double[] a; > // int seed; > for (int i = 0; i < a.length; i++) > a[i] = (i % 2 == 0) ? 
seed + i : seed - i; > } > ``` > because the IR nodes for the CMoveD in the loop is: > ``` > AddI AndI AddD SubD > \ / / / > CmpI / / > \ / / > Bool / / > \ / / > CMoveD > ``` > > and it is not our target pattern, which requires that the inputs > of Cmp node must be the same as the inputs of CMove node as > commented in CMoveKit::make_cmovevd_pack(). Because we can't > vectorize the CMoveD pack, we shouldn't vectorize its inputs, > AddD and SubD. But the current function > CMoveKit::make_cmovevd_pack() doesn't clear the unqualified > CMoveD pack from the packset. In this way, superword wrongly > vectorizes AddD and SubD. Finally, we get a scalar CMoveD node > with two vector inputs, AddVD and SubVD, which has wrong mixing > types, then the assertion fails. > > To fix it, we need to remove the unvectorized CMoveD pack from > the packset and clear related map info. Thanks for doing the update, looks good to me! It's indeed much easier to follow the code now. I'll submit some testing. src/hotspot/share/opto/superword.cpp line 1936: > 1934: > 1935: bool CMoveKit::is_cmove_pack_candidate(Node_List* cmove_pk) { > 1936: Node *cmove = cmove_pk->at(0); Suggestion: Node* cmove = cmove_pk->at(0); src/hotspot/share/opto/superword.cpp line 1947: > 1945: // i.e. bool node pack and cmp node pack, can be successfully merged for vectorization. > 1946: bool CMoveKit::can_merge_cmove_pack(Node_List* cmove_pk) { > 1947: Node *cmove = cmove_pk->at(0); Suggestion: Node* cmove = cmove_pk->at(0); src/hotspot/share/opto/superword.cpp line 1993: > 1991: // new pack and delete the old cmove pack and related packs from the packset. > 1992: void CMoveKit::make_cmove_pack(Node_List* cmove_pk) { > 1993: Node *cmove = cmove_pk->at(0); Suggestion: Node* cmove = cmove_pk->at(0); ------------- Marked as reviewed by chagedorn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10627 From xxinliu at amazon.com Wed Oct 12 14:58:29 2022 From: xxinliu at amazon.com (Liu, Xin) Date: Wed, 12 Oct 2022 07:58:29 -0700 Subject: RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: References: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> <94A2737B-0FDE-4159-827F-15B6ACE1BC42@oracle.com> <114af950-f6b6-7e4a-8ac0-3da99bd40297@amazon.com> <2f29160c-7368-7c11-924e-a626e42c3aa2@amazon.com> <6d5c2aa5-c684-bc42-765d-ed116d3ef43c@oracle.com> Message-ID: <0bc75ee6-641f-1145-8fde-6d11e2ec887e@amazon.com> hi, Vladimir, > You should show that your implementation can rematirealize an object at any escape site. My understanding is I suppose to 'materialize' an object at any escape site. 'rematerialize' refers to 'create an scalar-replaced object on heap' in deoptimization. It's for interpreter as if the object was created in the first place. It doesn't apply to an escaped object because it's marked 'GlobalEscaped' in C2 EA. Okay. I will try this idea! thanks, --lx On 10/11/22 3:12 PM, Vladimir Kozlov wrote: > Also in your test there should be no merge at safepoint2 because `obj` is "not alive" (not referenced) anymore. -------------- next part -------------- A non-text attachment was scrubbed... Name: OpenPGP_0xB9D934C61E047B0D.asc Type: application/pgp-keys Size: 3675 bytes Desc: OpenPGP public key URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: OpenPGP_signature Type: application/pgp-signature Size: 665 bytes Desc: OpenPGP digital signature URL: From tanksherman27 at gmail.com Wed Oct 12 15:18:18 2022 From: tanksherman27 at gmail.com (Julian Waters) Date: Wed, 12 Oct 2022 23:18:18 +0800 Subject: Point where code installed by C1 and C2 is loaded? Message-ID: Hi all, I apologise if this is a silly question, but where exactly is code installed by C1 and C2 loaded and executed by the runtime? 
I've tried looking through the entirety of hotspot, but haven't found anything that seems to be related. All I can surmise is that the compiler interface ultimately creates an nmethod that allocates itself on the CodeCache using a CodeBuffer containing C1 or C2 emitted instructions, and then Method::set_code sets the method's _code to reference that entry in the cache (or method->method_holder()->add_osr_nmethod(nm); is called in other circumstances, I don't quite understand what the difference is but I assume the end result is probably the same). Given my rudimentary understanding of hotspot's execution pipeline I'd assume that when a new frame (frame.hpp) is created, the frame's code blob would be set to reference the nmethod in the method that was called, or otherwise somehow jump back to the interpreter if that method hasn't been compiled yet. But there doesn't seem to be any point where method->code() is called to load the instructions emitted by either C1 or C2 into a frame, so where does that happen? I guess this is probably more a question of how hotspot runs loaded programs in general, which seems to me at a glance like it's chaining assembly in CodeBlobs together and jumping to the next blob/codelet (in the next frame?) when it's finished, but I can't really figure out where those codelets are set for each frame, or how it chooses between one compiled by C1 or C2, and the handwritten assembly codelets that make up the interpreter (or for that matter how it even finds the correct interpreter codelet). I appreciate any help with this query, sorry if this isn't the correct list to post this question to, but it seemed like the most appropriate. best regards, Julian -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From chagedorn at openjdk.org Wed Oct 12 15:32:05 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 12 Oct 2022 15:32:05 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level In-Reply-To: References: Message-ID: <5IRZaWg2_zpcI95aHeI77y_QlUwg7-A9zFBHk6qwLL8=.3cbe4abc-3c39-4534-82ab-6af4e5243942@github.com> On Wed, 12 Oct 2022 13:21:38 GMT, Joshua Cao wrote: >> src/hotspot/share/compiler/compileBroker.cpp line 2138: >> >>> 2136: task->print_tty(); >>> 2137: } >>> 2138: elapsedTimer time; >> >> Shouldn't this timer start at the very top of the method? Printing needs time but it kinda adds to the compilation time, so it's probably not wrong to count that time as well when explicitly using print flags. > > I've kept it this way because the timer was started after printing prior to this change as well. If a dev is benchmarking compilation time, it is better to not include the timer. Printing adds noise that is not there during real-world runs. Okay, I don't have a strong opinion about it. In the old code, we somehow did it in-between: We printed UL, then with `PrintCompilation`, then started the timer and then printed to `CompilationLog::log()`. So, I'm not sure what the original intention was. Maybe someone else can comment on that, too. 
------------- PR: https://git.openjdk.org/jdk/pull/10668 From xliu at openjdk.org Wed Oct 12 16:11:14 2022 From: xliu at openjdk.org (Xin Liu) Date: Wed, 12 Oct 2022 16:11:14 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level In-Reply-To: References: Message-ID: On Wed, 12 Oct 2022 01:31:45 GMT, Joshua Cao wrote: > Example: > > > [~/jdk/jdk]$ build/linux-x86_64-server-fastdebug/jdk/bin/java -XX:CompileCommand=PrintCompilation,java.lang.String::length Hello > CompileCommand: PrintCompilation java/lang/String.length bool PrintCompilation = true > 223 12 3 java.lang.String::length (11 bytes) > 405 307 4 java.lang.String::length (11 bytes) > hello world > > > Running `java -XX:+PrintCompilation` still prints every method. This change also moves the declaration of `elapsedTimer`, but it should have insignificant impact on actual elapsed time. > > --- > > Additionally, I make a change to `test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java` so that it does not depend on PrintCompilation. The test was failing because it updates global `PrintCompilation` during the middle of the run, but this does not change the value of `PrintCompilationOption` for individual CompileTask directives. > > I modified the test so that it is similar to other [WhiteBox vm_flag test](https://github.com/openjdk/jdk/tree/master/test/lib-test/jdk/test/whitebox/vm_flags). It still tests `VmFlagTest.WHITE_BOX::get/setBooleanVMFlag`, without having to depend on the behavior on the specific flag. test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java line 71: > 69: } > 70: > 71: private static void testFunctional(boolean value) throws Exception { Since this test has been there, why don't you adapt it to test your per-method PrintCompilation? 
------------- PR: https://git.openjdk.org/jdk/pull/10668 From vladimir.x.ivanov at oracle.com Wed Oct 12 17:12:54 2022 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 12 Oct 2022 10:12:54 -0700 Subject: Point where code installed by C1 and C2 is loaded? In-Reply-To: References: Message-ID: <50c72ad1-6961-4fc6-2fea-cf8e1f32e87b@oracle.com> At runtime, every method is invoked either through Method::_from_compiled_entry or Method::_from_interpreted_entry (depending on where the call is performed from). As part of nmethod installation (ciEnv::register_method()), entry points of the relevant method (the root of the compilation) are updated (by Method::set_code()) and Method::_from_compiled_entry starts to point to the entry point of the freshly installed nmethod. (Also, there is a special case of OSR compilation, but I don't cover it here.) src/hotspot/share/oops/method.hpp [1]: class Method : public Metadata { ... // Entry point for calling from compiled code, to compiled code if it exists // or else the interpreter. volatile address _from_compiled_entry; // Cache of: _code ? _code->entry_point() : _adapter->c2i_entry() // The entry point for calling both from and to compiled code is // "_code->entry_point()". Because of tiered compilation and de-opt, this // field can come and go. It can transition from NULL to not-null at any // time (whenever a compile completes). It can transition from not-null to // NULL only at safepoints (because of a de-opt). CompiledMethod* volatile _code; // Points to the corresponding piece of native code volatile address _from_interpreted_entry; // Cache of _code ? _adapter->i2c_entry() : _i2i_entry Hp Best regards, Vladimir Ivanov [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/oops/method.hpp#L109-L118 On 10/12/22 08:18, Julian Waters wrote: > Hi all, > > I apologise if this is a silly question, but where exactly is code > installed by C1 and C2 loaded and executed by the runtime? 
I've tried > looking through the entirety of hotspot, but haven't found anything that > seems to be related. All I can surmise is that the compiler interface > ultimately creates an nmethod?that allocates itself on the CodeCache > using a CodeBuffer containing C1 or C2 emitted instructions, and then > Method::set_code sets the method's _code to reference that entry in the > cache (or?method->method_holder()->add_osr_nmethod(nm); is called in > other circumstances, I don't quite understand what the difference is but > I assume the end result is probably the same). Given my rudimentary > understanding of hotspot's execution pipeline I'd assume that when a new > frame (frame.hpp) is created, the frame's code blob would be set to > reference the nmethod?in the method that was called, or otherwise > somehow jump back to the interpreter if that method hasn't been compiled > yet. But there doesn't seem to be any point where method->code() is > called to load the instructions emitted by either C1 or C2 into a frame, > so where does that happen? > > I guess this is probably more a question of how hotspot runs loaded > programs in general, which seems to me at a glance like it's chaining > assembly in CodeBlobs together and jumping to the next blob/codelet (in > the next frame?) when it's finished, but I can't really figure out where > those?codelets are set for each frame, or how it chooses between one > compiled by C1 or C2, and the handwritten assembly codelets that make up > the interpreter (or for that matter how it even finds the correct > interpreter?codelet). > > I appreciate any help with this query, sorry if this isn't the correct > list to post this question to, but it seemed like the most appropriate. 
> > best regards, > Julian From dnsimon at openjdk.org Wed Oct 12 17:20:01 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 12 Oct 2022 17:20:01 GMT Subject: RFR: 8295225: [JVMCI] codeStart should be cleared when entryPoint is cleared Message-ID: When the `InstalledCode.entryPoint` field is [cleared](https://github.com/openjdk/jdk/search?q=set_InstalledCode_entryPoint), the `HotSpotInstalledCode.codeStart` field should also be cleared. That is, when making an nmethod non-entrant, all Java fields pointing to code in the nmethod should be cleared. This avoids an inconsistent view of the code. ------------- Commit messages: - clear codeStart when entryPoint is cleared Changes: https://git.openjdk.org/jdk/pull/10682/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10682&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295225 Stats: 6 lines in 3 files changed: 6 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10682.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10682/head:pull/10682 PR: https://git.openjdk.org/jdk/pull/10682 From kvn at openjdk.org Wed Oct 12 17:20:07 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Oct 2022 17:20:07 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level In-Reply-To: <5IRZaWg2_zpcI95aHeI77y_QlUwg7-A9zFBHk6qwLL8=.3cbe4abc-3c39-4534-82ab-6af4e5243942@github.com> References: <5IRZaWg2_zpcI95aHeI77y_QlUwg7-A9zFBHk6qwLL8=.3cbe4abc-3c39-4534-82ab-6af4e5243942@github.com> Message-ID: On Wed, 12 Oct 2022 15:28:16 GMT, Christian Hagedorn wrote: >> I've kept it this way because the timer was started after printing prior to this change as well. If a dev is benchmarking compilation time, it is better to not include the timer. Printing adds noise that is not there during real-world runs. > > Okay, I don't have a strong opinion about it. 
In the old code, we somehow did it in-between: We printed UL, then with `PrintCompilation`, then started the timer and then printed to `CompilationLog::log()`. So, I'm not sure what the original intention was. Maybe someone else can comment on that, too. `elapsedTimer time;` just declares variable. `TraceTime` uses it (with start/stop) later to accumulate time. So declaration can be placed anywhere before JVMCI code below. But I think you should be more flexible here. Consider the case when you hit `assert` at the line 2120. Before this change you will see `PrintCompilation` line, with change - you do not. Of cause in hs_err file you will see it because you did not move `log_compile`. I understand that you need to check directive. I would suggest to keep global `PrintCompilation` code and add variable to mark that line is already printed. Then check it in you code which checks directive to avoid duplicate. ------------- PR: https://git.openjdk.org/jdk/pull/10668 From kvn at openjdk.org Wed Oct 12 17:20:09 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Oct 2022 17:20:09 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level In-Reply-To: References: Message-ID: On Wed, 12 Oct 2022 16:08:48 GMT, Xin Liu wrote: >> Example: >> >> >> [~/jdk/jdk]$ build/linux-x86_64-server-fastdebug/jdk/bin/java -XX:CompileCommand=PrintCompilation,java.lang.String::length Hello >> CompileCommand: PrintCompilation java/lang/String.length bool PrintCompilation = true >> 223 12 3 java.lang.String::length (11 bytes) >> 405 307 4 java.lang.String::length (11 bytes) >> hello world >> >> >> Running `java -XX:+PrintCompilation` still prints every method. This change also moves the declaration of `elapsedTimer`, but it should have insignificant impact on actual elapsed time. >> >> --- >> >> Additionally, I make a change to `test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java` so that it does not depend on PrintCompilation. 
The test was failing because it updates global `PrintCompilation` during the middle of the run, but this does not change the value of `PrintCompilationOption` for individual CompileTask directives. >> >> I modified the test so that it is similar to other [WhiteBox vm_flag test](https://github.com/openjdk/jdk/tree/master/test/lib-test/jdk/test/whitebox/vm_flags). It still tests `VmFlagTest.WHITE_BOX::get/setBooleanVMFlag`, without having to depend on the behavior on the specific flag. > > test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java line 71: > >> 69: } >> 70: >> 71: private static void testFunctional(boolean value) throws Exception { > > Since this test has been there, why don't you adapt it to test your per-method PrintCompilation? I agree. We should test new functionality. ------------- PR: https://git.openjdk.org/jdk/pull/10668 From never at openjdk.org Wed Oct 12 17:29:13 2022 From: never at openjdk.org (Tom Rodriguez) Date: Wed, 12 Oct 2022 17:29:13 GMT Subject: RFR: 8295225: [JVMCI] codeStart should be cleared when entryPoint is cleared In-Reply-To: References: Message-ID: <8VhrQKqtxa_2SXc5sFqOE72YOe6vUOOay-AcXk0F_98=.2b506b89-83b2-4bd1-a937-328f66370c6d@github.com> On Wed, 12 Oct 2022 17:12:18 GMT, Doug Simon wrote: > When the `InstalledCode.entryPoint` field is [cleared](https://github.com/openjdk/jdk/search?q=set_InstalledCode_entryPoint), the `HotSpotInstalledCode.codeStart` field should also be cleared. That is, when making an nmethod non-entrant, all Java fields pointing to code in the nmethod should be cleared. This avoids an inconsistent view of the code. Marked as reviewed by never (Reviewer). 
------------- PR: https://git.openjdk.org/jdk/pull/10682 From vladimir.kozlov at oracle.com Wed Oct 12 18:17:37 2022 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 12 Oct 2022 11:17:37 -0700 Subject: [EXTERNAL]RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: <0bc75ee6-641f-1145-8fde-6d11e2ec887e@amazon.com> References: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> <94A2737B-0FDE-4159-827F-15B6ACE1BC42@oracle.com> <114af950-f6b6-7e4a-8ac0-3da99bd40297@amazon.com> <2f29160c-7368-7c11-924e-a626e42c3aa2@amazon.com> <6d5c2aa5-c684-bc42-765d-ed116d3ef43c@oracle.com> <0bc75ee6-641f-1145-8fde-6d11e2ec887e@amazon.com> Message-ID: <1da07de9-90d2-d4ad-188e-d7d976009f52@oracle.com> On 10/12/22 7:58 AM, Liu, Xin wrote: > hi, Vladimir, >> You should show that your implementation can rematirealize an object > at any escape site. > > My understanding is I suppose to 'materialize' an object at any escape site. Words ;^) Yes, I mistyped and misspelled. Vladimir K > > 'rematerialize' refers to 'create an scalar-replaced object on heap' in > deoptimization. It's for interpreter as if the object was created in the > first place. It doesn't apply to an escaped object because it's marked > 'GlobalEscaped' in C2 EA. > > > Okay. I will try this idea! > > thanks, > --lx > > > > > On 10/11/22 3:12 PM, Vladimir Kozlov wrote: >> Also in your test there should be no merge at safepoint2 because `obj` is "not alive" (not referenced) anymore. From duke at openjdk.org Wed Oct 12 18:54:09 2022 From: duke at openjdk.org (Joshua Cao) Date: Wed, 12 Oct 2022 18:54:09 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level In-Reply-To: References: <5IRZaWg2_zpcI95aHeI77y_QlUwg7-A9zFBHk6qwLL8=.3cbe4abc-3c39-4534-82ab-6af4e5243942@github.com> Message-ID: <61NtkSyfKGssnMw900te0rLvloG-k43PCsY7KqsNiR4=.a9c2f4fe-fb2d-445d-97ea-02b0dc762482@github.com> On Wed, 12 Oct 2022 17:15:56 GMT, Vladimir Kozlov wrote: >> Okay, I don't have a strong opinion about it. 
In the old code, we somehow did it in-between: We printed UL, then with `PrintCompilation`, then started the timer and then printed to `CompilationLog::log()`. So, I'm not sure what the original intention was. Maybe someone else can comment on that, too. > > `elapsedTimer time;` just declares the variable. `TraceTime` uses it (with start/stop) later to accumulate time. > So the declaration can be placed anywhere before the JVMCI code below. > > But I think you should be more flexible here. Consider the case when you hit the `assert` at line 2120. Before this change you will see the `PrintCompilation` line; with the change you do not. Of course, in the hs_err file you will see it because you did not move `log_compile`. I understand that you need to check the directive. I would suggest keeping the global `PrintCompilation` code and adding a variable to mark that the line is already printed. Then check it in your code which checks the directive to avoid a duplicate. I will move `elapsedTimer` to the top. No strong opinion from myself either. > I would suggest keeping the global PrintCompilation code and adding a variable to mark that the line is already printed. Then check it in your code which checks the directive to avoid a duplicate. I don't understand this. Doesn't this just print everything? --- We can do the method-specific compilation printing before the assert. This change seems to be working: diff --git a/src/hotspot/share/compiler/compileBroker.cpp b/src/hotspot/share/compiler/compileBroker.cpp index 8d0a87afc17..181dd28ba45 100644 --- a/src/hotspot/share/compiler/compileBroker.cpp +++ b/src/hotspot/share/compiler/compileBroker.cpp @@ -2093,6 +2093,7 @@ CompilerDirectives* DirectivesStack::_bottom = NULL; // void CompileBroker::invoke_compiler_on_method(CompileTask* task) { task->print_ul(); + elapsedTimer time; CompilerThread* thread = CompilerThread::current(); ResourceMark rm(thread); @@ -2117,11 +2118,16 @@ void CompileBroker::invoke_compiler_on_method(CompileTask* task) { // native.
The NoHandleMark before the transition should catch // any cases where this occurs in the future. methodHandle method(thread, task->method()); - assert(!method->is_native(), "no longer compile natives"); // Look up matching directives directive = DirectivesStack::getMatchingDirective(method, comp); task->set_directive(directive); + if (task->directive()->PrintCompilationOption) { + ResourceMark rm; + task->print_tty(); + } + + assert(!method->is_native(), "no longer compile natives"); // Update compile information when using perfdata. if (UsePerfData) { @@ -2131,12 +2137,6 @@ void CompileBroker::invoke_compiler_on_method(CompileTask* task) { DTRACE_METHOD_COMPILE_BEGIN_PROBE(method, compiler_name(task_level)); } - if (task->directive()->PrintCompilationOption) { - ResourceMark rm; - task->print_tty(); - } - elapsedTimer time; - should_break = directive->BreakAtCompileOption || task->check_break_at_flags(); if (should_log && !directive->LogOption) { should_log = false; It is still possible that an assertion fails before printing, e.g. when calling `CompilerThread::current()`. I feel like it's ok to just look at hs_err in these cases. ------------- PR: https://git.openjdk.org/jdk/pull/10668 From duke at openjdk.org Wed Oct 12 19:13:06 2022 From: duke at openjdk.org (Joshua Cao) Date: Wed, 12 Oct 2022 19:13:06 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level In-Reply-To: References: Message-ID: On Wed, 12 Oct 2022 17:16:29 GMT, Vladimir Kozlov wrote: >> test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java line 71: >> >>> 69: } >>> 70: >>> 71: private static void testFunctional(boolean value) throws Exception { >> >> Since this test has been there, why don't you adapt it to test your per-method PrintCompilation? > > I agree. We should test new functionality. I can add a new test, but I do not think BooleanTest can be adapted. BooleanTest.java specifically tests boolean flags, and would not work for compile commands.
------------- PR: https://git.openjdk.org/jdk/pull/10668 From kvn at openjdk.org Wed Oct 12 19:18:07 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Oct 2022 19:18:07 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level In-Reply-To: <61NtkSyfKGssnMw900te0rLvloG-k43PCsY7KqsNiR4=.a9c2f4fe-fb2d-445d-97ea-02b0dc762482@github.com> References: <5IRZaWg2_zpcI95aHeI77y_QlUwg7-A9zFBHk6qwLL8=.3cbe4abc-3c39-4534-82ab-6af4e5243942@github.com> <61NtkSyfKGssnMw900te0rLvloG-k43PCsY7KqsNiR4=.a9c2f4fe-fb2d-445d-97ea-02b0dc762482@github.com> Message-ID: On Wed, 12 Oct 2022 18:51:43 GMT, Joshua Cao wrote: >> `elapsedTimer time;` just declares variable. `TraceTime` uses it (with start/stop) later to accumulate time. >> So declaration can be placed anywhere before JVMCI code below. >> >> But I think you should be more flexible here. Consider the case when you hit `assert` at the line 2120. Before this change you will see `PrintCompilation` line, with change - you do not. Of cause in hs_err file you will see it because you did not move `log_compile`. I understand that you need to check directive. I would suggest to keep global `PrintCompilation` code and add variable to mark that line is already printed. Then check it in you code which checks directive to avoid duplicate. > > I will move `elapsedTimer` to top. No strong opinion from myself either. > >> I would suggest to keep global PrintCompilation code and add variable to mark that line is already printed. Then check it in you code which checks directive to avoid duplicate. > > I don't understand this. Doesn't this just print everything? > > --- > > We can do method specific compilation before the assert. 
This change seems to be working: > > > diff --git a/src/hotspot/share/compiler/compileBroker.cpp b/src/hotspot/share/compiler/compileBroker.cpp > index 8d0a87afc17..181dd28ba45 100644 > --- a/src/hotspot/share/compiler/compileBroker.cpp > +++ b/src/hotspot/share/compiler/compileBroker.cpp > @@ -2093,6 +2093,7 @@ CompilerDirectives* DirectivesStack::_bottom = NULL; > // > void CompileBroker::invoke_compiler_on_method(CompileTask* task) { > task->print_ul(); > + elapsedTimer time; > > CompilerThread* thread = CompilerThread::current(); > ResourceMark rm(thread); > @@ -2117,11 +2118,16 @@ void CompileBroker::invoke_compiler_on_method(CompileTask* task) { > // native. The NoHandleMark before the transition should catch > // any cases where this occurs in the future. > methodHandle method(thread, task->method()); > - assert(!method->is_native(), "no longer compile natives"); > > // Look up matching directives > directive = DirectivesStack::getMatchingDirective(method, comp); > task->set_directive(directive); > + if (task->directive()->PrintCompilationOption) { > + ResourceMark rm; > + task->print_tty(); > + } > + > + assert(!method->is_native(), "no longer compile natives"); > > // Update compile information when using perfdata. > if (UsePerfData) { > @@ -2131,12 +2137,6 @@ void CompileBroker::invoke_compiler_on_method(CompileTask* task) { > DTRACE_METHOD_COMPILE_BEGIN_PROBE(method, compiler_name(task_level)); > } > > - if (task->directive()->PrintCompilationOption) { > - ResourceMark rm; > - task->print_tty(); > - } > - elapsedTimer time; > - > should_break = directive->BreakAtCompileOption || task->check_break_at_flags(); > if (should_log && !directive->LogOption) { > should_log = false; > > > Although there is still possible that there is failed assertion before printing, i.e. when calling `CompilerThread::current()`. I feel like its ok to just look at hs_err in these cases. May be we are looking the wrong way. 
Can we do `task->set_directive(directive)` when we create `task`? Then you don't need to move PrintCompilation code - `task->directive()` will be available at the beginning of this method. ------------- PR: https://git.openjdk.org/jdk/pull/10668 From kvn at openjdk.org Wed Oct 12 19:24:05 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Oct 2022 19:24:05 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level In-Reply-To: References: Message-ID: On Wed, 12 Oct 2022 19:09:25 GMT, Joshua Cao wrote: >> I agree. We should test new functionality. > > I can add new test, but I do not think BooleanTest can be adapted. BooleanTest.java specifically tests boolean flags, and would not work for compile commands. Yes to new test. But we still need to test global `-XX:+PrintCompilation` ------------- PR: https://git.openjdk.org/jdk/pull/10668 From vlivanov at openjdk.org Wed Oct 12 22:18:16 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 12 Oct 2022 22:18:16 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v13] In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 16:50:28 GMT, Cesar Soares Lucas wrote: >> Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? >> >> The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: >> 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). >> 2) Scalar Replace the incoming allocations to the RAM node. >> 3) Scalar Replace the RAM node itself. 
>> >> There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: >> >> - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. >> >> These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: >> >> - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. >> - The way I check if there is an incoming Allocate node to the original Phi node. >> - The way I check if there is no store to the merged objects after they are merged. >> >> Testing: >> - Windows/Linux/MAC fastdebug/release >> - hotspot_all >> - tier1 >> - Renaissance >> - dacapo >> - new IR-based tests > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Fix x86 tests. src/hotspot/share/opto/escape.cpp line 522: > 520: if (child->is_Store()) { > 521: assert(child->in(0) != NULL || m->in(0) != NULL, "No control for store or AddP."); > 522: if (_igvn->is_dominator(merge_phi_region, child->in(0) != NULL ? child->in(0) : m->in(0))) { `PhaseIterGVN::is_dominator()` is a conservative approximation and can result in false negatives. ------------- PR: https://git.openjdk.org/jdk/pull/9073 From jbhateja at openjdk.org Thu Oct 13 07:26:06 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 13 Oct 2022 07:26:06 GMT Subject: RFR: 8293409: [vectorapi] Intrinsify VectorSupport.indexVector In-Reply-To: References: Message-ID: <_wyFWAET_qXwwj-9Iq9AsPAGbT3AXIwN6HujmwZVRPw=.9c652886-4255-4c03-89d9-e3c74f9f319a@github.com> On Mon, 19 Sep 2022 08:51:24 GMT, Xiaohong Gong wrote: > "`VectorSupport.indexVector()`" is used to compute a vector that contains the index values based on a given vector and a scale value (`i.e. index = vec + iota * scale`). 
This function is widely used in other APIs like "`VectorMask.indexInRange`" which is useful for tail loop vectorization. And it can be easily implemented with the vector instructions. > > This patch adds the vector intrinsic implementation of it. The steps are: > > 1) Load the const "iota" vector. > > We extend the "`vector_iota_indices`" stubs from byte to other integral types. For floating point vectors, it needs an additional vector cast to get the right iota values. > > 2) Compute indexes with "`vec + iota * scale`" > > Here is the performance result of the newly added micro benchmark on ARM NEON: > > Benchmark Gain > IndexVectorBenchmark.byteIndexVector 1.477 > IndexVectorBenchmark.doubleIndexVector 5.031 > IndexVectorBenchmark.floatIndexVector 5.342 > IndexVectorBenchmark.intIndexVector 5.529 > IndexVectorBenchmark.longIndexVector 3.177 > IndexVectorBenchmark.shortIndexVector 5.841 > > > Please help to review and share the feedback! Thanks in advance! src/hotspot/share/opto/vectorIntrinsics.cpp line 2949: > 2947: } else if (elem_bt == T_DOUBLE) { > 2948: iota = gvn().transform(new VectorCastL2XNode(iota, vt)); > 2949: } Since we are loading constants from stub-initialized memory locations, defining new stubs for floating point iota indices may eliminate the need for costly conversion instructions. Especially on X86, conversion between Long and Double is only supported by AVX512DQ targets, so intrinsification may fail for legacy targets. src/hotspot/share/opto/vectorIntrinsics.cpp line 2978: > 2976: case T_DOUBLE: { > 2977: scale = gvn().transform(new ConvI2LNode(scale)); > 2978: scale = gvn().transform(new ConvL2DNode(scale)); A prior target-support check for these IR nodes may prevent surprises in the backend. src/hotspot/share/opto/vectorIntrinsics.cpp line 2978: > 2976: case T_DOUBLE: { > 2977: scale = gvn().transform(new ConvI2LNode(scale)); > 2978: scale = gvn().transform(new ConvL2DNode(scale)); Any specific reason for not directly using ConvI2D for the double case?
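For readers following along, the scalar semantics of the `indexVector` operation under discussion can be sketched in plain Java. This is an illustrative model only — the class and method names below are made up for the sketch and are not actual JDK internals; the real intrinsic computes the same values with vector instructions and the `vector_iota_indices` stub constants.

```java
import java.util.Arrays;

// Scalar model of VectorSupport.indexVector: result[i] = vec[i] + iota[i] * scale,
// where iota = {0, 1, 2, ...}. Hypothetical names, for illustration only.
public class IndexVectorModel {

    static int[] indexVector(int[] vec, int scale) {
        int[] result = new int[vec.length];
        for (int i = 0; i < vec.length; i++) {
            result[i] = vec[i] + i * scale; // iota[i] == i
        }
        return result;
    }

    public static void main(String[] args) {
        // With vec = {10, 10, 10, 10} and scale = 2, the indexes are 10, 12, 14, 16.
        System.out.println(Arrays.toString(indexVector(new int[] {10, 10, 10, 10}, 2)));
    }
}
```

The model also shows why a per-element iota constant is needed: the only loop-dependent input is the running lane index, which is exactly what the iota stubs provide.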
------------- PR: https://git.openjdk.org/jdk/pull/10332 From xgong at openjdk.org Thu Oct 13 07:32:07 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 13 Oct 2022 07:32:07 GMT Subject: RFR: 8293409: [vectorapi] Intrinsify VectorSupport.indexVector In-Reply-To: <_wyFWAET_qXwwj-9Iq9AsPAGbT3AXIwN6HujmwZVRPw=.9c652886-4255-4c03-89d9-e3c74f9f319a@github.com> References: <_wyFWAET_qXwwj-9Iq9AsPAGbT3AXIwN6HujmwZVRPw=.9c652886-4255-4c03-89d9-e3c74f9f319a@github.com> Message-ID: On Thu, 13 Oct 2022 07:18:24 GMT, Jatin Bhateja wrote: >> "`VectorSupport.indexVector()`" is used to compute a vector that contains the index values based on a given vector and a scale value (`i.e. index = vec + iota * scale`). This function is widely used in other APIs like "`VectorMask.indexInRange`" which is useful to the tail loop vectorization. And it can be easily implemented with the vector instructions. >> >> This patch adds the vector intrinsic implementation of it. The steps are: >> >> 1) Load the const "iota" vector. >> >> We extend the "`vector_iota_indices`" stubs from byte to other integral types. For floating point vectors, it needs an additional vector cast to get the right iota values. >> >> 2) Compute indexes with "`vec + iota * scale`" >> >> Here is the performance result to the new added micro benchmark on ARM NEON: >> >> Benchmark Gain >> IndexVectorBenchmark.byteIndexVector 1.477 >> IndexVectorBenchmark.doubleIndexVector 5.031 >> IndexVectorBenchmark.floatIndexVector 5.342 >> IndexVectorBenchmark.intIndexVector 5.529 >> IndexVectorBenchmark.longIndexVector 3.177 >> IndexVectorBenchmark.shortIndexVector 5.841 >> >> >> Please help to review and share the feedback! Thanks in advance! > > src/hotspot/share/opto/vectorIntrinsics.cpp line 2978: > >> 2976: case T_DOUBLE: { >> 2977: scale = gvn().transform(new ConvI2LNode(scale)); >> 2978: scale = gvn().transform(new ConvL2DNode(scale)); > > Any specific reason for not directly using ConvI2D for double case. 
Good catch, I think it's ok to use ConvI2D here. I will change this. Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/10332 From chagedorn at openjdk.org Thu Oct 13 07:37:10 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 13 Oct 2022 07:37:10 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level In-Reply-To: References: <5IRZaWg2_zpcI95aHeI77y_QlUwg7-A9zFBHk6qwLL8=.3cbe4abc-3c39-4534-82ab-6af4e5243942@github.com> <61NtkSyfKGssnMw900te0rLvloG-k43PCsY7KqsNiR4=.a9c2f4fe-fb2d-445d-97ea-02b0dc762482@github.com> Message-ID: <3nLwjSVin4fM6Cr-hwO6o2uppYGmdKFfYRfihaYoT6c=.8951bc7f-f1de-49e3-ad1a-ddbdd80e3209@github.com> On Wed, 12 Oct 2022 19:14:12 GMT, Vladimir Kozlov wrote: >> I will move `elapsedTimer` to top. No strong opinion from myself either. >> >>> I would suggest to keep global PrintCompilation code and add variable to mark that line is already printed. Then check it in you code which checks directive to avoid duplicate. >> >> I don't understand this. Doesn't this just print everything? >> >> --- >> >> We can do method specific compilation before the assert. This change seems to be working: >> >> >> diff --git a/src/hotspot/share/compiler/compileBroker.cpp b/src/hotspot/share/compiler/compileBroker.cpp >> index 8d0a87afc17..181dd28ba45 100644 >> --- a/src/hotspot/share/compiler/compileBroker.cpp >> +++ b/src/hotspot/share/compiler/compileBroker.cpp >> @@ -2093,6 +2093,7 @@ CompilerDirectives* DirectivesStack::_bottom = NULL; >> // >> void CompileBroker::invoke_compiler_on_method(CompileTask* task) { >> task->print_ul(); >> + elapsedTimer time; >> >> CompilerThread* thread = CompilerThread::current(); >> ResourceMark rm(thread); >> @@ -2117,11 +2118,16 @@ void CompileBroker::invoke_compiler_on_method(CompileTask* task) { >> // native. The NoHandleMark before the transition should catch >> // any cases where this occurs in the future. 
>> methodHandle method(thread, task->method()); >> - assert(!method->is_native(), "no longer compile natives"); >> >> // Look up matching directives >> directive = DirectivesStack::getMatchingDirective(method, comp); >> task->set_directive(directive); >> + if (task->directive()->PrintCompilationOption) { >> + ResourceMark rm; >> + task->print_tty(); >> + } >> + >> + assert(!method->is_native(), "no longer compile natives"); >> >> // Update compile information when using perfdata. >> if (UsePerfData) { >> @@ -2131,12 +2137,6 @@ void CompileBroker::invoke_compiler_on_method(CompileTask* task) { >> DTRACE_METHOD_COMPILE_BEGIN_PROBE(method, compiler_name(task_level)); >> } >> >> - if (task->directive()->PrintCompilationOption) { >> - ResourceMark rm; >> - task->print_tty(); >> - } >> - elapsedTimer time; >> - >> should_break = directive->BreakAtCompileOption || task->check_break_at_flags(); >> if (should_log && !directive->LogOption) { >> should_log = false; >> >> >> Although there is still possible that there is failed assertion before printing, i.e. when calling `CompilerThread::current()`. I feel like its ok to just look at hs_err in these cases. > > May be we are looking the wrong way. Can we do `task->set_directive(directive)` when we create `task`? > Then you don't need to move PrintCompilation code - `task->directive()` will be available at the beginning of this method. > elapsedTimer time; just declares variable. TraceTime uses it (with start/stop) later to accumulate time. So declaration can be placed anywhere before JVMCI code below. Got it, thanks for the clarification - then it does not matter much where we place this declaration. 
------------- PR: https://git.openjdk.org/jdk/pull/10668 From bulasevich at openjdk.org Thu Oct 13 07:47:09 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 13 Oct 2022 07:47:09 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v4] In-Reply-To: References: Message-ID: On Thu, 22 Sep 2022 21:18:48 GMT, Evgeny Astigeevich wrote: >> Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: >> >> cleanup > > src/hotspot/share/code/compressedStream.cpp line 152: > >> 150: } >> 151: >> 152: int CompressedSparseDataWriteStream::position() { > > The function with a side effect looks strange to me. I see an assert in `DebugInformationRecorder::DebugInformationRecorder(OopRecorder* oop_recorder)` which uses it for checking. So the assert can cause side effects. I am not sure that is expected. Storing the debug info is an iterative process. Chunks of data are compared to avoid duplication, and at some points the generated data is discarded and the position is rolled back. Besides the read/write stream implementation internals, DebugInformationRecorder uses raw stream data access to track similar chunks (see [DIR_Chunk](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/debugInfoRec.cpp#L57)) and to [memcpy](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/nmethod.cpp#L1969) the raw data. We either have to (1) align the data at positions where the DebugInformationRecorder splits data into chunks or (2) take the bit position into account. I experimented with `int position` containing both the stream bit position in the least significant bits and the stream byte position in the most significant bits.
To me, the code becomes less readable and the performance is questionable even without the DebugInformationRecorder update: - uint8_t b1 = _buffer[_position] << _bit_pos; - uint8_t b2 = _buffer[++_position] >> (8 - _bit_pos); + uint8_t b1 = _buffer[_position >> 3] << (_position & 0x7); + _position += 8; + uint8_t b2 = _buffer[_position >> 3] >> (8 - _position & 0x7); I would avoid this change and stay with the current implementation. In fact, there are not many aligned positions within the data. And `assert(_stream->position() > serialized_null, "sanity");` (thanks for noticing that!) in the constructor causes no problem because the data is aligned at the beginning of the stream. ------------- PR: https://git.openjdk.org/jdk/pull/10025 From luhenry at openjdk.org Thu Oct 13 07:49:47 2022 From: luhenry at openjdk.org (Ludovic Henry) Date: Thu, 13 Oct 2022 07:49:47 GMT Subject: RFR: 8295262: Build binutils out of source tree Message-ID: Currently, when passing --with-binutils-src, binutils is built in the source tree. That leads to conflicting targets when compiling for different architectures (ex: amd64 on the host, and riscv64 or aarch64 for the target) from the same jdk source tree. The simplest solution is to build binutils out-of-tree and into the build//binutils folder. These out-of-tree builds are already supported by binutils and only require some changes in the way we invoke the binutils/configure command.
------------- Commit messages: - 8295262: Build binutils out of source tree Changes: https://git.openjdk.org/jdk/pull/10689/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10689&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295262 Stats: 32 lines in 2 files changed: 18 ins; 2 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/10689.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10689/head:pull/10689 PR: https://git.openjdk.org/jdk/pull/10689 From bulasevich at openjdk.org Thu Oct 13 07:51:14 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 13 Oct 2022 07:51:14 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v2] In-Reply-To: References: Message-ID: On Mon, 3 Oct 2022 20:34:56 GMT, Evgeny Astigeevich wrote: >> Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: >> >> - warning fix and name fix >> - optimize the encoding >> - fix >> - 8293170: Improve encoding of the debuginfo nmethod section > > src/hotspot/share/compiler/oopMap.hpp line 377: > >> 375: OopMapValue current() { return _omv; } >> 376: #ifdef ASSERT >> 377: int stream_position() { return _stream.position(); } > > This change is an example that something is wrong with the design. > There is a concrete class `CompressedReadStream` with expected behaviour of `position`: no changes to `_stream`. We have to break this contract to be able to compile `OopMapStream`. 
This change is reverted, as CompressedSparseDataReadStream is a separate class now (reworked to avoid the virtual functions) ------------- PR: https://git.openjdk.org/jdk/pull/10025 From shade at openjdk.org Thu Oct 13 08:10:06 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 13 Oct 2022 08:10:06 GMT Subject: RFR: 8295262: Build binutils out of source tree In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 07:43:33 GMT, Ludovic Henry wrote: > Currently, when passing --with-binutils-src, binutils is built in the source tree. That leads to conflicting targets when compiling for different architectures (ex: amd64 on the host, and riscv64 or aarch64 for the target) from the same jdk source tree. > > The simplest solution is to build binutils out-of-tree and into the build//binutils folder. These out-of-tree builds are already supported by binutils and only require some changes in the way we invoke the binutils/configure command. This generally looks good. I stumbled upon this issue when cross-compiling hsdis too! But we need some cleanups, I think: make/autoconf/lib-hsdis.m4 line 184: > 182: else > 183: binutils_cc="$CC $SYSROOT_CFLAGS" > 184: if test "x$COMPILE_TYPE" = xcross; then I am surprised this PR does not have a merge conflict against the current mainline, which already has this block. Can you please merge fresh master into your PR? src/utils/hsdis/binutils/hsdis-binutils.c line 571: > 569: dinfo->disassembler_options = > 570: disassembler_options != NULL && disassembler_options[0] != '\0' ? > 571: disassembler_options : NULL; This hunk looks irrelevant to the issue? ------------- Changes requested by shade (Reviewer).
PR: https://git.openjdk.org/jdk/pull/10689 From jbhateja at openjdk.org Thu Oct 13 08:18:57 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 13 Oct 2022 08:18:57 GMT Subject: RFR: 8293531: C2: some vectorapi tests fail assert "Not monotonic" with flag -XX:TypeProfileLevel=222 In-Reply-To: References: Message-ID: <4ukg7auIVhF3ivUj1BQuZ2nSS99tAZakzTHFz5E5eIk=.5227a1b3-815a-4578-b1c3-7020f446227e@github.com> On Tue, 11 Oct 2022 20:40:19 GMT, Dean Long wrote: > JDK-8295028 Hi @vnkozlov , since vector API makes aggressive use of abstract vectors chances of an abstract vector parameter type holding a speculative concrete types are higher. ------------- PR: https://git.openjdk.org/jdk/pull/10648 From dnsimon at openjdk.org Thu Oct 13 08:37:19 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 13 Oct 2022 08:37:19 GMT Subject: Integrated: 8295225: [JVMCI] codeStart should be cleared when entryPoint is cleared In-Reply-To: References: Message-ID: On Wed, 12 Oct 2022 17:12:18 GMT, Doug Simon wrote: > When the `InstalledCode.entryPoint` field is [cleared](https://github.com/openjdk/jdk/search?q=set_InstalledCode_entryPoint), the `HotSpotInstalledCode.codeStart` field should also be cleared. That is, when making an nmethod non-entrant, all Java fields pointing to code in the nmethod should be cleared. This avoids an inconsistent view of the code. This pull request has now been integrated. 
Changeset: 03e63a2b Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/03e63a2b87e1bef6025722ec9a016312c55ebd81 Stats: 6 lines in 3 files changed: 6 ins; 0 del; 0 mod 8295225: [JVMCI] codeStart should be cleared when entryPoint is cleared Reviewed-by: never ------------- PR: https://git.openjdk.org/jdk/pull/10682 From luhenry at openjdk.org Thu Oct 13 08:41:35 2022 From: luhenry at openjdk.org (Ludovic Henry) Date: Thu, 13 Oct 2022 08:41:35 GMT Subject: RFR: 8295262: Build binutils out of source tree [v2] In-Reply-To: References: Message-ID: > Currently, when passing --with-binutils-src, binutils is built in the source tree. That leads to conflicting targets when compiling for different architectures (ex: amd64 on the host, and riscv64 or aarch64 for the target) from the same jdk source tree. > > The simplest solution is to build binutils out-of-tree and into the build//binutils folder. These out-of-tree builds are already supported by binutils and only require some changes in the way we invoke the binutils/configure command. Ludovic Henry has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Remove unrelated change - Merge branch 'master' of github.com:openjdk/jdk into dev/ludovic/upstream-hsdis-cross-compile - 8295262: Build binutils out of source tree Currently, when passing --with-binutils-src, binutils is built in the source tree. That leads to conflicting targets when compiling for different architectures (ex: amd64 on the host, and riscv64 or aarch64 for the target) from the same jdk source tree. The simplest solution is to build binutils out-of-tree and into the build//binutils folder. These out-of-tree builds are already supported by binutils and only require some changes in the way we invoke the binutils/configure command. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/10689/files - new: https://git.openjdk.org/jdk/pull/10689/files/b69fdbaa..8faf5083 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10689&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10689&range=00-01 Stats: 14194 lines in 298 files changed: 10379 ins; 2288 del; 1527 mod Patch: https://git.openjdk.org/jdk/pull/10689.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10689/head:pull/10689 PR: https://git.openjdk.org/jdk/pull/10689 From luhenry at openjdk.org Thu Oct 13 08:41:36 2022 From: luhenry at openjdk.org (Ludovic Henry) Date: Thu, 13 Oct 2022 08:41:36 GMT Subject: RFR: 8295262: Build binutils out of source tree [v2] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 07:56:09 GMT, Aleksey Shipilev wrote: >> Ludovic Henry has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Remove unrelated change >> - Merge branch 'master' of github.com:openjdk/jdk into dev/ludovic/upstream-hsdis-cross-compile >> - 8295262: Build binutils out of source tree >> >> Currently, when passing --with-binutils-src, binutils is built in the source tree. That leads to conflicting targets when compiling for different architectures (ex: amd64 on the host, and riscv64 or aarch64 for the target) from the same jdk source tree. >> >> The simplest solution is to build binutils out-of-tree and into the build//binutils folder. These out-of-tree builds are already supported by binutils and only require some changes in the way we invoke the binutils/configure command. > > src/utils/hsdis/binutils/hsdis-binutils.c line 571: > >> 569: dinfo->disassembler_options = >> 570: disassembler_options != NULL && disassembler_options[0] != '\0' ? 
>> 571: disassembler_options : NULL; > > This hunk looks irrelevant to the issue? I needed it to successfully run hsdis. I'll remove it, check if it's still necessary, and potentially submit it in another PR. ------------- PR: https://git.openjdk.org/jdk/pull/10689 From bulasevich at openjdk.org Thu Oct 13 09:38:31 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 13 Oct 2022 09:38:31 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v4] In-Reply-To: References: <3kOvEAlksouNjqXDcn3XNuJj97kx3uhj8UzlmZIYq_o=.517b466d-4577-4c11-b5d9-7709176136cf@github.com> Message-ID: On Tue, 11 Oct 2022 15:17:24 GMT, Evgeny Astigeevich wrote: >> Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: >> >> cleanup > > src/hotspot/share/code/compressedStream.cpp line 198: > >> 196: if (nsize < min_expansion*2) { >> 197: nsize = min_expansion*2; >> 198: } > > We will not need the code if we initialise `_size` to `max2(initial_size, UNSIGNED5::MAX_LENGTH)` in the constructor. I don't think an `initial_size` less than `UNSIGNED5::MAX_LENGTH` makes sense. > > `grow()` is invoked when `_position >= _size`. So there are two cases: > 1. `_position == _size` > 2. `_position > _size` > > `_position < 2 * _size` will be satisfied for case 1. > How do you guarantee `_position < 2 * _size` for case 2? I agree. Theoretically, one can call set_position() to something far outside the buffer capabilities and get the write() to store data far beyond the buffer. As it does not really happen, let me add assert(nsize > _position, "sanity") here. Also, I removed the (nsize < 2 * UNSIGNED5::MAX_LENGTH) check, as the DebugInfo buffer [start size is 10K](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/debugInfoRec.cpp#L130). By the way, 10K seems too big. It covers 999 cases out of 1000.
> src/hotspot/share/code/compressedStream.hpp line 119: > >> 117: u_char* _buffer; >> 118: int _position; // current byte offset >> 119: size_t _byte_pos {0}; // current bit offset > > Is it a bit offset in the byte at `_position`? > `_byte_pos` does not sound clear. Yes. I rename _byte_pos to _bit_pos ------------- PR: https://git.openjdk.org/jdk/pull/10025 From bulasevich at openjdk.org Thu Oct 13 09:42:54 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 13 Oct 2022 09:42:54 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v5] In-Reply-To: References: Message-ID: > The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. > > Testing: jtreg hotspot&jdk, Renaissance benchmarks Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: cleanup and rename ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10025/files - new: https://git.openjdk.org/jdk/pull/10025/files/c2054359..a461a10e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=03-04 Stats: 18 lines in 2 files changed: 0 ins; 3 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/10025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10025/head:pull/10025 PR: https://git.openjdk.org/jdk/pull/10025 From fgao at openjdk.org Thu Oct 13 10:03:24 2022 From: fgao at openjdk.org (Fei Gao) Date: Thu, 13 Oct 2022 10:03:24 GMT Subject: RFR: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov [v3] In-Reply-To: References: Message-ID: > After JDK-8139340, JDK-8192846 and JDK-8289422, we can vectorize > the case below by enabling -XX:+UseCMoveUnconditionally and > 
-XX:+UseVectorCmov:
> 
> // double[] a, double[] b, double[] c;
> for (int i = 0; i < a.length; i++) {
> c[i] = (a[i] > b[i]) ? a[i] : b[i];
> }
> 
> 
> But we don't support the case like:
> 
> // double[] a;
> // int seed;
> for (int i = 0; i < a.length; i++) {
> a[i] = (i % 2 == 0) ? seed + i : seed - i;
> }
> 
> because the IR nodes for the CMoveD in the loop are:
> 
> AddI AndI AddD SubD
> \ / / /
> CmpI / /
> \ / /
> Bool / /
> \ / /
> CMoveD
> 
> 
> and it is not our target pattern, which requires that the inputs
> of the Cmp node must be the same as the inputs of the CMove node
> as commented in CMoveKit::make_cmovevd_pack(). Because
> we can't vectorize the CMoveD pack, we shouldn't vectorize
> its inputs, AddD and SubD. But the current function
> CMoveKit::make_cmovevd_pack() doesn't clear the unqualified
> CMoveD pack from the packset. In this way, superword wrongly
> vectorizes AddD and SubD. Finally, we get a scalar CMoveD
> node with two vector inputs, AddVD and SubVD, which has
> wrongly mixed types, and then the assertion fails.
> 
> To fix it, we need to remove the unvectorized CMoveD pack
> from the packset and clear the related map info.
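For reference, the unsupported loop from the description can be written as a standalone snippet (class and method names are made up for illustration):

```java
// Standalone version of the problematic loop described above. The Cmp inputs
// (i % 2 and 0) differ from the CMove inputs (seed + i and seed - i), so the
// shape does not match SuperWord's CMoveD pattern and must not be partially
// vectorized.
class CMoveDRepro {
    static double[] fill(int seed, int n) {
        double[] a = new double[n];
        for (int i = 0; i < n; i++) {
            a[i] = (i % 2 == 0) ? seed + i : seed - i;
        }
        return a;
    }
}
```

The result alternates between seed + i and seed - i, which makes it easy to check that a fix does not change the computed values.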
Fei Gao has updated the pull request incrementally with one additional commit since the last revision: Clean up the code style ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10627/files - new: https://git.openjdk.org/jdk/pull/10627/files/a113675d..47ca7341 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10627&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10627&range=01-02 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/10627.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10627/head:pull/10627 PR: https://git.openjdk.org/jdk/pull/10627 From fgao at openjdk.org Thu Oct 13 10:03:26 2022 From: fgao at openjdk.org (Fei Gao) Date: Thu, 13 Oct 2022 10:03:26 GMT Subject: RFR: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov [v2] In-Reply-To: <4tYsZFZ4og659mb3lTR-63DgWfw856o_Q5y9LoJp90k=.b9b4fc64-22e7-476e-b5da-69827986fe0a@github.com> References: <4tYsZFZ4og659mb3lTR-63DgWfw856o_Q5y9LoJp90k=.b9b4fc64-22e7-476e-b5da-69827986fe0a@github.com> Message-ID: On Wed, 12 Oct 2022 13:33:42 GMT, Christian Hagedorn wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Refine the function and clean up the code style >> - Merge branch 'master' into fg8293833 >> - 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov >> >> After JDK-8139340, JDK-8192846 and JDK-8289422, we can vectorize >> the case below by enabling -XX:+UseCMoveUnconditionally and >> -XX:+UseVectorCmov: >> ``` >> // double[] a, double[] b, double[] c; >> for (int i = 0; i < a.length; i++) { >> c[i] = (a[i] > b[i]) ? 
a[i] : b[i]; >> } >> ``` >> >> But we don't support the case like: >> ``` >> // double[] a; >> // int seed; >> for (int i = 0; i < a.length; i++) { >> a[i] = (i % 2 == 0) ? seed + i : seed - i; >> } >> ``` >> because the IR nodes for the CMoveD in the loop are: >> ``` >> AddI AndI AddD SubD >> \ / / / >> CmpI / / >> \ / / >> Bool / / >> \ / / >> CMoveD >> ``` >> >> and it is not our target pattern, which requires that the inputs >> of the Cmp node must be the same as the inputs of the CMove node as >> commented in CMoveKit::make_cmovevd_pack(). Because we can't >> vectorize the CMoveD pack, we shouldn't vectorize its inputs, >> AddD and SubD. But the current function >> CMoveKit::make_cmovevd_pack() doesn't clear the unqualified >> CMoveD pack from the packset. In this way, superword wrongly >> vectorizes AddD and SubD. Finally, we get a scalar CMoveD node >> with two vector inputs, AddVD and SubVD, which has wrongly mixed >> types, and then the assertion fails. >> >> To fix it, we need to remove the unvectorized CMoveD pack from >> the packset and clear the related map info. > > Thanks for doing the update, looks good to me! It's indeed much easier to follow the code now. I'll submit some testing. @chhagedorn thanks for your review and comments! I updated the commit to resolve the code style issue. > src/hotspot/share/opto/superword.cpp line 1936: > >> 1934: >> 1935: bool CMoveKit::is_cmove_pack_candidate(Node_List* cmove_pk) { >> 1936: Node *cmove = cmove_pk->at(0); > > Suggestion: > > Node* cmove = cmove_pk->at(0); Done. Thanks.
This is helpful in applications that have multiple consecutive "eor" operations, which can be reduced by combining them into fewer operations using the "eor3" instruction. For example -
> 
> eor a, a, b
> eor a, a, c
> 
> can be optimized to a single instruction - `eor3 a, b, c`
> 
> This patch adds backend rules for the Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports the Neon, SVE2 and SHA3 features -
> 
> 
> Benchmark gain
> TestEor3.test1Int 10.87%
> TestEor3.test1Long 8.84%
> TestEor3.test2Int 21.68%
> TestEor3.test2Long 21.04%
> 
> 
> The numbers shown are performance gains from using the Neon eor3 instruction over the master branch, which uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well, since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits, which makes the SVE2 code generation very similar to the one with Neon.
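The scalar shape of the pattern the backend rule targets can be sketched like this (illustrative Java with made-up names; the actual benchmark operates on Vector API vectors):

```java
// Two dependent XORs over three inputs are exactly what a single eor3
// instruction can compute: eor3 a, b, c == a ^ b ^ c.
class Eor3Pattern {
    static int eor3(int a, int b, int c) {
        int t = a ^ b;  // eor a, a, b
        return t ^ c;   // eor a, a, c
    }
}
```

Since XOR is associative and commutative, the fused form is semantically identical to the two-instruction sequence, which is what makes the peephole-style matching rule safe.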
Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Modified JTREG test to include feature constraints ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10407/files - new: https://git.openjdk.org/jdk/pull/10407/files/b2de6107..6df4f014 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10407&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10407&range=00-01 Stats: 8 lines in 1 file changed: 0 ins; 5 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/10407.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10407/head:pull/10407 PR: https://git.openjdk.org/jdk/pull/10407 From bkilambi at openjdk.org Thu Oct 13 10:12:44 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 13 Oct 2022 10:12:44 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v2] In-Reply-To: References: Message-ID: <9_Z5FH23oQ3RdzoP-FiwjNOTILIE_z6iNY___tz9g30=.ed02336f-c9c0-4c15-9d62-1a5a0600fdda@github.com> On Wed, 12 Oct 2022 10:46:17 GMT, Hao Sun wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Modified JTREG test to include feature constraints > > test/hotspot/jtreg/compiler/vectorization/TestEor3AArch64.java line 38: > >> 36: * @summary Test EOR3 Neon/SVE2 instruction for aarch64 SHA3 extension >> 37: * @library /test/lib / >> 38: * @requires os.arch == "aarch64" & vm.cpu.features ~=".*sha3.*" > > Suggestion: > > * @requires os.arch == "aarch64" & vm.cpu.features ~= ".*sha3.*" > > nit: [style] it's better to have one extra space. @shqking Thank you for the comments. I have made the suggested changes. 
------------- PR: https://git.openjdk.org/jdk/pull/10407 From duke at openjdk.org Thu Oct 13 10:47:04 2022 From: duke at openjdk.org (Mkkebe) Date: Thu, 13 Oct 2022 10:47:04 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v5] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 09:42:54 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > cleanup and rename Marked as reviewed by Mkkebe at github.com (no known OpenJDK username). ------------- PR: https://git.openjdk.org/jdk/pull/10025 From shade at openjdk.org Thu Oct 13 11:52:09 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 13 Oct 2022 11:52:09 GMT Subject: RFR: 8295262: Build binutils out of source tree [v2] In-Reply-To: References: Message-ID: <7LbdPchdt77O0Bh8XQW-x7jNefCcOVCxbXp6VeEufKg=.6e8c1095-454e-47ce-b071-f76915bf274e@github.com> On Thu, 13 Oct 2022 08:41:35 GMT, Ludovic Henry wrote: >> Currently, when passing --with-binutils-src, binutils is built in the source tree. That leads to conflicting targets when compiling for different architectures (ex: amd64 on the host, and riscv64 or aarch64 for the target) from the same jdk source tree. >> >> The simplest solution is to build binutils out-of-tree and into the build//binutils folder. These out-of-tree builds are already supported by binutils and only require some changes in the way we invoke the binutils/configure command. > > Ludovic Henry has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Remove unrelated change > - Merge branch 'master' of github.com:openjdk/jdk into dev/ludovic/upstream-hsdis-cross-compile > - 8295262: Build binutils out of source tree > > Currently, when passing --with-binutils-src, binutils is built in the source tree. That leads to conflicting targets when compiling for different architectures (ex: amd64 on the host, and riscv64 or aarch64 for the target) from the same jdk source tree. > > The simplest solution is to build binutils out-of-tree and into the build//binutils folder. These out-of-tree builds are already supported by binutils and only require some changes in the way we invoke the binutils/configure command. This looks good to me. I tested hsdis builds on these platforms and they pass: server-release-aarch64-linux-gnu server-release-arm-linux-gnueabihf server-release-i686-linux-gnu server-release-powerpc64le-linux-gnu server-release-powerpc64-linux-gnu- server-release-riscv64-linux-gnu server-release-s390x-linux-gnu server-release-x86_64-linux-gnu zero-release-alpha-linux-gnu zero-release-arm-linux-gnueabi zero-release-arm-linux-gnueabihf zero-release-m68k-linux-gnu zero-release-mips64el-linux-gnuabi64 zero-release-mipsel-linux-gnu zero-release-powerpc-linux-gnu zero-release-sh4-linux-gnu zero-release-sparc64-linux-gnu ------------- Marked as reviewed by shade (Reviewer). PR: https://git.openjdk.org/jdk/pull/10689 From haosun at openjdk.org Thu Oct 13 12:01:08 2022 From: haosun at openjdk.org (Hao Sun) Date: Thu, 13 Oct 2022 12:01:08 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v2] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 10:12:42 GMT, Bhavana Kilambi wrote: >> Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. 
This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - >> >> eor a, a, b >> eor a, a, c >> >> can be optimized to single instruction - `eor3 a, b, c` >> >> This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - >> >> >> Benchmark gain >> TestEor3.test1Int 10.87% >> TestEor3.test1Long 8.84% >> TestEor3.test2Int 21.68% >> TestEor3.test2Long 21.04% >> >> >> The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Modified JTREG test to include feature constraints LGTM (I'm not a Reviewer). ------------- Marked as reviewed by haosun (Author). 
PR: https://git.openjdk.org/jdk/pull/10407 From shade at openjdk.org Thu Oct 13 12:11:02 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 13 Oct 2022 12:11:02 GMT Subject: RFR: 8295268: Optimized builds are broken due to incorrect assert_is_rfp shortcuts Message-ID: Fails on many platforms, for example arm32: ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/10696/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10696&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295268 Stats: 5 lines in 5 files changed: 0 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/10696.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10696/head:pull/10696 PR: https://git.openjdk.org/jdk/pull/10696 From chagedorn at openjdk.org Thu Oct 13 12:45:50 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 13 Oct 2022 12:45:50 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases Message-ID: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> This patch extends the IR framework with the capability to perform IR matching not only on the `PrintIdeal` and `PrintOptoAssembly` flag outputs but also on the `PrintIdeal` output of the different compile phases defined by the `COMPILER_PHASES` macro in `phasetype.hpp`: https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/opto/phasetype.hpp#L28-L29 The IR framework uses the same compile phases with the same names (it only drops those which are part of a normal execution like `PHASE_DEBUG` or `PHASE_FAILURE`). A matching on the `PrintIdeal` and `PrintOptoAssembly` flags is still possible. The IR framework creates an own `CompilePhase` enum entry for them to simplify the implementation. ## How does it work? 
### Basic idea There is a new `phase` attribute for `@IR` annotations that allows the user to specify a list of compile phases on which the IR rule should be applied on: int iFld; @Test @IR(counts = {IRNode.STORE_I, "1"}, phase = {CompilePhase.AFTER_PARSING, // Fails CompilePhase.ITER_GVN1}) // Works public void optimizeStores() { iFld = 42; iFld = 42 + iFld; // Removed in first IGVN iteration and replaced by iFld = 84 } In this example, we apply the IR rule on the compile phases `AFTER_PARSING` and `ITER_GVN1`. Since the first store to `iFld` is only removed in the first IGVN iteration, the IR rule fails for `AFTER_PARSING` while it passes for `ITER_GVN1`: 1) Method "public void ir_framework.examples.IRExample.optimizeStores()" - [Failed IR rules: 1]: * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING, ITER_GVN1}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#STORE_I#_", "1"}, failOn={}, applyIfAnd={}, applyIfOr={}, applyIfNot={})" > Phase "After Parsing": - counts: Graph contains wrong number of nodes: * Constraint 1: "(\d+(\s){2}(StoreI.*)+(\s){2}===.*)" - Failed comparison: [found] 2 = 1 [given] - Matched nodes (2): * 25 StoreI === 5 7 24 21 [[ 31 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:3 (line 123) * 31 StoreI === 5 25 24 29 [[ 16 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:14 (line 124) More examples are shown in `IRExample.java` and also in the new test `TestPhaseIRMatching.java`. ### CompilePhase.DEFAULT - default compile phase The existing IR tests either match on the `PrintIdeal` and/or `PrintOptoAssembly` flag. 
Looking closer at the individual `@IR` rules, we can see that a single regex only matches **either** on `PrintIdeal` **or** on `PrintOptoAssembly` but never on both. Most of these regexes are taken from the class `IRNode`, which currently provides default regexes for various C2 IR nodes (a default regex either matches on `PrintIdeal` or `PrintOptoAssembly`). Not only the existing IR tests but also the majority of future tests do not need this new flexibility - they simply want to match on the `PrintIdeal` flag by default. To avoid having to specify `phase = CompilePhase.PRINT_IDEAL` for each new rule (and also to avoid updating the large number of existing IR tests), we introduce a new compile phase `CompilePhase.DEFAULT`, which is used by default if the user does not specify the `phase` attribute. Each entry in the class `IRNode` now needs to define a default phase such that the IR framework knows which compile phase an `IRNode` should be matched on if `CompilePhase.DEFAULT` is selected.

### Different regexes for the same IRNode entry

A regex for an IR node might look different for certain compile phases. For example, we can directly match the node name `Allocate` before macro expansion for an `Allocate` node. But we are only able to match allocations again on the `PrintOptoAssembly` output (the current option), which requires a different regex. Therefore, the `IRNode` class is redesigned in the following way:

- `IRNode` entries which currently represent direct regexes are replaced by "IR node placeholder strings". These strings are just encodings recognized by the IR framework as IR nodes:

public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; // Normal IR node
public static final String ALLOC_OF = COMPOSITE_PREFIX + "ALLOC_OF" + POSTFIX; // Composite IR node

- For each _IR node placeholder string_, we need to define which regex should be used for which compile phase. Additionally, we need to specify the default compile phase for this IR node.
This is done in a static block immediately following the IR node placeholder string, where we add a mapping to `IRNode.IR_NODE_MAPPINGS` (by using helper methods such as `allocNodes()`):

public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX;
static {
String idealIndependentRegex = START + "Allocate" + MID + END;
String optoRegex = "(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java" + END;
allocNodes(ALLOC, idealIndependentRegex, optoRegex);
}

**Thus, adding a new IR node now requires two steps: Defining an IR node placeholder string and an associated regex-compile-phase-mapping in the static block immediately following the IR node placeholder string. This mapping should reflect in which compile phases the new IR node can be found.**

### Using the IRNode entries correctly

The IR framework enforces the correct usage of the new IR node placeholder strings. It reports format violations if:

- An `IRNode` entry is used for a compile phase which is not supported (e.g. trying to match a `LoopNode` on the output of `CompilePhase.AFTER_PARSING`).
- An IR node entry is used without specifying a mapping in a static block (e.g. when creating a new entry and forgetting to set up the mapping).
- Using a user-defined regex in `@IR` without specifying a non-default compile phase in `phase`. `CompilePhase.DEFAULT` only works with IR node placeholder strings. In all other cases, the user must explicitly specify one or more non-default compile phases.
- Updated existing, now failing, IR tests that use user-defined regexes. These now require a non-default phase. I've fixed that by replacing the user-defined regexes with `IRNode` entries (required some new `IRNode` entries for missing IR nodes). - Introduced a new interface `Matchable.java` (for all classes representing a part on which IR matching can be done such as `IRMethod` or `IRRule`) to use it throughout the code instead of concrete class references. This allows better code sharing together with the already added `MatchResult.java` interface. Using interface types also simplifies testing as everything is substitutable. Each `Matchable`/`MatchResult` object now just contains a list of `Matchable`/`MatchResult` objects (e.g. an `IRMethod`/`IRMethodResult` contains a list of `IRRule`/`IRRuleMatchResult` objects to represent all IR rules/IR rule match results of this method etc.) - Cleaned up and refactored a lot of code to use this new design. - Using visitor pattern to visit the `MatchResult` objects. This simplifies the failure message and compilation output printing of the different compile phases. It allows us to perform specific operations at each level (i.e. on a method level, an IR rule level etc.). Visitors are also used in testing. - New compile phase related classes in package `phase`. The idea is that each `IRRule`/`IRRuleMatchResult` class contains a list of `CompilePhaseIRRule`/`CompilePhaseIRRuleMatchResult` objects for each compile phase. When `CompilePhase.DEFAULT` is used, we create or re-use `CompilePhaseIRRule` objects which represent the specified default phases of each `IRNode` entry. - The `FlagVM` now collects all the needed compile phases and creates a compiler directives file for the `TestVM` accordingly. We now only print the output which is also used for IR matching. Before this change, we always emitted `PrintIdeal` and `PrintOptoAssembly`. 
Even when not using the new matching on compile phases, we still get a performance benefit when only using `IRNode` entries which match on either `PrintIdeal` or `PrintOptoAssembly`. - The failure message and compilation output is sorted alphabetically by method names and by the enum definition order of `CompilerPhase`. - Replaced implementation inheritance by interfaces. - Improved encapsulation of object data. - Updated README and many comments/class descriptions to reflect this new feature. - Added new IR framework tests ## Testing - Normal tier testing. - Applying the patch to Valhalla to perform tier testing. - Testing with a ZGC branch by @robcasloz which does IR matching on mach graph compile phases. This requirement was the original motivation for this RFE. Thanks Roberto for helping to test this! Thanks, Christian ------------- Commit messages: - Merge branch 'master' into JDK-8280378 - Fix missing counts indentation in failure messages - Update comments - Fix test failures - Finish reviewing own code - Refactor mapping classes - Apply more cleanups while reviewing own code - Update compile phases and phasetype.hpp - Refactoring raw check attribute parsing classes and actions - Create Compilation class and move the compilation output map out of the IRMethod class, update visitors accordingly - ... 
and 66 more: https://git.openjdk.org/jdk/compare/7e4868de...347f26e1

Changes: https://git.openjdk.org/jdk/pull/10695/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10695&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8280378
Stats: 9416 lines in 150 files changed: 7120 ins; 1572 del; 724 mod
Patch: https://git.openjdk.org/jdk/pull/10695.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/10695/head:pull/10695

PR: https://git.openjdk.org/jdk/pull/10695

From tanksherman27 at gmail.com Thu Oct 13 13:03:13 2022
From: tanksherman27 at gmail.com (Julian Waters)
Date: Thu, 13 Oct 2022 21:03:13 +0800
Subject: Point where code installed by C1 and C2 is loaded?
In-Reply-To: <50c72ad1-6961-4fc6-2fea-cf8e1f32e87b@oracle.com>
References: <50c72ad1-6961-4fc6-2fea-cf8e1f32e87b@oracle.com>
Message-ID: 

Hi Vladimir, thanks for the reply!

Looking at set_code it seems to me like the redirection is done in

mh->_from_compiled_entry = code->verified_entry_point();

while

mh->_from_interpreted_entry = mh->get_i2c_entry();

is done to make execution jump to compiled code if it was running in the interpreter prior, did I get that right? Seems a little strange to me that _code is also stored in the Method if it's not actually used in execution, but I guess it's probably used for other reasons. Given the comments accompanying _from_compiled_entry and _from_interpreted_entry:

// Entry point for calling from compiled code, to compiled code if it exists
// or else the interpreter.

// Cache of _code ? _adapter->i2c_entry() : _i2i_entry
// Cache of: _code ? _code->entry_point() : _adapter->c2i_entry()

If I understand this correctly, if _code != NULL, _from_interpreted_entry would be _adapter->i2c_entry(), which is(?) a CodeBlob containing assembly to transfer control to the associated nmethod, and vice versa for _from_compiled_entry?
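The caching those comments describe can be modeled in a few lines (illustrative Java with made-up names; the real fields are the C++ ones in method.hpp):

```java
// Simplified model of the entry-point caching described in method.hpp.
// "code" stands for the installed nmethod; the i2c/c2i entries stand for the
// adapter stubs that bridge interpreted and compiled calling conventions.
class MethodModel {
    Object code; // nmethod, or null if the method is not compiled

    // Cache of: code != null ? adapter i2c entry : interpreter entry
    String fromInterpretedEntry() {
        return code != null ? "i2c_entry" : "i2i_entry";
    }

    // Cache of: code != null ? nmethod entry point : adapter c2i entry
    String fromCompiledEntry() {
        return code != null ? "verified_entry_point" : "c2i_entry";
    }
}
```

Installing an nmethod (setting code) flips both cached entries, which is the redirection Method::set_code() performs.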
I am somewhat curious about where these are actually loaded during frame creation, and how the runtime would know whether what it just received were the contents of an nmethod that could be executed immediately, or a method running in the interpreter that needs to be decoded, dispatched, etc. I appreciate the response explaining method invocation! best regards, Julian On Thu, Oct 13, 2022 at 1:13 AM Vladimir Ivanov < vladimir.x.ivanov at oracle.com> wrote: > At runtime, every method is invoked either through > Method::_from_compiled_entry or Method::_from_interpreted_entry > (depending on where the call is performed from). > > As part of nmethod installation (ciEnv::register_method()), entry points > of the relevant method (the root of the compilation) are updated (by > Method::set_code()) and Method::_from_compiled_entry starts to point to > the entry point of the freshly installed nmethod. > > (Also, there is a special case of OSR compilation, but I don't cover it > here.) > > src/hotspot/share/oops/method.hpp [1]: > > class Method : public Metadata { > ... > // Entry point for calling from compiled code, to compiled code if it > exists > // or else the interpreter. > volatile address _from_compiled_entry; // Cache of: _code ? > _code->entry_point() : _adapter->c2i_entry() > // The entry point for calling both from and to compiled code is > // "_code->entry_point()". Because of tiered compilation and de-opt, > this > // field can come and go. It can transition from NULL to not-null at > any > // time (whenever a compile completes). It can transition from > not-null to > // NULL only at safepoints (because of a de-opt). > CompiledMethod* volatile _code; // Points to > the corresponding piece of native code > volatile address _from_interpreted_entry; // Cache of _code > ? 
_adapter->i2c_entry() : _i2i_entry > > Hp > > Best regards, > Vladimir Ivanov > > [1] > > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/oops/method.hpp#L109-L118 > > On 10/12/22 08:18, Julian Waters wrote: > > Hi all, > > > > I apologise if this is a silly question, but where exactly is code > > installed by C1 and C2 loaded and executed by the runtime? I've tried > > looking through the entirety of hotspot, but haven't found anything that > > seems to be related. All I can surmise is that the compiler interface > > ultimately creates an nmethod that allocates itself on the CodeCache > > using a CodeBuffer containing C1 or C2 emitted instructions, and then > > Method::set_code sets the method's _code to reference that entry in the > > cache (or method->method_holder()->add_osr_nmethod(nm); is called in > > other circumstances, I don't quite understand what the difference is but > > I assume the end result is probably the same). Given my rudimentary > > understanding of hotspot's execution pipeline I'd assume that when a new > > frame (frame.hpp) is created, the frame's code blob would be set to > > reference the nmethod in the method that was called, or otherwise > > somehow jump back to the interpreter if that method hasn't been compiled > > yet. But there doesn't seem to be any point where method->code() is > > called to load the instructions emitted by either C1 or C2 into a frame, > > so where does that happen? > > > > I guess this is probably more a question of how hotspot runs loaded > > programs in general, which seems to me at a glance like it's chaining > > assembly in CodeBlobs together and jumping to the next blob/codelet (in > > the next frame?) 
when it's finished, but I can't really figure out where > > those codelets are set for each frame, or how it chooses between one > > compiled by C1 or C2, and the handwritten assembly codelets that make up > > the interpreter (or for that matter how it even finds the correct > > interpreter codelet). > > > > I appreciate any help with this query, sorry if this isn't the correct > > list to post this question to, but it seemed like the most appropriate. > > > > best regards, > > Julian > -------------- next part -------------- An HTML attachment was scrubbed... URL: From erikj at openjdk.org Thu Oct 13 13:36:19 2022 From: erikj at openjdk.org (Erik Joelsson) Date: Thu, 13 Oct 2022 13:36:19 GMT Subject: RFR: 8295262: Build binutils out of source tree [v2] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 08:41:35 GMT, Ludovic Henry wrote: >> Currently, when passing --with-binutils-src, binutils is built in the source tree. That leads to conflicting targets when compiling for different architectures (ex: amd64 on the host, and riscv64 or aarch64 for the target) from the same jdk source tree. >> >> The simplest solution is to build binutils out-of-tree and into the build//binutils folder. These out-of-tree builds are already supported by binutils and only require some changes in the way we invoke the binutils/configure command. > > Ludovic Henry has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Remove unrelated change > - Merge branch 'master' of github.com:openjdk/jdk into dev/ludovic/upstream-hsdis-cross-compile > - 8295262: Build binutils out of source tree > > Currently, when passing --with-binutils-src, binutils is built in the source tree. 
That leads to conflicting targets when compiling for different architectures (ex: amd64 on the host, and riscv64 or aarch64 for the target) from the same jdk source tree. > > The simplest solution is to build binutils out-of-tree and into the build//binutils folder. These out-of-tree builds are already supported by binutils and only require some changes in the way we invoke the binutils/configure command. Marked as reviewed by erikj (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10689 From shade at openjdk.org Thu Oct 13 15:12:36 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 13 Oct 2022 15:12:36 GMT Subject: RFR: 8295262: Build binutils out of source tree [v2] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 08:41:35 GMT, Ludovic Henry wrote: >> Currently, when passing --with-binutils-src, binutils is built in the source tree. That leads to conflicting targets when compiling for different architectures (ex: amd64 on the host, and riscv64 or aarch64 for the target) from the same jdk source tree. >> >> The simplest solution is to build binutils out-of-tree and into the build//binutils folder. These out-of-tree builds are already supported by binutils and only require some changes in the way we invoke the binutils/configure command. > > Ludovic Henry has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Remove unrelated change > - Merge branch 'master' of github.com:openjdk/jdk into dev/ludovic/upstream-hsdis-cross-compile > - 8295262: Build binutils out of source tree > > Currently, when passing --with-binutils-src, binutils is built in the source tree. That leads to conflicting targets when compiling for different architectures (ex: amd64 on the host, and riscv64 or aarch64 for the target) from the same jdk source tree. 
> > The simplest solution is to build binutils out-of-tree and into the build//binutils folder. These out-of-tree builds are already supported by binutils and only require some changes in the way we invoke the binutils/configure command. I think we want to ask @magicus about this as well, maybe he will discover some unusual quirks in this code. Otherwise, I'll sponsor. ------------- PR: https://git.openjdk.org/jdk/pull/10689 From qamai at openjdk.org Thu Oct 13 16:53:45 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 13 Oct 2022 16:53:45 GMT Subject: RFR: 8294865: x86: Improve the code generation of MulVB and MulVL [v3] In-Reply-To: References: Message-ID: > Hi, > > This patch simplifies and improves the code generation of `MulVB` and `MulVL` nodes, > > - MulVB can be implemented by alternating `vmullw` on odd and even-index elements and combining the results. > - MulVL can be implemented on non-avx512dq by computing the product of each 32-bit half and adding the results together. > > Vector API benchmark shows the results of `MUL` operations: > > Before After > Benchmark (size) Mode Cnt Score Error Score Error Units Change > Byte64Vector.MUL 1024 thrpt 15 8948.607 ± 194.646 8860.404 ± 203.109 ops/ms -0.99% > Byte128Vector.MUL 1024 thrpt 15 12915.839 ± 291.262 13554.662 ± 488.695 ops/ms +4.95% > Byte256Vector.MUL 1024 thrpt 15 12129.959 ± 245.710 23279.276 ± 669.725 ops/ms +91.92% > Long128Vector.MUL 1024 thrpt 15 1183.663 ± 36.440 1489.892 ± 35.356 ops/ms +25.87% > Long256Vector.MUL 1024 thrpt 15 1911.802 ± 95.304 2834.088 ± 77.647 ops/ms +48.24% > > Please have a look and have some reviews, thank you very much. Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase.
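(Aside for readers following along: the 32-bit-half decomposition used for `MulVL` on targets without AVX512DQ can be sanity-checked with a scalar sketch in plain Java. This is illustrative only — the patch itself emits vector instructions, and the class and method names below are invented.)

```java
public class MulViaHalves {
    // 64-bit multiply from 32-bit half products: a*b mod 2^64 equals
    // aLo*bLo + ((aLo*bHi + aHi*bLo) << 32); the aHi*bHi term only
    // contributes to bits >= 64 and therefore drops out.
    static long mulViaHalves(long a, long b) {
        long aLo = a & 0xFFFFFFFFL, aHi = a >>> 32;
        long bLo = b & 0xFFFFFFFFL, bHi = b >>> 32;
        return aLo * bLo + ((aLo * bHi + aHi * bLo) << 32);
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, -1L, -7L, 123456789L, Long.MAX_VALUE, Long.MIN_VALUE};
        for (long x : samples) {
            for (long y : samples) {
                if (mulViaHalves(x, y) != x * y) {
                    throw new AssertionError(x + " * " + y);
                }
            }
        }
        System.out.println("ok");
    }
}
```

The vectorized version presumably applies the same identity per lane using 32-bit vector multiplies.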
The pull request now contains eight commits: - Merge branch 'master' into improveMulVB - refactor conditions - add vmulB for 8 bytes - Merge branch 'master' into improveMulVB - Merge branch 'master' into improveMulVB - Merge branch 'master' into improveMulVB - fix - mulV ------------- Changes: https://git.openjdk.org/jdk/pull/10571/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10571&range=02 Stats: 170 lines in 5 files changed: 15 ins; 64 del; 91 mod Patch: https://git.openjdk.org/jdk/pull/10571.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10571/head:pull/10571 PR: https://git.openjdk.org/jdk/pull/10571 From vladimir.x.ivanov at oracle.com Thu Oct 13 17:12:59 2022 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 13 Oct 2022 10:12:59 -0700 Subject: Point where code installed by C1 and C2 is loaded? In-Reply-To: References: <50c72ad1-6961-4fc6-2fea-cf8e1f32e87b@oracle.com> Message-ID: <71c8b4e2-c3ae-8fd3-f798-f98bee4eddd4@oracle.com> > Looking at set_code it seems to me like the redirection is done in > > mh->_from_compiled_entry = code->verified_entry_point(); > > while > > mh->_from_interpreted_entry = mh->get_i2c_entry(); > > is done to make execution jump to compiled code if it was running in the > interpreter prior, did I get that right? Seems a little strange to me > that _code is also stored in the Method if it's not actually used in > execution, but I guess it's probably used for other reasons. Method::code() is extensively used to get the nmethod associated with the particular Method instance. It's possible to derive the same information from Method::_from_compiled_entry, but it's much easier to just keep a pointer to the nmethod itself. > comments accompanying _from_compiled_entry and _from_interpreted_entry: > > // Entry point for calling from compiled code, to compiled code if it exists > // or else the interpreter. > > // Cache of _code ? _adapter->i2c_entry() : _i2i_entry > // Cache of: _code ?
_code->entry_point() : _adapter->c2i_entry() > > If I understand this correctly, if _code != NULL, > _from_interpreted_entry would be _adapter->i2c_entry(), which is(?) a > CodeBlob containing assembly to transfer control to the associated > nmethod, and vice versa for _from_compiled_entry? I am somewhat curious > about where these are actually loaded during frame creation, and how the > runtime would know whether what it just received were the contents of > an nmethod that could be executed immediately, or a method running in > the interpreter that needs to be decoded, dispatched, etc. I'm not sure I fully understood your question. Interpreter and compiled code have different calling conventions, that's the reason why they use distinct entry points. i2c/c2i are frameless adapters which perform conversion between those calling conventions as part of invocation. See SharedRuntime::generate_i2c2i_adapters() for details [1]. Best regards, Vladimir Ivanov [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/sharedRuntime.hpp#L394-L426 > > I appreciate the response explaining method invocation! > > best regards, > Julian > > On Thu, Oct 13, 2022 at 1:13 AM Vladimir Ivanov > > wrote: > > At runtime, every method is invoked either through > Method::_from_compiled_entry or Method::_from_interpreted_entry > (depending on where the call is performed from). > > As part of nmethod installation (ciEnv::register_method()), entry > points > of the relevant method (the root of the compilation) are updated (by > Method::set_code()) and Method::_from_compiled_entry starts to point to > the entry point of the freshly installed nmethod. > > (Also, there is a special case of OSR compilation, but I don't cover it > here.) > > src/hotspot/share/oops/method.hpp [1]: > > class Method : public Metadata { > ... > // Entry point for calling from compiled code, to compiled code > if it > exists > // or else the interpreter. > volatile address _from_compiled_entry; 
// Cache of: _code ? > _code->entry_point() : _adapter->c2i_entry() > // The entry point for calling both from and to compiled code is > // "_code->entry_point()". Because of tiered compilation and > de-opt, > this > // field can come and go. It can transition from NULL to > not-null at any > // time (whenever a compile completes). It can transition from > not-null to > // NULL only at safepoints (because of a de-opt). > CompiledMethod* volatile _code; // Points to > the corresponding piece of native code > volatile address _from_interpreted_entry; // Cache of > _code > ? _adapter->i2c_entry() : _i2i_entry > > Best regards, > Vladimir Ivanov > > [1] > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/oops/method.hpp#L109-L118 > > On 10/12/22 08:18, Julian Waters wrote: > > Hi all, > > > > I apologise if this is a silly question, but where exactly is code > > installed by C1 and C2 loaded and executed by the runtime? I've tried > > looking through the entirety of hotspot, but haven't found anything that > > seems to be related. All I can surmise is that the compiler interface > > ultimately creates an nmethod that allocates itself on the CodeCache > > using a CodeBuffer containing C1 or C2 emitted instructions, and then > > Method::set_code sets the method's _code to reference that entry in the > > cache (or method->method_holder()->add_osr_nmethod(nm); is called in > > other circumstances, I don't quite understand what the difference is but > > I assume the end result is probably the same). Given my rudimentary > > understanding of hotspot's execution pipeline I'd assume that when a new > > frame (frame.hpp) is created, the frame's code blob would be set to > > reference the nmethod in the method that was called, or otherwise > > somehow jump back to the interpreter if that method hasn't been compiled > > yet. 
But there doesn't seem to be any point where method->code() is > > called to load the instructions emitted by either C1 or C2 into a > frame, > > so where does that happen? > > > > I guess this is probably more a question of how hotspot runs loaded > > programs in general, which seems to me at a glance like it's > chaining > > assembly in CodeBlobs together and jumping to the next > blob/codelet (in > > the next frame?) when it's finished, but I can't really figure > out where > > those codelets are set for each frame, or how it chooses between one > > compiled by C1 or C2, and the handwritten assembly codelets that > make up > > the interpreter (or for that matter how it even finds the correct > > interpreter codelet). > > > > I appreciate any help with this query, sorry if this isn't the > correct > > list to post this question to, but it seemed like the most > appropriate. > > > > best regards, > > Julian > From vlivanov at openjdk.org Thu Oct 13 18:14:02 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 13 Oct 2022 18:14:02 GMT Subject: RFR: 8294865: x86: Improve the code generation of MulVB and MulVL [v3] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 16:53:45 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch simplifies and improves the code generation of `MulVB` and `MulVL` nodes, >> >> - MulVB can be implemented by alternating `vmullw` on odd and even-index elements and combining the results. >> - MulVL can be implemented on non-avx512dq by computing the product of each 32-bit half and adding the results together. >> >> Vector API benchmark shows the results of `MUL` operations: >> >> Before After >> Benchmark (size) Mode Cnt Score Error Score Error Units Change >> Byte64Vector.MUL 1024 thrpt 15 8948.607 ± 194.646 8860.404 ± 203.109 ops/ms -0.99% >> Byte128Vector.MUL 1024 thrpt 15 12915.839 ± 291.262 13554.662 ± 488.695 ops/ms +4.95% >> Byte256Vector.MUL 1024 thrpt 15 12129.959 ± 245.710 23279.276 ± 
669.725 ops/ms +91.92% >> Long128Vector.MUL 1024 thrpt 15 1183.663 ± 36.440 1489.892 ± 35.356 ops/ms +25.87% >> Long256Vector.MUL 1024 thrpt 15 1911.802 ± 95.304 2834.088 ± 77.647 ops/ms +48.24% >> >> Please have a look and have some reviews, thank you very much. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: > > - Merge branch 'master' into improveMulVB > - refactor conditions > - add vmulB for 8 bytes > - Merge branch 'master' into improveMulVB > - Merge branch 'master' into improveMulVB > - Merge branch 'master' into improveMulVB > - fix > - mulV Looks good. Marked as reviewed by vlivanov (Reviewer). src/hotspot/cpu/x86/matcher_x86.hpp line 195: > 193: return 0; > 194: case Op_MulVB: > 195: return 7; Why do you unconditionally return `7` here? Is it because AVX512 doesn't feature a vector instruction to multiply byte vectors? ------------- PR: https://git.openjdk.org/jdk/pull/10571 From kvn at openjdk.org Thu Oct 13 19:17:43 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Oct 2022 19:17:43 GMT Subject: RFR: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov [v3] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 10:03:24 GMT, Fei Gao wrote: >> After JDK-8139340, JDK-8192846 and JDK-8289422, we can vectorize >> the case below by enabling -XX:+UseCMoveUnconditionally and >> -XX:+UseVectorCmov: >> >> // double[] a, double[] b, double[] c; >> for (int i = 0; i < a.length; i++) { >> c[i] = (a[i] > b[i]) ? a[i] : b[i]; >> } >> >> >> But we don't support the case like: >> >> // double[] a; >> // int seed; >> for (int i = 0; i < a.length; i++) { >> a[i] = (i % 2 == 0) ? 
seed + i : seed - i; >> } >> >> because the IR nodes for the CMoveD in the loop are: >> >> AddI AndI AddD SubD >> \ / / / >> CmpI / / >> \ / / >> Bool / / >> \ / / >> CMoveD >> >> >> and it is not our target pattern, which requires that the inputs >> of Cmp node must be the same as the inputs of CMove node >> as commented in CMoveKit::make_cmovevd_pack(). Because >> we can't vectorize the CMoveD pack, we shouldn't vectorize >> its inputs, AddD and SubD. But the current function >> CMoveKit::make_cmovevd_pack() doesn't clear the unqualified >> CMoveD pack from the packset. In this way, superword wrongly >> vectorizes AddD and SubD. Finally, we get a scalar CMoveD >> node with two vector inputs, AddVD and SubVD, which has >> wrong mixing types, and the assertion fails. >> >> To fix it, we need to remove the unvectorized CMoveD pack >> from the packset and clear related map info. > > Fei Gao has updated the pull request incrementally with one additional commit since the last revision: > > Clean up the code style Looks reasonable. Does `compiler/c2/irTests/TestVectorConditionalMove.java` IR test cover this case? Can you add one if it is not covered already?
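(For anyone who wants to reproduce the failing shape, a minimal standalone version of the loop from the report might look as follows. The JVM flags are the ones named in the bug; the class and method names are invented, and on a build without the fix the assertion failure only appears once C2 compiles the loop.)

```java
public class CMoveRepro {
    // Run with: java -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov CMoveRepro
    // The CMoveD's data inputs (AddD/SubD) are not inputs of the CmpI,
    // so the CMoveD pack cannot be vectorized -- but before the fix its
    // AddD/SubD inputs still were, mixing scalar and vector types.
    static void fill(double[] a, int seed) {
        for (int i = 0; i < a.length; i++) {
            a[i] = (i % 2 == 0) ? seed + i : seed - i;
        }
    }

    public static void main(String[] args) {
        double[] a = new double[1024];
        for (int iter = 0; iter < 10_000; iter++) { // warm up until C2 kicks in
            fill(a, 3);
        }
        if (a[0] != 3.0 || a[1] != 2.0 || a[2] != 5.0) {
            throw new AssertionError("unexpected values");
        }
        System.out.println("done");
    }
}
```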
------------- PR: https://git.openjdk.org/jdk/pull/10627 From kvn at openjdk.org Thu Oct 13 19:27:41 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Oct 2022 19:27:41 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases In-Reply-To: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> Message-ID: On Thu, 13 Oct 2022 12:00:42 GMT, Christian Hagedorn wrote: > This patch extends the IR framework with the capability to perform IR matching not only on the `PrintIdeal` and `PrintOptoAssembly` flag outputs but also on the `PrintIdeal` output of the different compile phases defined by the `COMPILER_PHASES` macro in `phasetype.hpp`: > > https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/opto/phasetype.hpp#L28-L29 > > The IR framework uses the same compile phases with the same names (it only drops those which are part of a normal execution like `PHASE_DEBUG` or `PHASE_FAILURE`). A matching on the `PrintIdeal` and `PrintOptoAssembly` flags is still possible. The IR framework creates an own `CompilePhase` enum entry for them to simplify the implementation. > > ## How does it work? > > ### Basic idea > There is a new `phase` attribute for `@IR` annotations that allows the user to specify a list of compile phases on which the IR rule should be applied on: > > > int iFld; > > @Test > @IR(counts = {IRNode.STORE_I, "1"}, > phase = {CompilePhase.AFTER_PARSING, // Fails > CompilePhase.ITER_GVN1}) // Works > public void optimizeStores() { > iFld = 42; > iFld = 42 + iFld; // Removed in first IGVN iteration and replaced by iFld = 84 > } > > In this example, we apply the IR rule on the compile phases `AFTER_PARSING` and `ITER_GVN1`. 
Since the first store to `iFld` is only removed in the first IGVN iteration, the IR rule fails for `AFTER_PARSING` while it passes for `ITER_GVN1`: > > 1) Method "public void ir_framework.examples.IRExample.optimizeStores()" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING, ITER_GVN1}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#STORE_I#_", "1"}, failOn={}, applyIfAnd={}, applyIfOr={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\d+(\s){2}(StoreI.*)+(\s){2}===.*)" > - Failed comparison: [found] 2 = 1 [given] > - Matched nodes (2): > * 25 StoreI === 5 7 24 21 [[ 31 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:3 (line 123) > * 31 StoreI === 5 25 24 29 [[ 16 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:14 (line 124) > > > More examples are shown in `IRExample.java` and also in the new test `TestPhaseIRMatching.java`. > > ### CompilePhase.DEFAULT - default compile phase > The existing IR tests either match on the `PrintIdeal` and/or `PrintOptoAssembly` flag. Looking closer at the individual `@IR` rules, we can see that a single regex only matches **either** on `PrintIdeal` **or** on `PrintOptoAssembly` but never on both. Most of these regexes are taken from the class `IRNode` which currently provides default regexes for various C2 IR nodes (a default regex either matches on `PrintIdeal` or `PrintOptoAssembly`). > > Not only the existing IR tests but also the majority of future tests do not need this new flexibility - they simply want to default match on the `PrintIdeal` flag. 
To avoid having to specify `phase = CompilePhase.PRINT_IDEAL` for each new rule (and also to avoid updating the large number of existing IR tests), we introduce a new compile phase `CompilePhase.DEFAULT` which is used by default if the user does not specify the `phase` attribute. > > Each entry in class `IRNode` now needs to define a default phase such that the IR framework knows which compile phase an `IRNode` should be matched on if `CompilePhase.DEFAULT` is selected. > > ### Different regexes for the same IRNode entry > A regex for an IR node might look different for certain compile phases. For example, we can directly match the node name `Allocate` before macro expansion for an `Allocate` node. But we are only able to match allocations again on the `PrintOptoAssembly` output (the current option) which requires a different regex. Therefore, the `IRNode` class is redesigned in the following way: > > - `IRNode` entries which currently represent direct regexes are replaced by "IR node placeholder strings". These strings are just encodings recognized by the IR framework as IR nodes: > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; // Normal IR node > public static final String ALLOC_OF = COMPOSITE_PREFIX + "ALLOC_OF" + POSTFIX; // Composite IR node > > - For each _IR node placeholder string_, we need to define which regex should be used for which compile phase. Additionally, we need to specify the default compile phase for this IR node. 
This is done in a static block immediately following the IR node placeholder string where we add a mapping to `IRNode.IR_NODE_MAPPINGS` (by using helper methods such as `allocNodes()`): > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; > static { > String idealIndependentRegex = START + "Allocate" + MID + END; > String optoRegex = "(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java" + END; > allocNodes(ALLOC, idealIndependentRegex, optoRegex); > } > > **Thus, adding a new IR node now requires two steps: Defining an IR node placeholder string and an associated regex-compile-phase-mapping in the static block immediately following the IR node placeholder string. This mapping should reflect in which compile phases the new IR node can be found.** > > ### Using the IRNode entries correctly > The IR framework enforces the correct usage of the new IR node placeholder strings. It reports format violations if: > - An `IRNode` entry is used for a compile phase which is not supported (e.g. trying to match a `LoopNode` on the output of `CompilePhase.AFTER_PARSING`). > - An IR node entry is used without specifying a mapping in a static block (e.g. when creating a new entry and forgetting to set up the mapping). > - Using a user-defined regex in `@IR` without specifying a non-default compile phase in `phase`. `CompilePhase.DEFAULT` only works with IR node placeholder strings. In all other cases, the user must explicitly specify one or more non-default compile phases. > > ## General Changes > The patch became larger than I intended it to be. I've tried to split it into smaller patches but that was quite difficult and I eventually gave up. I therefore try to summarize the changes to make reviewing this simpler: > > - Added more packages to better group related classes together. 
> - Removed unused IGV phases in `phasetype.hpp` and sorted them in the order in which they are used throughout a compilation. > - Updated existing, now failing, IR tests that use user-defined regexes. These now require a non-default phase. I've fixed that by replacing the user-defined regexes with `IRNode` entries (required some new `IRNode` entries for missing IR nodes). > - Introduced a new interface `Matchable.java` (for all classes representing a part on which IR matching can be done such as `IRMethod` or `IRRule`) to use it throughout the code instead of concrete class references. This allows better code sharing together with the already added `MatchResult.java` interface. Using interface types also simplifies testing as everything is substitutable. Each `Matchable`/`MatchResult` object now just contains a list of `Matchable`/`MatchResult` objects (e.g. an `IRMethod`/`IRMethodResult` contains a list of `IRRule`/`IRRuleMatchResult` objects to represent all IR rules/IR rule match results of this method etc.) > - Cleaned up and refactored a lot of code to use this new design. > - Using the visitor pattern to visit the `MatchResult` objects. This simplifies the failure message and compilation output printing of the different compile phases. It allows us to perform specific operations at each level (i.e. on a method level, an IR rule level etc.). Visitors are also used in testing. > - New compile phase related classes in package `phase`. The idea is that each `IRRule`/`IRRuleMatchResult` class contains a list of `CompilePhaseIRRule`/`CompilePhaseIRRuleMatchResult` objects for each compile phase. When `CompilePhase.DEFAULT` is used, we create or re-use `CompilePhaseIRRule` objects which represent the specified default phases of each `IRNode` entry. > - The `FlagVM` now collects all the needed compile phases and creates a compiler directives file for the `TestVM` accordingly. We now only print the output which is also used for IR matching. 
Before this change, we always emitted `PrintIdeal` and `PrintOptoAssembly`. Even when not using the new matching on compile phases, we still get a performance benefit when only using `IRNode` entries which match on either `PrintIdeal` or `PrintOptoAssembly`. > - The failure message and compilation output are sorted alphabetically by method names and by the enum definition order of `CompilerPhase`. > - Replaced implementation inheritance by interfaces. > - Improved encapsulation of object data. > - Updated README and many comments/class descriptions to reflect this new feature. > - Added new IR framework tests > > ## Testing > - Normal tier testing. > - Applying the patch to Valhalla to perform tier testing. > - Testing with a ZGC branch by @robcasloz which does IR matching on mach graph compile phases. This requirement was the original motivation for this RFE. Thanks Roberto for helping to test this! > > Thanks, > Christian Very nice improvement and cleanup. How do you handle repeated phases? Like ones around iterative EA and in `PHASEIDEALLOOP_ITERATIONS`. ------------- PR: https://git.openjdk.org/jdk/pull/10695 From duke at openjdk.org Thu Oct 13 20:34:11 2022 From: duke at openjdk.org (Joshua Cao) Date: Thu, 13 Oct 2022 20:34:11 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level [v2] In-Reply-To: References: Message-ID: > Example: > > > [~/jdk/jdk]$ build/linux-x86_64-server-fastdebug/jdk/bin/java -XX:CompileCommand=PrintCompilation,java.lang.String::length Hello > CompileCommand: PrintCompilation java/lang/String.length bool PrintCompilation = true > 223 12 3 java.lang.String::length (11 bytes) > 405 307 4 java.lang.String::length (11 bytes) > hello world > > > Running `java -XX:+PrintCompilation` still prints every method. This change also moves the declaration of `elapsedTimer`, but it should have insignificant impact on actual elapsed time. 
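(Aside on the output format: in the example above, a line such as `223 12 3 java.lang.String::length (11 bytes)` carries a timestamp in ms, a compile id, a tier, the method, and its bytecode size. A rough parser for that common case is sketched below; it is illustrative only and deliberately ignores decorators such as `%` or `!` that real `PrintCompilation` lines can also carry.)

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrintCompilationLine {
    // timestamp  compile-id  tier  holder::name  (size bytes)
    private static final Pattern LINE = Pattern.compile(
            "\\s*(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\S+)\\s+\\((\\d+) bytes\\)\\s*");

    // Returns {timestamp, compileId, tier, byteSize}, or null if the line
    // does not match the simple shape handled here.
    static long[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new long[] {Long.parseLong(m.group(1)), Long.parseLong(m.group(2)),
                           Long.parseLong(m.group(3)), Long.parseLong(m.group(5))};
    }

    public static void main(String[] args) {
        long[] r = parse("  223   12       3       java.lang.String::length (11 bytes)");
        if (r == null || r[0] != 223 || r[1] != 12 || r[2] != 3 || r[3] != 11) {
            throw new AssertionError();
        }
        System.out.println("parsed ok");
    }
}
```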
> > --- > > Additionally, I make a change to `test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java` so that it does not depend on PrintCompilation. The test was failing because it updates global `PrintCompilation` during the middle of the run, but this does not change the value of `PrintCompilationOption` for individual CompileTask directives. > > I modified the test so that it is similar to other [WhiteBox vm_flag test](https://github.com/openjdk/jdk/tree/master/test/lib-test/jdk/test/whitebox/vm_flags). It still tests `VmFlagTest.WHITE_BOX::get/setBooleanVMFlag`, without having to depend on the behavior on the specific flag. Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: Set CompileTask directive on initialization, add tests for PrintCompilation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10668/files - new: https://git.openjdk.org/jdk/pull/10668/files/552a9a81..2eb4c9c8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10668&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10668&range=00-01 Stats: 194 lines in 5 files changed: 178 ins; 11 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/10668.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10668/head:pull/10668 PR: https://git.openjdk.org/jdk/pull/10668 From duke at openjdk.org Thu Oct 13 20:34:12 2022 From: duke at openjdk.org (Joshua Cao) Date: Thu, 13 Oct 2022 20:34:12 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level [v2] In-Reply-To: References: Message-ID: On Wed, 12 Oct 2022 19:20:37 GMT, Vladimir Kozlov wrote: >> I can add new test, but I do not think BooleanTest can be adapted. BooleanTest.java specifically tests boolean flags, and would not work for compile commands. > > Yes to new test. 
But we still need to test global `-XX:+PrintCompilation`. I added tests for `-XX:+PrintCompilation` and `-XX:CompileCommand=PrintCompilation,...` ------------- PR: https://git.openjdk.org/jdk/pull/10668 From duke at openjdk.org Thu Oct 13 20:40:26 2022 From: duke at openjdk.org (Joshua Cao) Date: Thu, 13 Oct 2022 20:40:26 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level [v2] In-Reply-To: <3nLwjSVin4fM6Cr-hwO6o2uppYGmdKFfYRfihaYoT6c=.8951bc7f-f1de-49e3-ad1a-ddbdd80e3209@github.com> References: <5IRZaWg2_zpcI95aHeI77y_QlUwg7-A9zFBHk6qwLL8=.3cbe4abc-3c39-4534-82ab-6af4e5243942@github.com> <61NtkSyfKGssnMw900te0rLvloG-k43PCsY7KqsNiR4=.a9c2f4fe-fb2d-445d-97ea-02b0dc762482@github.com> <3nLwjSVin4fM6Cr-hwO6o2uppYGmdKFfYRfihaYoT6c=.8951bc7f-f1de-49e3-ad1a-ddbdd80e3209@github.com> Message-ID: On Thu, 13 Oct 2022 07:33:37 GMT, Christian Hagedorn wrote: >> Maybe we are looking the wrong way. Can we do `task->set_directive(directive)` when we create `task`? >> Then you don't need to move PrintCompilation code - `task->directive()` will be available at the beginning of this method. > >> elapsedTimer time; just declares variable. TraceTime uses it (with start/stop) later to accumulate time. 
------------- PR: https://git.openjdk.org/jdk/pull/10668 From chagedorn at openjdk.org Fri Oct 14 08:35:05 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 14 Oct 2022 08:35:05 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases In-Reply-To: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> Message-ID: On Thu, 13 Oct 2022 12:00:42 GMT, Christian Hagedorn wrote: > This patch extends the IR framework with the capability to perform IR matching not only on the `PrintIdeal` and `PrintOptoAssembly` flag outputs but also on the `PrintIdeal` output of the different compile phases defined by the `COMPILER_PHASES` macro in `phasetype.hpp`: > > https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/opto/phasetype.hpp#L28-L29 > > The IR framework uses the same compile phases with the same names (it only drops those which are part of a normal execution like `PHASE_DEBUG` or `PHASE_FAILURE`). A matching on the `PrintIdeal` and `PrintOptoAssembly` flags is still possible. The IR framework creates an own `CompilePhase` enum entry for them to simplify the implementation. > > ## How does it work? > > ### Basic idea > There is a new `phase` attribute for `@IR` annotations that allows the user to specify a list of compile phases on which the IR rule should be applied on: > > > int iFld; > > @Test > @IR(counts = {IRNode.STORE_I, "1"}, > phase = {CompilePhase.AFTER_PARSING, // Fails > CompilePhase.ITER_GVN1}) // Works > public void optimizeStores() { > iFld = 42; > iFld = 42 + iFld; // Removed in first IGVN iteration and replaced by iFld = 84 > } > > In this example, we apply the IR rule on the compile phases `AFTER_PARSING` and `ITER_GVN1`. 
Since the first store to `iFld` is only removed in the first IGVN iteration, the IR rule fails for `AFTER_PARSING` while it passes for `ITER_GVN1`: > > 1) Method "public void ir_framework.examples.IRExample.optimizeStores()" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING, ITER_GVN1}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#STORE_I#_", "1"}, failOn={}, applyIfAnd={}, applyIfOr={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\d+(\s){2}(StoreI.*)+(\s){2}===.*)" > - Failed comparison: [found] 2 = 1 [given] > - Matched nodes (2): > * 25 StoreI === 5 7 24 21 [[ 31 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:3 (line 123) > * 31 StoreI === 5 25 24 29 [[ 16 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:14 (line 124) > > > More examples are shown in `IRExample.java` and also in the new test `TestPhaseIRMatching.java`. > > ### CompilePhase.DEFAULT - default compile phase > The existing IR tests either match on the `PrintIdeal` and/or `PrintOptoAssembly` flag. Looking closer at the individual `@IR` rules, we can see that a single regex only matches **either** on `PrintIdeal` **or** on `PrintOptoAssembly` but never on both. Most of these regexes are taken from the class `IRNode` which currently provides default regexes for various C2 IR nodes (a default regex either matches on `PrintIdeal` or `PrintOptoAssembly`). > > Not only the existing IR tests but also the majority of future test do not need this new flexibility - they simply want to default match on the `PrintIdeal` flag. 
To avoid having to specify `phase = CompilePhase.PRINT_IDEAL` for each new rule (and also to avoid updating the large number of existing IR tests), we introduce a new compile phase `CompilePhase.DEFAULT` which is used by default if the user does not specify the `phase` attribute. > > Each entry in class `IRNode` now needs to define a default phase such that the IR framework knows which compile phase an `IRNode` should be matched on if `CompilePhase.DEFAULT` is selected. > > ### Different regexes for the same IRNode entry > A regex for an IR node might look different for certain compile phases. For example, we can directly match the node name `Allocate` before macro expansion for an `Allocate` node. But we are only able to match allocations again on the `PrintOptoAssembly` output (the current option) which requires a different regex. Therefore, the `IRNode` class is redesigned in the following way: > > - `IRNode` entries which currently represent direct regexes are replaced by "IR node placeholder strings". These strings are just encodings recognized by the IR framework as IR nodes: > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; // Normal IR node > public static final String ALLOC_OF = COMPOSITE_PREFIX + "ALLOC_OF" + POSTFIX; // Composite IR node > > - For each _IR node placeholder string_, we need to define which regex should be used for which compile phase. Additionally, we need to specify the default compile phase for this IR node. 
This is done in a static block immediately following the IR node placeholder string where we add a mapping to `IRNode.IR_NODE_MAPPINGS` (by using helper methods such as `allocNodes()`): > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; > static { > String idealIndependentRegex = START + "Allocate" + MID + END; > String optoRegex = "(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java" + END; > allocNodes(ALLOC, idealIndependentRegex, optoRegex); > } > > **Thus, adding a new IR node now requires two steps: Defining an IR node placeholder string and an associated regex-compile-phase-mapping in the static block immediately following the IR node placeholder string. This mapping should reflect in which compile phases the new IR node can be found.** > > ### Using the IRNode entries correctly > The IR framework enforces the correct usage of the new IR node placeholder strings. It reports format violations if: > - An `IRNode` entry is used for a compile phase which is not supported (e.g. trying to match a `LoopNode` on the output of `CompilePhase.AFTER_PARSING`). > - An IR node entry is used without specifying a mapping in a static block (e.g. when creating a new entry and forgetting to set up the mapping). > - Using a user-defined regex in `@IR` without specifying a non-default compile phase in `phase`. `CompilePhase.DEFAULT` only works with IR node placeholder strings. In all other cases, the user must explicitly specify one or more non-default compile phases. > > ## General Changes > The patch became larger than I intended it to be. I've tried to split it into smaller patches but that was quite difficult and I eventually gave up. I therefore try to summarize the changes to make reviewing this simpler: > > - Added more packages to better group related classes together.
> - Removed unused IGV phases in `phasetype.hpp` and sorted them in the order in which they are used throughout a compilation. > - Updated existing, now failing, IR tests that use user-defined regexes. These now require a non-default phase. I've fixed that by replacing the user-defined regexes with `IRNode` entries (required some new `IRNode` entries for missing IR nodes). > - Introduced a new interface `Matchable.java` (for all classes representing a part on which IR matching can be done such as `IRMethod` or `IRRule`) to use it throughout the code instead of concrete class references. This allows better code sharing together with the already added `MatchResult.java` interface. Using interface types also simplifies testing as everything is substitutable. Each `Matchable`/`MatchResult` object now just contains a list of `Matchable`/`MatchResult` objects (e.g. an `IRMethod`/`IRMethodResult` contains a list of `IRRule`/`IRRuleMatchResult` objects to represent all IR rules/IR rule match results of this method etc.) > - Cleaned up and refactored a lot of code to use this new design. > - Using the visitor pattern to visit the `MatchResult` objects. This simplifies the failure message and compilation output printing of the different compile phases. It allows us to perform specific operations at each level (i.e. on a method level, an IR rule level etc.). Visitors are also used in testing. > - New compile phase related classes in package `phase`. The idea is that each `IRRule`/`IRRuleMatchResult` class contains a list of `CompilePhaseIRRule`/`CompilePhaseIRRuleMatchResult` objects for each compile phase. When `CompilePhase.DEFAULT` is used, we create or re-use `CompilePhaseIRRule` objects which represent the specified default phases of each `IRNode` entry. > - The `FlagVM` now collects all the needed compile phases and creates a compiler directives file for the `TestVM` accordingly. We now only print the output which is also used for IR matching.
Before this change, we always emitted `PrintIdeal` and `PrintOptoAssembly`. Even when not using the new matching on compile phases, we still get a performance benefit when only using `IRNode` entries which match on either `PrintIdeal` or `PrintOptoAssembly`. > - The failure message and compilation output is sorted alphabetically by method names and by the enum definition order of `CompilePhase`. > - Replaced implementation inheritance by interfaces. > - Improved encapsulation of object data. > - Updated README and many comments/class descriptions to reflect this new feature. > - Added new IR framework tests > > ## Testing > - Normal tier testing. > - Applying the patch to Valhalla to perform tier testing. > - Testing with a ZGC branch by @robcasloz which does IR matching on mach graph compile phases. This requirement was the original motivation for this RFE. Thanks Roberto for helping to test this! > > Thanks, > Christian Thank you Vladimir! I forgot to mention that in the summary above. By default, I'm overriding repeated compile phases and only keep the output of the very last compile phase. There is one exception though: For `CompilePhase.BEFORE_CLOOPS`, I will keep the very first output as the expectation would rather be to have no `CountedLoop` nodes yet when matching on this compile phase: // Match on very first BEFORE_CLOOPS phase (there could be multiple phases for multiple loops in the code). BEFORE_CLOOPS("Before CountedLoop", RegexType.IDEAL_INDEPENDENT, ActionOnRepeat.KEEP_FIRST), ...
private enum ActionOnRepeat { KEEP_FIRST, KEEP_LAST } ------------- PR: https://git.openjdk.org/jdk/pull/10695 From chagedorn at openjdk.org Fri Oct 14 08:35:08 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 14 Oct 2022 08:35:08 GMT Subject: RFR: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov [v3] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 10:03:24 GMT, Fei Gao wrote: >> After JDK-8139340, JDK-8192846 and JDK-8289422, we can vectorize >> the case below by enabling -XX:+UseCMoveUnconditionally and >> -XX:+UseVectorCmov: >> >> // double[] a, double[] b, double[] c; >> for (int i = 0; i < a.length; i++) { >> c[i] = (a[i] > b[i]) ? a[i] : b[i]; >> } >> >> >> But we don't support the case like: >> >> // double[] a; >> // int seed; >> for (int i = 0; i < a.length; i++) { >> a[i] = (i % 2 == 0) ? seed + i : seed - i; >> } >> >> because the IR nodes for the CMoveD in the loop are: >> >> AddI AndI AddD SubD >> \ / / / >> CmpI / / >> \ / / >> Bool / / >> \ / / >> CMoveD >> >> >> and it is not our target pattern, which requires that the inputs >> of the Cmp node must be the same as the inputs of the CMove node >> as commented in CMoveKit::make_cmovevd_pack(). Because >> we can't vectorize the CMoveD pack, we shouldn't vectorize >> its inputs, AddD and SubD. But the current function >> CMoveKit::make_cmovevd_pack() doesn't clear the unqualified >> CMoveD pack from the packset. In this way, superword wrongly >> vectorizes AddD and SubD. Finally, we get a scalar CMoveD >> node with two vector inputs, AddVD and SubVD, which wrongly >> mixes types, so the assertion fails. >> >> To fix it, we need to remove the unvectorized CMoveD pack >> from the packset and clear related map info. > > Fei Gao has updated the pull request incrementally with one additional commit since the last revision: > > Clean up the code style Thanks for the update, looks good and testing passed!
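For reference, the two loop kernels quoted above can be run as plain Java; only the second one has `Cmp` inputs that differ from the `CMove` inputs. The class and method names are ours, and whether C2 actually vectorizes the first kernel depends on the flags and hardware discussed in the PR:

```java
public class CMovePatterns {
    // Vectorizable with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov:
    // the CmpD inputs (a[i], b[i]) are also the CMoveD inputs.
    static void max(double[] a, double[] b, double[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = (a[i] > b[i]) ? a[i] : b[i];
        }
    }

    // Not the target pattern: the CmpI works on (i % 2, 0) while the CMoveD
    // selects between (seed + i) and (seed - i), so the inputs differ.
    static void alternate(double[] a, int seed) {
        for (int i = 0; i < a.length; i++) {
            a[i] = (i % 2 == 0) ? seed + i : seed - i;
        }
    }

    public static void main(String[] args) {
        double[] c = new double[3];
        max(new double[]{1.0, 5.0, 3.0}, new double[]{2.0, 4.0, 6.0}, c);
        double[] d = new double[4];
        alternate(d, 10);
        System.out.println(java.util.Arrays.toString(c)); // [2.0, 5.0, 6.0]
        System.out.println(java.util.Arrays.toString(d)); // [10.0, 9.0, 12.0, 7.0]
    }
}
```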
------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10627 From rcastanedalo at openjdk.org Fri Oct 14 09:06:09 2022 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 14 Oct 2022 09:06:09 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases In-Reply-To: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> Message-ID: <-sdsXTMvVcizkj8iRuAroLF3rTCTX_TC0kqjrbC1AhQ=.5121a3a4-07ef-41ec-9c5d-cd7144df8c9f@github.com> On Thu, 13 Oct 2022 12:00:42 GMT, Christian Hagedorn wrote: > This patch extends the IR framework with the capability to perform IR matching not only on the `PrintIdeal` and `PrintOptoAssembly` flag outputs but also on the `PrintIdeal` output of the different compile phases defined by the `COMPILER_PHASES` macro in `phasetype.hpp`: > > https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/opto/phasetype.hpp#L28-L29 > > The IR framework uses the same compile phases with the same names (it only drops those which are part of a normal execution like `PHASE_DEBUG` or `PHASE_FAILURE`). Matching on the `PrintIdeal` and `PrintOptoAssembly` flags is still possible. The IR framework creates its own `CompilePhase` enum entry for them to simplify the implementation. > > ## How does it work?
> > ### Basic idea > There is a new `phase` attribute for `@IR` annotations that allows the user to specify a list of compile phases on which the IR rule should be applied: > > > int iFld; > > @Test > @IR(counts = {IRNode.STORE_I, "1"}, > phase = {CompilePhase.AFTER_PARSING, // Fails > CompilePhase.ITER_GVN1}) // Works > public void optimizeStores() { > iFld = 42; > iFld = 42 + iFld; // Removed in first IGVN iteration and replaced by iFld = 84 > } > > In this example, we apply the IR rule on the compile phases `AFTER_PARSING` and `ITER_GVN1`. Since the first store to `iFld` is only removed in the first IGVN iteration, the IR rule fails for `AFTER_PARSING` while it passes for `ITER_GVN1`: > > 1) Method "public void ir_framework.examples.IRExample.optimizeStores()" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING, ITER_GVN1}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#STORE_I#_", "1"}, failOn={}, applyIfAnd={}, applyIfOr={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\d+(\s){2}(StoreI.*)+(\s){2}===.*)" > - Failed comparison: [found] 2 = 1 [given] > - Matched nodes (2): > * 25 StoreI === 5 7 24 21 [[ 31 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:3 (line 123) > * 31 StoreI === 5 25 24 29 [[ 16 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:14 (line 124) > > > More examples are shown in `IRExample.java` and also in the new test `TestPhaseIRMatching.java`. > > ### CompilePhase.DEFAULT - default compile phase > The existing IR tests either match on the `PrintIdeal` and/or `PrintOptoAssembly` flag.
Looking closer at the individual `@IR` rules, we can see that a single regex only matches **either** on `PrintIdeal` **or** on `PrintOptoAssembly` but never on both. Most of these regexes are taken from the class `IRNode` which currently provides default regexes for various C2 IR nodes (a default regex either matches on `PrintIdeal` or `PrintOptoAssembly`). > > Not only the existing IR tests but also the majority of future tests do not need this new flexibility - they simply want to default match on the `PrintIdeal` flag. To avoid having to specify `phase = CompilePhase.PRINT_IDEAL` for each new rule (and also to avoid updating the large number of existing IR tests), we introduce a new compile phase `CompilePhase.DEFAULT` which is used by default if the user does not specify the `phase` attribute. > > Each entry in class `IRNode` now needs to define a default phase such that the IR framework knows which compile phase an `IRNode` should be matched on if `CompilePhase.DEFAULT` is selected. > > ### Different regexes for the same IRNode entry > A regex for an IR node might look different for certain compile phases. For example, we can directly match the node name `Allocate` before macro expansion for an `Allocate` node. But we are only able to match allocations again on the `PrintOptoAssembly` output (the current option) which requires a different regex. Therefore, the `IRNode` class is redesigned in the following way: > > - `IRNode` entries which currently represent direct regexes are replaced by "IR node placeholder strings". These strings are just encodings recognized by the IR framework as IR nodes: > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; // Normal IR node > public static final String ALLOC_OF = COMPOSITE_PREFIX + "ALLOC_OF" + POSTFIX; // Composite IR node > > - For each _IR node placeholder string_, we need to define which regex should be used for which compile phase.
Additionally, we need to specify the default compile phase for this IR node. This is done in a static block immediately following the IR node placeholder string where we add a mapping to `IRNode.IR_NODE_MAPPINGS` (by using helper methods such as `allocNodes()`): > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; > static { > String idealIndependentRegex = START + "Allocate" + MID + END; > String optoRegex = "(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java" + END; > allocNodes(ALLOC, idealIndependentRegex, optoRegex); > } > > **Thus, adding a new IR node now requires two steps: Defining an IR node placeholder string and an associated regex-compile-phase-mapping in the static block immediately following the IR node placeholder string. This mapping should reflect in which compile phases the new IR node can be found.** > > ### Using the IRNode entries correctly > The IR framework enforces the correct usage of the new IR node placeholder strings. It reports format violations if: > - An `IRNode` entry is used for a compile phase which is not supported (e.g. trying to match a `LoopNode` on the output of `CompilePhase.AFTER_PARSING`). > - An IR node entry is used without specifying a mapping in a static block (e.g. when creating a new entry and forgetting to set up the mapping). > - Using a user-defined regex in `@IR` without specifying a non-default compile phase in `phase`. `CompilePhase.DEFAULT` only works with IR node placeholder strings. In all other cases, the user must explicitly specify one or more non-default compile phases. > > ## General Changes > The patch became larger than I intended it to be. I've tried to split it into smaller patches but that was quite difficult and I eventually gave up. I therefore try to summarize the changes to make reviewing this simpler: > > - Added more packages to better group related classes together.
> - Removed unused IGV phases in `phasetype.hpp` and sorted them in the order in which they are used throughout a compilation. > - Updated existing, now failing, IR tests that use user-defined regexes. These now require a non-default phase. I've fixed that by replacing the user-defined regexes with `IRNode` entries (required some new `IRNode` entries for missing IR nodes). > - Introduced a new interface `Matchable.java` (for all classes representing a part on which IR matching can be done such as `IRMethod` or `IRRule`) to use it throughout the code instead of concrete class references. This allows better code sharing together with the already added `MatchResult.java` interface. Using interface types also simplifies testing as everything is substitutable. Each `Matchable`/`MatchResult` object now just contains a list of `Matchable`/`MatchResult` objects (e.g. an `IRMethod`/`IRMethodResult` contains a list of `IRRule`/`IRRuleMatchResult` objects to represent all IR rules/IR rule match results of this method etc.) > - Cleaned up and refactored a lot of code to use this new design. > - Using the visitor pattern to visit the `MatchResult` objects. This simplifies the failure message and compilation output printing of the different compile phases. It allows us to perform specific operations at each level (i.e. on a method level, an IR rule level etc.). Visitors are also used in testing. > - New compile phase related classes in package `phase`. The idea is that each `IRRule`/`IRRuleMatchResult` class contains a list of `CompilePhaseIRRule`/`CompilePhaseIRRuleMatchResult` objects for each compile phase. When `CompilePhase.DEFAULT` is used, we create or re-use `CompilePhaseIRRule` objects which represent the specified default phases of each `IRNode` entry. > - The `FlagVM` now collects all the needed compile phases and creates a compiler directives file for the `TestVM` accordingly. We now only print the output which is also used for IR matching.
Before this change, we always emitted `PrintIdeal` and `PrintOptoAssembly`. Even when not using the new matching on compile phases, we still get a performance benefit when only using `IRNode` entries which match on either `PrintIdeal` or `PrintOptoAssembly`. > - The failure message and compilation output is sorted alphabetically by method names and by the enum definition order of `CompilePhase`. > - Replaced implementation inheritance by interfaces. > - Improved encapsulation of object data. > - Updated README and many comments/class descriptions to reflect this new feature. > - Added new IR framework tests > > ## Testing > - Normal tier testing. > - Applying the patch to Valhalla to perform tier testing. > - Testing with a ZGC branch by @robcasloz which does IR matching on mach graph compile phases. This requirement was the original motivation for this RFE. Thanks Roberto for helping to test this! > > Thanks, > Christian Thanks for implementing this useful feature, Christian! I have tried it out again for my use case ([barrier elision tests for generational ZGC](https://github.com/robcasloz/zgc/tree/barrier-elision-tests)) and it works fine. I have also tested different combinations of phases and IR nodes and the tests pass/fail as expected. Nice that you added extensive tests for the IR framework itself (`test/hotspot/jtreg/testlibrary_tests/ir_framework`). I only have some minor comments and suggestions. test/hotspot/jtreg/compiler/lib/ir_framework/CompilePhase.java line 79: > 77: BEFORE_MATCHING("Before matching"), > 78: MATCHING("After matching", RegexType.MACH), > 79: GLOBAL_CODE_MOTION("Global code motion", RegexType.MACH), `MACHANALYSIS` is missing here, is this intentional? test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/NonIRTestClass.java line 31: > 29: /** > 30: * Test class that does not contain any applicable {@link IR @IR} annotations and therefore does not fail.
It simply > 31: * returns a {@link SuccessResult} objects when being matched. Suggestion: * returns a {@link SuccessResult} object when being matched. test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/constraint/raw/RawConstraint.java line 35: > 33: * placeholder strings are not replaced by regexes, yet). A raw constraint can be parsed into a {@link Constraint} by > 34: * calling {@link #parse(CompilePhase, String)}. This replaces the IR node placeholder strings by actual regexes and > 35: * merge composite nodes together. Suggestion: * merges composite nodes together. test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/phase/CompilePhaseIRRuleBuilder.java line 45: > 43: * IR rule. Default compile phases of {@link IRNode} placeholder strings as found in {@link RawConstraint} objects are > 44: * replaced by the actual default phases. The resulting parsed {@link Constraint} objects which now belong to a > 45: * non-default compile phase are moved to the check attribute matchables which represent these compile phase. Suggestion: * non-default compile phase are moved to the check attribute matchables which represent these compile phases. test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/mapping/MultiPhaseRangeEntry.java line 51: > 49: > 50: /** > 51: * Checks that there is no compile phase overlap of Incomplete comment? test/hotspot/jtreg/testlibrary_tests/ir_framework/examples/IRExample.java line 46: > 44: * classes (there are rare exceptions). These strings represent special placeholder strings (referred to as > 45: * "IR placeholder string" or just "IR node") which are replaced by the framework by regexes depending on which compile > 46: * phases (defined with {@link IR#phase()) the IR rule should be applied on. If an IR node placeholder string cannot be Suggestion: * phases (defined with {@link IR#phase()}) the IR rule should be applied on. 
If an IR node placeholder string cannot be test/hotspot/jtreg/testlibrary_tests/ir_framework/examples/IRExample.java line 68: > 66: * > 67: *

> 68: * The {@link IR @IR} annotations provides two kinds of checks: Suggestion: * The {@link IR @IR} annotations provide two kinds of checks: test/hotspot/jtreg/testlibrary_tests/ir_framework/examples/IRExample.java line 104: > 102: * @see Test > 103: * @see TestFramework > 104: */ A large part of this documentation is duplicated from `test/hotspot/jtreg/compiler/lib/ir_framework/README.md`. For better maintainability I suggest removing the duplicated text here and adding a reference to `README.md`. ------------- Marked as reviewed by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/10695 From qamai at openjdk.org Fri Oct 14 13:00:02 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 14 Oct 2022 13:00:02 GMT Subject: RFR: 8294865: x86: Improve the code generation of MulVB and MulVL [v3] In-Reply-To: References: Message-ID: <10EQbZwIYZLVD8-FRrRlrjbMe1oTRlRct-_NEApMsDU=.b14d1e65-6357-49cb-8e8b-4f41c12de81b@github.com> On Thu, 13 Oct 2022 16:53:45 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch simplifies and improves the code generation of `MulVB` and `MulVL` nodes: >> >> - MulVB can be implemented by alternating `vmullw` on odd and even-index elements and combining the results. >> - MulVL can be implemented on non-avx512dq by computing the product of each 32-bit half and adding the results together. >> >> Vector API benchmark shows the results of `MUL` operations: >> >> Before After >> Benchmark (size) Mode Cnt Score Error Score Error Units Change >> Byte64Vector.MUL 1024 thrpt 15 8948.607 ± 194.646 8860.404 ± 203.109 ops/ms -0.99% >> Byte128Vector.MUL 1024 thrpt 15 12915.839 ± 291.262 13554.662 ± 488.695 ops/ms +4.95% >> Byte256Vector.MUL 1024 thrpt 15 12129.959 ± 245.710 23279.276 ± 669.725 ops/ms +91.92% >> Long128Vector.MUL 1024 thrpt 15 1183.663 ± 36.440 1489.892 ± 35.356 ops/ms +25.87% >> Long256Vector.MUL 1024 thrpt 15 1911.802 ± 95.304 2834.088 ± 77.647 ops/ms +48.24% >> >> Please have a look and have some reviews, thank you very much.
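The second bullet in the quoted description rests on a simple identity: since 64-bit arithmetic wraps modulo 2^64, the product of the two high halves shifts entirely out of range, so a 64-bit product can be rebuilt from 32-bit half products. A scalar model of the per-lane arithmetic (our sketch, not code from the patch):

```java
public class MulFromHalves {
    // Scalar model of the 64-bit multiply built from 32-bit halves, the same
    // identity the non-AVX512DQ MulVL sequence applies to every vector lane:
    //   a * b == aLo*bLo + ((aLo*bHi + aHi*bLo) << 32)   (mod 2^64)
    // The aHi*bHi term contributes only at bit 64 and above, so it vanishes.
    static long mulFromHalves(long a, long b) {
        long aLo = a & 0xFFFFFFFFL, aHi = a >>> 32;
        long bLo = b & 0xFFFFFFFFL, bHi = b >>> 32;
        long cross = aLo * bHi + aHi * bLo; // cross products land at bit 32
        return aLo * bLo + (cross << 32);
    }

    public static void main(String[] args) {
        long a = 0x1234_5678_9ABC_DEF0L, b = -987_654_321L;
        // Java long multiplication wraps mod 2^64, so the identity is exact.
        System.out.println(mulFromHalves(a, b) == a * b); // prints true
    }
}
```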
> > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: > > - Merge branch 'master' into improveMulVB > - refactor conditions > - add vmulB for 8 bytes > - Merge branch 'master' into improveMulVB > - Merge branch 'master' into improveMulVB > - Merge branch 'master' into improveMulVB > - fix > - mulV Thanks a lot for your reviews and testings. May I integrate now? ------------- PR: https://git.openjdk.org/jdk/pull/10571 From qamai at openjdk.org Fri Oct 14 13:00:04 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 14 Oct 2022 13:00:04 GMT Subject: RFR: 8294865: x86: Improve the code generation of MulVB and MulVL [v3] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 18:09:31 GMT, Vladimir Ivanov wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: >> >> - Merge branch 'master' into improveMulVB >> - refactor conditions >> - add vmulB for 8 bytes >> - Merge branch 'master' into improveMulVB >> - Merge branch 'master' into improveMulVB >> - Merge branch 'master' into improveMulVB >> - fix >> - mulV > > src/hotspot/cpu/x86/matcher_x86.hpp line 195: > >> 193: return 0; >> 194: case Op_MulVB: >> 195: return 7; > > Why do you unconditionally return `7` here? Is it because AVX512 doesn't feature a vector instruction to multiply byte vectors? Yes AFAIK there is no instruction to perform byte vector multiplication directly on x86. 
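Since there is no byte-vector multiply instruction, each byte lane is effectively multiplied in a 16-bit lane (the 16-bit multiplies on odd- and even-index elements mentioned in the description) and only the low 8 bits of each product are kept. A scalar model of one lane (illustrative only; the real patch operates on whole vectors):

```java
public class ByteMulModel {
    // x86 has no byte-vector multiply, so MulVB is emulated with 16-bit
    // multiplies and the low byte of each 16-bit lane product is kept.
    static byte mulByte(byte a, byte b) {
        short wide = (short) (a * b); // the 16-bit lane product
        return (byte) wide;           // keep the low 8 bits, as MulVB does
    }

    public static void main(String[] args) {
        System.out.println(mulByte((byte) 7, (byte) 9));     // 63
        System.out.println(mulByte((byte) 100, (byte) 100)); // 16 (10000 mod 256)
    }
}
```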
------------- PR: https://git.openjdk.org/jdk/pull/10571 From kvn at openjdk.org Fri Oct 14 14:25:57 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Oct 2022 14:25:57 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases In-Reply-To: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> Message-ID: On Thu, 13 Oct 2022 12:00:42 GMT, Christian Hagedorn wrote: > This patch extends the IR framework with the capability to perform IR matching not only on the `PrintIdeal` and `PrintOptoAssembly` flag outputs but also on the `PrintIdeal` output of the different compile phases defined by the `COMPILER_PHASES` macro in `phasetype.hpp`: > > https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/opto/phasetype.hpp#L28-L29 > > The IR framework uses the same compile phases with the same names (it only drops those which are part of a normal execution like `PHASE_DEBUG` or `PHASE_FAILURE`). Matching on the `PrintIdeal` and `PrintOptoAssembly` flags is still possible. The IR framework creates its own `CompilePhase` enum entry for them to simplify the implementation. > > ## How does it work? > > ### Basic idea > There is a new `phase` attribute for `@IR` annotations that allows the user to specify a list of compile phases on which the IR rule should be applied: > > > int iFld; > > @Test > @IR(counts = {IRNode.STORE_I, "1"}, > phase = {CompilePhase.AFTER_PARSING, // Fails > CompilePhase.ITER_GVN1}) // Works > public void optimizeStores() { > iFld = 42; > iFld = 42 + iFld; // Removed in first IGVN iteration and replaced by iFld = 84 > } > > In this example, we apply the IR rule on the compile phases `AFTER_PARSING` and `ITER_GVN1`.
Since the first store to `iFld` is only removed in the first IGVN iteration, the IR rule fails for `AFTER_PARSING` while it passes for `ITER_GVN1`: > > 1) Method "public void ir_framework.examples.IRExample.optimizeStores()" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING, ITER_GVN1}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#STORE_I#_", "1"}, failOn={}, applyIfAnd={}, applyIfOr={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\d+(\s){2}(StoreI.*)+(\s){2}===.*)" > - Failed comparison: [found] 2 = 1 [given] > - Matched nodes (2): > * 25 StoreI === 5 7 24 21 [[ 31 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:3 (line 123) > * 31 StoreI === 5 25 24 29 [[ 16 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:14 (line 124) > > > More examples are shown in `IRExample.java` and also in the new test `TestPhaseIRMatching.java`. > > ### CompilePhase.DEFAULT - default compile phase > The existing IR tests either match on the `PrintIdeal` and/or `PrintOptoAssembly` flag. Looking closer at the individual `@IR` rules, we can see that a single regex only matches **either** on `PrintIdeal` **or** on `PrintOptoAssembly` but never on both. Most of these regexes are taken from the class `IRNode` which currently provides default regexes for various C2 IR nodes (a default regex either matches on `PrintIdeal` or `PrintOptoAssembly`). > > Not only the existing IR tests but also the majority of future tests do not need this new flexibility - they simply want to default match on the `PrintIdeal` flag.
To avoid having to specify `phase = CompilePhase.PRINT_IDEAL` for each new rule (and also to avoid updating the large number of existing IR tests), we introduce a new compile phase `CompilePhase.DEFAULT` which is used by default if the user does not specify the `phase` attribute. > > Each entry in class `IRNode` now needs to define a default phase such that the IR framework knows which compile phase an `IRNode` should be matched on if `CompilePhase.DEFAULT` is selected. > > ### Different regexes for the same IRNode entry > A regex for an IR node might look different for certain compile phases. For example, we can directly match the node name `Allocate` before macro expansion for an `Allocate` node. But we are only able to match allocations again on the `PrintOptoAssembly` output (the current option) which requires a different regex. Therefore, the `IRNode` class is redesigned in the following way: > > - `IRNode` entries which currently represent direct regexes are replaced by "IR node placeholder strings". These strings are just encodings recognized by the IR framework as IR nodes: > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; // Normal IR node > public static final String ALLOC_OF = COMPOSITE_PREFIX + "ALLOC_OF" + POSTFIX; // Composite IR node > > - For each _IR node placeholder string_, we need to define which regex should be used for which compile phase. Additionally, we need to specify the default compile phase for this IR node. 
This is done in a static block immediately following the IR node placeholder string where we add a mapping to `IRNode.IR_NODE_MAPPINGS` (by using helper methods such as `allocNodes()`): > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; > static { > String idealIndependentRegex = START + "Allocate" + MID + END; > String optoRegex = "(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java" + END; > allocNodes(ALLOC, idealIndependentRegex, optoRegex); > } > > **Thus, adding a new IR node now requires two steps: Defining an IR node placeholder string and an associated regex-compile-phase-mapping in the static block immediately following the IR node placeholder string. This mapping should reflect in which compile phases the new IR node can be found.** > > ### Using the IRNode entries correctly > The IR framework enforces the correct usage of the new IR node placeholder strings. It reports format violations if: > - An `IRNode` entry is used for a compile phase which is not supported (e.g. trying to match a `LoopNode` on the output of `CompilePhase.AFTER_PARSING`). > - An IR node entry is used without specifying a mapping in a static block (e.g. when creating a new entry and forgetting to set up the mapping). > - Using a user-defined regex in `@IR` without specifying a non-default compile phase in `phase`. `CompilePhase.DEFAULT` only works with IR node placeholder strings. In all other cases, the user must explicitly specify one or more non-default compile phases. > > ## General Changes > The patch became larger than I intended it to be. I've tried to split it into smaller patches but that was quite difficult and I eventually gave up. I therefore try to summarize the changes to make reviewing this simpler: > > - Added more packages to better group related classes together.
> - Removed unused IGV phases in `phasetype.hpp` and sorted them in the order in which they are used throughout a compilation. > - Updated existing, now failing, IR tests that use user-defined regexes. These now require a non-default phase. I've fixed that by replacing the user-defined regexes with `IRNode` entries (required some new `IRNode` entries for missing IR nodes). > - Introduced a new interface `Matchable.java` (for all classes representing a part on which IR matching can be done such as `IRMethod` or `IRRule`) to use it throughout the code instead of concrete class references. This allows better code sharing together with the already added `MatchResult.java` interface. Using interface types also simplifies testing as everything is substitutable. Each `Matchable`/`MatchResult` object now just contains a list of `Matchable`/`MatchResult` objects (e.g. an `IRMethod`/`IRMethodResult` contains a list of `IRRule`/`IRRuleMatchResult` objects to represent all IR rules/IR rule match results of this method etc.) > - Cleaned up and refactored a lot of code to use this new design. > - Using the visitor pattern to visit the `MatchResult` objects. This simplifies the failure message and compilation output printing of the different compile phases. It allows us to perform specific operations at each level (i.e. on a method level, an IR rule level etc.). Visitors are also used in testing. > - New compile phase related classes in package `phase`. The idea is that each `IRRule`/`IRRuleMatchResult` class contains a list of `CompilePhaseIRRule`/`CompilePhaseIRRuleMatchResult` objects for each compile phase. When `CompilePhase.DEFAULT` is used, we create or re-use `CompilePhaseIRRule` objects which represent the specified default phases of each `IRNode` entry. > - The `FlagVM` now collects all the needed compile phases and creates a compiler directives file for the `TestVM` accordingly. We now only print the output which is also used for IR matching.
Before this change, we always emitted `PrintIdeal` and `PrintOptoAssembly`. Even when not using the new matching on compile phases, we still get a performance benefit when only using `IRNode` entries which match on either `PrintIdeal` or `PrintOptoAssembly`. > - The failure message and compilation output are sorted alphabetically by method names and by the enum definition order of `CompilePhase`. > - Replaced implementation inheritance with interfaces. > - Improved encapsulation of object data. > - Updated README and many comments/class descriptions to reflect this new feature. > - Added new IR framework tests > > ## Testing > - Normal tier testing. > - Applying the patch to Valhalla to perform tier testing. > - Testing with a ZGC branch by @robcasloz which does IR matching on mach graph compile phases. This requirement was the original motivation for this RFE. Thanks Roberto for helping to test this! > > Thanks, > Christian Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10695 From kvn at openjdk.org Fri Oct 14 14:42:04 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Oct 2022 14:42:04 GMT Subject: RFR: 8294865: x86: Improve the code generation of MulVB and MulVL [v3] In-Reply-To: <10EQbZwIYZLVD8-FRrRlrjbMe1oTRlRct-_NEApMsDU=.b14d1e65-6357-49cb-8e8b-4f41c12de81b@github.com> References: <10EQbZwIYZLVD8-FRrRlrjbMe1oTRlRct-_NEApMsDU=.b14d1e65-6357-49cb-8e8b-4f41c12de81b@github.com> Message-ID: On Fri, 14 Oct 2022 12:56:02 GMT, Quan Anh Mai wrote: > Thanks a lot for your reviews and testing. May I integrate now? Let me test the latest version before you push. 
------------- PR: https://git.openjdk.org/jdk/pull/10571 From jbhateja at openjdk.org Fri Oct 14 15:23:12 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 14 Oct 2022 15:23:12 GMT Subject: RFR: 8293531: C2: some vectorapi tests fail assert "Not monotonic" with flag -XX:TypeProfileLevel=222 In-Reply-To: References: Message-ID: On Tue, 11 Oct 2022 20:40:19 GMT, Dean Long wrote: >> The problem occurs in iterative DF analysis during CCP optimization: meet operations drop the speculative types before the participating lattice values converge, since the [include_speculative argument they receive is always set to false](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/type.hpp#L231), whereas the [equality check](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/phaseX.cpp#L1751) in the failing assertion is performed against the original type, which still carries the speculative type. >> >> To fix this, the type comparison in the assertion should also be done after stripping the speculative type. With this change, the intermittent assertion failures in several Vector API tests reported in the bug report are no longer seen. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Could this bug explain JDK-8295028? Hi @dean-long, @vnkozlov , please let me know if this can be checked in. 
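The interaction between meet and the monotonicity assertion can be illustrated with a small self-contained Java model (the names `LatticeType`, `meet` and `removeSpeculative` here are illustrative stand-ins, not HotSpot's actual C++ API in `type.hpp`): a meet that always drops speculative information cannot be compared for monotonicity against a type that still carries it.

```java
import java.util.Objects;

// Illustrative model: a lattice value with an optional speculative component.
final class LatticeType {
    final String base;         // main type, e.g. "Object"
    final String speculative;  // profile-based guess, e.g. "String", or null

    LatticeType(String base, String speculative) {
        this.base = base;
        this.speculative = speculative;
    }

    // Like a meet with include_speculative == false: the result never
    // carries speculative information, even for meet(t, t).
    LatticeType meet(LatticeType other) {
        String joined = base.equals(other.base) ? base : "Object"; // crude join
        return new LatticeType(joined, null);
    }

    LatticeType removeSpeculative() {
        return speculative == null ? this : new LatticeType(base, null);
    }

    boolean sameAs(LatticeType other) {
        return base.equals(other.base) && Objects.equals(speculative, other.speculative);
    }
}

public class NotMonotonicDemo {
    public static void main(String[] args) {
        LatticeType t = new LatticeType("Object", "String");
        // Naive check fails: meet() dropped the speculative part, t still has it.
        System.out.println(t.meet(t).sameAs(t));                     // false
        // Fixed check: strip speculation on both sides before comparing.
        System.out.println(t.meet(t).sameAs(t.removeSpeculative())); // true
    }
}
```

Running `main` prints `false` then `true`, which mirrors why the assertion had to strip the speculative part from the original type before the comparison.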
------------- PR: https://git.openjdk.org/jdk/pull/10648 From kvn at openjdk.org Fri Oct 14 15:30:50 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Oct 2022 15:30:50 GMT Subject: RFR: 8293531: C2: some vectorapi tests fail assert "Not monotonic" with flag -XX:TypeProfileLevel=222 In-Reply-To: References: Message-ID: On Tue, 11 Oct 2022 20:40:19 GMT, Dean Long wrote: >> The problem occurs in iterative DF analysis during CCP optimization: meet operations drop the speculative types before the participating lattice values converge, since the [include_speculative argument they receive is always set to false](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/type.hpp#L231), whereas the [equality check](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/phaseX.cpp#L1751) in the failing assertion is performed against the original type, which still carries the speculative type. >> >> To fix this, the type comparison in the assertion should also be done after stripping the speculative type. With this change, the intermittent assertion failures in several Vector API tests reported in the bug report are no longer seen. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Could this bug explain JDK-8295028? > Hi @dean-long, @vnkozlov , please let me know if this can be checked in. Yes, you can push. 
------------- PR: https://git.openjdk.org/jdk/pull/10648 From jbhateja at openjdk.org Fri Oct 14 16:39:14 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 14 Oct 2022 16:39:14 GMT Subject: Integrated: 8293531: C2: some vectorapi tests fail assert "Not monotonic" with flag -XX:TypeProfileLevel=222 In-Reply-To: References: Message-ID: On Tue, 11 Oct 2022 12:19:05 GMT, Jatin Bhateja wrote: > The problem occurs in iterative DF analysis during CCP optimization: meet operations drop the speculative types before the participating lattice values converge, since the [include_speculative argument they receive is always set to false](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/type.hpp#L231), whereas the [equality check](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/phaseX.cpp#L1751) in the failing assertion is performed against the original type, which still carries the speculative type. > > To fix this, the type comparison in the assertion should also be done after stripping the speculative type. With this change, the intermittent assertion failures in several Vector API tests reported in the bug report are no longer seen. > > Kindly review and share your feedback. > > Best Regards, > Jatin This pull request has now been integrated. 
Changeset: 0043d58c Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/0043d58c5d52c3b299a4b6dfcec34a7db5041aea Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8293531: C2: some vectorapi tests fail assert "Not monotonic" with flag -XX:TypeProfileLevel=222 Reviewed-by: kvn ------------- PR: https://git.openjdk.org/jdk/pull/10648 From kvn at openjdk.org Fri Oct 14 17:44:58 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Oct 2022 17:44:58 GMT Subject: RFR: 8294865: x86: Improve the code generation of MulVB and MulVL [v3] In-Reply-To: <10EQbZwIYZLVD8-FRrRlrjbMe1oTRlRct-_NEApMsDU=.b14d1e65-6357-49cb-8e8b-4f41c12de81b@github.com> References: <10EQbZwIYZLVD8-FRrRlrjbMe1oTRlRct-_NEApMsDU=.b14d1e65-6357-49cb-8e8b-4f41c12de81b@github.com> Message-ID: On Fri, 14 Oct 2022 12:56:02 GMT, Quan Anh Mai wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: >> >> - Merge branch 'master' into improveMulVB >> - refactor conditions >> - add vmulB for 8 bytes >> - Merge branch 'master' into improveMulVB >> - Merge branch 'master' into improveMulVB >> - Merge branch 'master' into improveMulVB >> - fix >> - mulV > > Thanks a lot for your reviews and testing. May I integrate now? @merykitty testing looks good. You can integrate now. ------------- PR: https://git.openjdk.org/jdk/pull/10571 From duke at openjdk.org Fri Oct 14 21:38:22 2022 From: duke at openjdk.org (vpaprotsk) Date: Fri, 14 Oct 2022 21:38:22 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions Message-ID: Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, I left a lot of comments in `macroAssembler_x86_poly.cpp`. - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. 
- Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R == 0 || S == 0) so would like advice please. - Added a JMH perf test. - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. Perf before: Benchmark (dataSize) (provider) Mode Cnt Score Error Units Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ± 110554.162 ops/s Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ± 86696.037 ops/s Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ± 14074.655 ops/s Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ± 390.921 ops/s Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ± 1.402 ops/s and after: Benchmark (dataSize) (provider) Mode Cnt Score Error Units Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ± 154528.057 ops/s Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ± 95253.445 ops/s Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ± 100847.766 ops/s Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ± 25883.825 ops/s Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ± 
56.147 ops/s ------------- Commit messages: - missed white-space fix - - Fix whitespace and copyright statements - Merge remote-tracking branch 'vpaprotsk/master' into avx512-poly - Poly1305 AVX512 intrinsic for x86_64 Changes: https://git.openjdk.org/jdk/pull/10582/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288047 Stats: 1676 lines in 29 files changed: 1665 ins; 0 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Oct 14 21:38:22 2022 From: duke at openjdk.org (vpaprotsk) Date: Fri, 14 Oct 2022 21:38:22 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 21:28:26 GMT, vpaprotsk wrote: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, I left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R == 0 || S == 0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ± 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ± 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ± 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ± 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ± 
1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ± 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ± 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ± 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ± 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ± 56.147 ops/s I am part of the Intel Java Team. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From kvn at openjdk.org Fri Oct 14 23:08:48 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Oct 2022 23:08:48 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level [v2] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 20:34:11 GMT, Joshua Cao wrote: >> Example: >> >> >> [~/jdk/jdk]$ build/linux-x86_64-server-fastdebug/jdk/bin/java -XX:CompileCommand=PrintCompilation,java.lang.String::length Hello >> CompileCommand: PrintCompilation java/lang/String.length bool PrintCompilation = true >> 223 12 3 java.lang.String::length (11 bytes) >> 405 307 4 java.lang.String::length (11 bytes) >> hello world >> >> >> Running `java -XX:+PrintCompilation` still prints every method. This change also moves the declaration of `elapsedTimer`, but it should have insignificant impact on actual elapsed time. >> >> --- >> >> Additionally, I make a change to `test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java` so that it does not depend on PrintCompilation. The test was failing because it updates global `PrintCompilation` during the middle of the run, but this does not change the value of `PrintCompilationOption` for individual CompileTask directives. >> >> I modified the test so that it is similar to other [WhiteBox vm_flag test](https://github.com/openjdk/jdk/tree/master/test/lib-test/jdk/test/whitebox/vm_flags). 
It still tests `VmFlagTest.WHITE_BOX::get/setBooleanVMFlag`, without having to depend on the behavior on the specific flag. > > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > Set CompileTask directive on initialization, add tests for > PrintCompilation Looks good. Let me test it. ------------- PR: https://git.openjdk.org/jdk/pull/10668 From kvn at openjdk.org Sat Oct 15 02:45:02 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 15 Oct 2022 02:45:02 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level [v2] In-Reply-To: References: Message-ID: <53pJd3Q_e_jqwgrlaRmIKzOwfEGap2JT69307d_ODM8=.b7409e43-dc5a-4c4c-a17b-e7c87f6793b7@github.com> On Thu, 13 Oct 2022 20:34:11 GMT, Joshua Cao wrote: >> Example: >> >> >> [~/jdk/jdk]$ build/linux-x86_64-server-fastdebug/jdk/bin/java -XX:CompileCommand=PrintCompilation,java.lang.String::length Hello >> CompileCommand: PrintCompilation java/lang/String.length bool PrintCompilation = true >> 223 12 3 java.lang.String::length (11 bytes) >> 405 307 4 java.lang.String::length (11 bytes) >> hello world >> >> >> Running `java -XX:+PrintCompilation` still prints every method. This change also moves the declaration of `elapsedTimer`, but it should have insignificant impact on actual elapsed time. >> >> --- >> >> Additionally, I make a change to `test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java` so that it does not depend on PrintCompilation. The test was failing because it updates global `PrintCompilation` during the middle of the run, but this does not change the value of `PrintCompilationOption` for individual CompileTask directives. >> >> I modified the test so that it is similar to other [WhiteBox vm_flag test](https://github.com/openjdk/jdk/tree/master/test/lib-test/jdk/test/whitebox/vm_flags). It still tests `VmFlagTest.WHITE_BOX::get/setBooleanVMFlag`, without having to depend on the behavior on the specific flag. 
> > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > Set CompileTask directive on initialization, add tests for > PrintCompilation My testing passed. I verified that new tests passed too. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10668 From qamai at openjdk.org Sat Oct 15 11:31:56 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Sat, 15 Oct 2022 11:31:56 GMT Subject: Integrated: 8294865: x86: Improve the code generation of MulVB and MulVL In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 10:33:27 GMT, Quan Anh Mai wrote: > Hi, > > This patch simplifies and improves the code generation of `MulVB` and `MulVL` nodes, > > - MulVB can be implemented by alternating `vmullw` on odd and even-index elements and combining the results. > - MulVL can be implemented on non-avx512dq by computing the product of each 32-bit half and adding the results together. > > Vector API benchmark shows the results of `MUL` operations: > > Before After > Benchmark (size) Mode Cnt Score Error Score Error Units Change > Byte64Vector.MUL 1024 thrpt 15 8948.607 ± 194.646 8860.404 ± 203.109 ops/ms -0.99% > Byte128Vector.MUL 1024 thrpt 15 12915.839 ± 291.262 13554.662 ± 488.695 ops/ms +4.95% > Byte256Vector.MUL 1024 thrpt 15 12129.959 ± 245.710 23279.276 ± 669.725 ops/ms +91.92% > Long128Vector.MUL 1024 thrpt 15 1183.663 ± 36.440 1489.892 ± 35.356 ops/ms +25.87% > Long256Vector.MUL 1024 thrpt 15 1911.802 ± 95.304 2834.088 ± 77.647 ops/ms +48.24% > > Please have a look and have some reviews, thank you very much. This pull request has now been integrated. 
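As a side note, the non-AVX512DQ `MulVL` lowering quoted above (multiply the 32-bit halves and add the shifted partial products) has a simple scalar equivalent. The sketch below is mine, not the generated vector assembly, but it computes the same per-lane result: the `aHi * bHi` partial product contributes only to bits 64 and up, so it never appears in the low 64-bit lane and can be dropped.

```java
public class MulLongHalves {
    // 64-bit product rebuilt from 32-bit halves, as one vector lane would
    // compute it without AVX512DQ's vpmullq. Java's long multiplication
    // wraps modulo 2^64, matching the hardware lane semantics.
    static long mulViaHalves(long a, long b) {
        long aLo = a & 0xFFFF_FFFFL, aHi = a >>> 32;
        long bLo = b & 0xFFFF_FFFFL, bHi = b >>> 32;
        // a*b mod 2^64 = aLo*bLo + (aLo*bHi + aHi*bLo) * 2^32
        return aLo * bLo + ((aLo * bHi + aHi * bLo) << 32);
    }

    public static void main(String[] args) {
        long a = 0x1234_5678_9ABC_DEF0L, b = -987_654_321L;
        System.out.println(mulViaHalves(a, b) == a * b); // true
    }
}
```

This identity holds for every pair of longs because all four partial products are taken modulo 2^64, the same wrap-around the vector instructions use.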
Changeset: 404e8de1 Author: Quan Anh Mai URL: https://git.openjdk.org/jdk/commit/404e8de1559adade31df98a83919841f080b5b89 Stats: 170 lines in 5 files changed: 15 ins; 64 del; 91 mod 8294865: x86: Improve the code generation of MulVB and MulVL Reviewed-by: kvn, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/10571 From dcubed at openjdk.org Sat Oct 15 14:55:05 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Sat, 15 Oct 2022 14:55:05 GMT Subject: Integrated: 8295379: ProblemList java/lang/Float/Binary16Conversion.java in Xcomp mode on x64 In-Reply-To: References: Message-ID: On Sat, 15 Oct 2022 14:29:18 GMT, Daniel D. Daugherty wrote: > A trivial fix to ProblemList a couple of tests in -Xcomp mode on generic-x64: > > [JDK-8295379](https://bugs.openjdk.org/browse/JDK-8295379) ProblemList java/lang/Float/Binary16Conversion.java in Xcomp mode on x64 > [JDK-8295380](https://bugs.openjdk.org/browse/JDK-8295380) ProblemList gc/cslocker/TestCSLocker.java in Xcomp mode on x64 This pull request has now been integrated. Changeset: e7d0ab22 Author: Daniel D. 
Daugherty URL: https://git.openjdk.org/jdk/commit/e7d0ab227ff86bb65abf7fbeb135ce657454200b Stats: 2 lines in 2 files changed: 1 ins; 0 del; 1 mod 8295379: ProblemList java/lang/Float/Binary16Conversion.java in Xcomp mode on x64 8295380: ProblemList gc/cslocker/TestCSLocker.java in Xcomp mode on x64 Reviewed-by: alanb ------------- PR: https://git.openjdk.org/jdk/pull/10719 From eliu at openjdk.org Mon Oct 17 05:22:02 2022 From: eliu at openjdk.org (Eric Liu) Date: Mon, 17 Oct 2022 05:22:02 GMT Subject: RFR: 8294186: AArch64: VectorMaskToLong failed on SVE2 machine with -XX:UseSVE=1 [v2] In-Reply-To: <8DwuwmReKGKRgl34NleQUXepGoosyWlRxEsQxtj_vbE=.ca094a2e-8e40-41cf-918f-828e15a72799@github.com> References: <8DwuwmReKGKRgl34NleQUXepGoosyWlRxEsQxtj_vbE=.ca094a2e-8e40-41cf-918f-828e15a72799@github.com> Message-ID: On Wed, 28 Sep 2022 14:31:21 GMT, Eric Liu wrote: >> C2_MacroAssembler::sve_vmask_tolong would fail on a BITPERM-supported SVE2 machine with "-XX:UseSVE=1". >> >> `BITPERM` is an optional feature in SVE2. With this feature, VectorMaskToLong has a more efficient implementation. For other cases, it should generate SVE1 code. >> >> [TEST] >> jdk/incubator/vector, hotspot/compiler/vectorapi passed on a BITPERM-supported SVE2 machine, with option -XX:UseSVE=(0, 1, 2). > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > Refine comment > > Change-Id: I785817a0068098e9c48221cb391ef776186ef5de @theRealAph Could you help to take a look? Thanks. 
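For background, `VectorMaskToLong` implements the semantics of the Vector API's `VectorMask.toLong()`: bit `i` of the result is set exactly when mask lane `i` is true. A plain-Java sketch of that contract (the intrinsic computes the same thing with SVE instructions, using the faster `BITPERM` sequence only when the feature is present):

```java
public class MaskToLong {
    // Scalar equivalent of VectorMask.toLong(): set bit i iff lane i is true.
    // Lanes beyond 64 cannot be represented and are ignored.
    static long maskToLong(boolean[] lanes) {
        long result = 0L;
        for (int i = 0; i < lanes.length && i < Long.SIZE; i++) {
            if (lanes[i]) {
                result |= 1L << i;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        boolean[] mask = {true, false, true, true}; // lanes 0, 2, 3 set
        System.out.println(Long.toBinaryString(maskToLong(mask))); // 1101
    }
}
```

Any correct intrinsic, with or without `BITPERM`, must produce the same bit pattern as this loop for the active lanes.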
------------- PR: https://git.openjdk.org/jdk/pull/10443 From xliu at openjdk.org Mon Oct 17 05:50:19 2022 From: xliu at openjdk.org (Xin Liu) Date: Mon, 17 Oct 2022 05:50:19 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level [v2] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 20:34:11 GMT, Joshua Cao wrote: >> Example: >> >> >> [~/jdk/jdk]$ build/linux-x86_64-server-fastdebug/jdk/bin/java -XX:CompileCommand=PrintCompilation,java.lang.String::length Hello >> CompileCommand: PrintCompilation java/lang/String.length bool PrintCompilation = true >> 223 12 3 java.lang.String::length (11 bytes) >> 405 307 4 java.lang.String::length (11 bytes) >> hello world >> >> >> Running `java -XX:+PrintCompilation` still prints every method. This change also moves the declaration of `elapsedTimer`, but it should have insignificant impact on actual elapsed time. >> >> --- >> >> Additionally, I make a change to `test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java` so that it does not depend on PrintCompilation. The test was failing because it updates global `PrintCompilation` during the middle of the run, but this does not change the value of `PrintCompilationOption` for individual CompileTask directives. >> >> I modified the test so that it is similar to other [WhiteBox vm_flag test](https://github.com/openjdk/jdk/tree/master/test/lib-test/jdk/test/whitebox/vm_flags). It still tests `VmFlagTest.WHITE_BOX::get/setBooleanVMFlag`, without having to depend on the behavior on the specific flag. > > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > Set CompileTask directive on initialization, add tests for > PrintCompilation LGTM. I am not a reviewer. ------------- Marked as reviewed by xliu (Committer). 
PR: https://git.openjdk.org/jdk/pull/10668 From fgao at openjdk.org Mon Oct 17 06:47:21 2022 From: fgao at openjdk.org (Fei Gao) Date: Mon, 17 Oct 2022 06:47:21 GMT Subject: RFR: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov [v4] In-Reply-To: References: Message-ID: > After JDK-8139340, JDK-8192846 and JDK-8289422, we can vectorize > the case below by enabling -XX:+UseCMoveUnconditionally and > -XX:+UseVectorCmov: > > // double[] a, double[] b, double[] c; > for (int i = 0; i < a.length; i++) { > c[i] = (a[i] > b[i]) ? a[i] : b[i]; > } > > > But we don't support the case like: > > // double[] a; > // int seed; > for (int i = 0; i < a.length; i++) { > a[i] = (i % 2 == 0) ? seed + i : seed - i; > } > > because the IR nodes for the CMoveD in the loop are: > > AddI AndI AddD SubD > \ / / / > CmpI / / > \ / / > Bool / / > \ / / > CMoveD > > > and it is not our target pattern, which requires that the inputs > of the Cmp node must be the same as the inputs of the CMove node > as commented in CMoveKit::make_cmovevd_pack(). Because > we can't vectorize the CMoveD pack, we shouldn't vectorize > its inputs, AddD and SubD. But the current function > CMoveKit::make_cmovevd_pack() doesn't clear the unqualified > CMoveD pack from the packset. In this way, superword wrongly > vectorizes AddD and SubD. Finally, we get a scalar CMoveD > node with two vector inputs, AddVD and SubVD, which wrongly > mixes scalar and vector types, so the assertion fails. > > To fix it, we need to remove the unvectorized CMoveD pack > from the packset and clear related map info. 
Fei Gao has updated the pull request incrementally with one additional commit since the last revision: Update IR framework testcase ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10627/files - new: https://git.openjdk.org/jdk/pull/10627/files/47ca7341..f65118cc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10627&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10627&range=02-03 Stats: 9 lines in 1 file changed: 9 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10627.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10627/head:pull/10627 PR: https://git.openjdk.org/jdk/pull/10627 From fgao at openjdk.org Mon Oct 17 06:47:21 2022 From: fgao at openjdk.org (Fei Gao) Date: Mon, 17 Oct 2022 06:47:21 GMT Subject: RFR: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov [v3] In-Reply-To: References: Message-ID: On Fri, 14 Oct 2022 08:32:53 GMT, Christian Hagedorn wrote: > Thanks for the update, looks good and testing passed! @chhagedorn thanks for your review and test work! ------------- PR: https://git.openjdk.org/jdk/pull/10627 From fgao at openjdk.org Mon Oct 17 06:47:22 2022 From: fgao at openjdk.org (Fei Gao) Date: Mon, 17 Oct 2022 06:47:22 GMT Subject: RFR: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov [v3] In-Reply-To: References: Message-ID: <2ts65aPJo7jcyM2g_CU132QsW601P-9G5OdyVoMC54A=.13294426-c802-482a-b175-4d782ea6b472@github.com> On Thu, 13 Oct 2022 19:13:43 GMT, Vladimir Kozlov wrote: > Does `compiler/c2/irTests/TestVectorConditionalMove.java` IR test cover this case? Can you add it if it is not already? Thanks for pointing it out @vnkozlov . Updated the IR testcase in the new commit. 
------------- PR: https://git.openjdk.org/jdk/pull/10627 From haosun at openjdk.org Mon Oct 17 08:17:17 2022 From: haosun at openjdk.org (Hao Sun) Date: Mon, 17 Oct 2022 08:17:17 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases In-Reply-To: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> Message-ID: On Thu, 13 Oct 2022 12:00:42 GMT, Christian Hagedorn wrote: > This patch extends the IR framework with the capability to perform IR matching not only on the `PrintIdeal` and `PrintOptoAssembly` flag outputs but also on the `PrintIdeal` output of the different compile phases defined by the `COMPILER_PHASES` macro in `phasetype.hpp`: > > https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/opto/phasetype.hpp#L28-L29 > > The IR framework uses the same compile phases with the same names (it only drops those which are part of a normal execution like `PHASE_DEBUG` or `PHASE_FAILURE`). Matching on the `PrintIdeal` and `PrintOptoAssembly` flags is still possible. The IR framework creates its own `CompilePhase` enum entry for them to simplify the implementation. > > ## How does it work? > > ### Basic idea > There is a new `phase` attribute for `@IR` annotations that allows the user to specify a list of compile phases on which the IR rule should be applied: > > > int iFld; > > @Test > @IR(counts = {IRNode.STORE_I, "1"}, > phase = {CompilePhase.AFTER_PARSING, // Fails > CompilePhase.ITER_GVN1}) // Works > public void optimizeStores() { > iFld = 42; > iFld = 42 + iFld; // Removed in first IGVN iteration and replaced by iFld = 84 > } > > In this example, we apply the IR rule on the compile phases `AFTER_PARSING` and `ITER_GVN1`. 
Since the first store to `iFld` is only removed in the first IGVN iteration, the IR rule fails for `AFTER_PARSING` while it passes for `ITER_GVN1`: > > 1) Method "public void ir_framework.examples.IRExample.optimizeStores()" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING, ITER_GVN1}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#STORE_I#_", "1"}, failOn={}, applyIfAnd={}, applyIfOr={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\d+(\s){2}(StoreI.*)+(\s){2}===.*)" > - Failed comparison: [found] 2 = 1 [given] > - Matched nodes (2): > * 25 StoreI === 5 7 24 21 [[ 31 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:3 (line 123) > * 31 StoreI === 5 25 24 29 [[ 16 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:14 (line 124) > > > More examples are shown in `IRExample.java` and also in the new test `TestPhaseIRMatching.java`. > > ### CompilePhase.DEFAULT - default compile phase > The existing IR tests either match on the `PrintIdeal` and/or `PrintOptoAssembly` flag. Looking closer at the individual `@IR` rules, we can see that a single regex only matches **either** on `PrintIdeal` **or** on `PrintOptoAssembly` but never on both. Most of these regexes are taken from the class `IRNode` which currently provides default regexes for various C2 IR nodes (a default regex either matches on `PrintIdeal` or `PrintOptoAssembly`). > > Not only the existing IR tests but also the majority of future tests do not need this new flexibility - they simply want to default match on the `PrintIdeal` flag. 
To avoid having to specify `phase = CompilePhase.PRINT_IDEAL` for each new rule (and also to avoid updating the large number of existing IR tests), we introduce a new compile phase `CompilePhase.DEFAULT` which is used by default if the user does not specify the `phase` attribute. > > Each entry in class `IRNode` now needs to define a default phase such that the IR framework knows which compile phase an `IRNode` should be matched on if `CompilePhase.DEFAULT` is selected. > > ### Different regexes for the same IRNode entry > A regex for an IR node might look different for certain compile phases. For example, we can directly match the node name `Allocate` before macro expansion for an `Allocate` node. But we are only able to match allocations again on the `PrintOptoAssembly` output (the current option) which requires a different regex. Therefore, the `IRNode` class is redesigned in the following way: > > - `IRNode` entries which currently represent direct regexes are replaced by "IR node placeholder strings". These strings are just encodings recognized by the IR framework as IR nodes: > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; // Normal IR node > public static final String ALLOC_OF = COMPOSITE_PREFIX + "ALLOC_OF" + POSTFIX; // Composite IR node > > - For each _IR node placeholder string_, we need to define which regex should be used for which compile phase. Additionally, we need to specify the default compile phase for this IR node. 
This is done in a static block immediately following the IR node placeholder string where we add a mapping to `IRNode.IR_NODE_MAPPINGS` (by using helper methods such as `allocNodes()`): > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; > static { > String idealIndependentRegex = START + "Allocate" + MID + END; > String optoRegex = "(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java" + END; > allocNodes(ALLOC, idealIndependentRegex, optoRegex); > } > > **Thus, adding a new IR node now requires two steps: Defining an IR node placeholder string and an associated regex-compile-phase-mapping in the static block immediately following the IR node placeholder string. This mapping should reflect in which compile phases the new IR node can be found.** > > ### Using the IRNode entries correctly > The IR framework enforces the correct usage of the new IR node placeholder strings. It reports format violations if: > - An `IRNode` entry is used for a compile phase which is not supported (e.g. trying to match a `LoopNode` on the output of `CompilePhase.AFTER_PARSING`). > - An IR node entry is used without specifying a mapping in a static block (e.g. when creating a new entry and forgetting to set up the mapping). > - A user-defined regex is used in `@IR` without specifying a non-default compile phase in `phase`. `CompilePhase.DEFAULT` only works with IR node placeholder strings. In all other cases, the user must explicitly specify one or more non-default compile phases. > > ## General Changes > The patch became larger than I intended it to be. I tried to split it into smaller patches but that was quite difficult and I eventually gave up. I therefore try to summarize the changes to make reviewing this simpler: > > - Added more packages to better group related classes together. 
> - Removed unused IGV phases in `phasetype.hpp` and sorted them in the order in which they are used throughout a compilation. > - Updated existing, now failing, IR tests that use user-defined regexes. These now require a non-default phase. I've fixed that by replacing the user-defined regexes with `IRNode` entries (required some new `IRNode` entries for missing IR nodes). > - Introduced a new interface `Matchable.java` (for all classes representing a part on which IR matching can be done such as `IRMethod` or `IRRule`) to use it throughout the code instead of concrete class references. This allows better code sharing together with the already added `MatchResult.java` interface. Using interface types also simplifies testing as everything is substitutable. Each `Matchable`/`MatchResult` object now just contains a list of `Matchable`/`MatchResult` objects (e.g. an `IRMethod`/`IRMethodResult` contains a list of `IRRule`/`IRRuleMatchResult` objects to represent all IR rules/IR rule match results of this method etc.) > - Cleaned up and refactored a lot of code to use this new design. > - Used the visitor pattern to visit the `MatchResult` objects. This simplifies the failure message and compilation output printing of the different compile phases. It allows us to perform specific operations at each level (i.e. on a method level, an IR rule level etc.). Visitors are also used in testing. > - New compile-phase-related classes in package `phase`. The idea is that each `IRRule`/`IRRuleMatchResult` class contains a list of `CompilePhaseIRRule`/`CompilePhaseIRRuleMatchResult` objects for each compile phase. When `CompilePhase.DEFAULT` is used, we create or re-use `CompilePhaseIRRule` objects which represent the specified default phases of each `IRNode` entry. > - The `FlagVM` now collects all the needed compile phases and creates a compiler directives file for the `TestVM` accordingly. We now only print the output which is also used for IR matching. 
Before this change, we always emitted `PrintIdeal` and `PrintOptoAssembly`. Even when not using the new matching on compile phases, we still get a performance benefit when only using `IRNode` entries which match on either `PrintIdeal` or `PrintOptoAssembly`. > - The failure message and compilation output is sorted alphabetically by method names and by the enum definition order of `CompilerPhase`. > - Replaced implementation inheritance by interfaces. > - Improved encapsulation of object data. > - Updated README and many comments/class descriptions to reflect this new feature. > - Added new IR framework tests > > ## Testing > - Normal tier testing. > - Applying the patch to Valhalla to perform tier testing. > - Testing with a ZGC branch by @robcasloz which does IR matching on mach graph compile phases. This requirement was the original motivation for this RFE. Thanks Roberto for helping to test this! > > Thanks, > Christian Hi, 1) For the following two files, the copyright year should be updated to 2022. src/hotspot/share/opto/phasetype.hpp test/hotspot/jtreg/compiler/lib/ir_framework/driver/FlagVMProcess.java 2) I tested this PR on one SVE supporting machine and found the IR verification failed for the following 4 cases. Mainly because the IR nodes are not created for these SVE related rules. test/hotspot/jtreg/compiler/vectorapi/AllBitsSetVectorMatchRuleTest.java test/hotspot/jtreg/compiler/vectorapi/VectorFusedMultiplyAddSubTest.java test/hotspot/jtreg/compiler/vectorapi/VectorGatherScatterTest.java test/hotspot/jtreg/compiler/vectorapi/VectorMaskedNotTest.java I suggest using the following updates. diff --git a/src/hotspot/share/opto/phasetype.hpp b/src/hotspot/share/opto/phasetype.hpp index 0c9d34f113c..ae48de7feb3 100644 --- a/src/hotspot/share/opto/phasetype.hpp +++ b/src/hotspot/share/opto/phasetype.hpp @@ -1,5 +1,5 @@ /* - * Copyright (c) 2012, 2021, Oracle and/or its affiliates. All rights reserved. 
+ * Copyright (c) 2012, 2022, Oracle and/or its affiliates. All rights reserved. * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER. * * This code is free software; you can redistribute it and/or modify it diff --git a/test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java b/test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java index 6c4be9dd770..369962a0308 100644 --- a/test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java +++ b/test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java @@ -553,6 +553,16 @@ public class IRNode { beforeMatchingNameRegex(LOAD_VECTOR, "LoadVector"); } + public static final String LOAD_VECTOR_GATHER = PREFIX + "LOAD_VECTOR_GATHER" + POSTFIX; + static { + beforeMatchingNameRegex(LOAD_VECTOR_GATHER, "LoadVectorGather"); + } + + public static final String LOAD_VECTOR_GATHER_MASKED = PREFIX + "LOAD_VECTOR_GATHER_MASKED" + POSTFIX; + static { + beforeMatchingNameRegex(LOAD_VECTOR_GATHER_MASKED, "LoadVectorGatherMasked"); + } + public static final String LONG_COUNTED_LOOP = PREFIX + "LONG_COUNTED_LOOP" + POSTFIX; static { String regex = START + "LongCountedLoop\\b" + MID + END; @@ -891,6 +901,16 @@ public class IRNode { beforeMatchingNameRegex(STORE_VECTOR, "StoreVector"); } + public static final String STORE_VECTOR_SCATTER = PREFIX + "STORE_VECTOR_SCATTER" + POSTFIX; + static { + beforeMatchingNameRegex(STORE_VECTOR_SCATTER, "StoreVectorScatter"); + } + + public static final String STORE_VECTOR_SCATTER_MASKED = PREFIX + "STORE_VECTOR_SCATTER_MASKED" + POSTFIX; + static { + beforeMatchingNameRegex(STORE_VECTOR_SCATTER_MASKED, "StoreVectorScatterMasked"); + } + public static final String SUB = PREFIX + "SUB" + POSTFIX; static { beforeMatchingNameRegex(SUB, "Sub(I|L|F|D)"); @@ -1066,16 +1086,56 @@ public class IRNode { machOnlyNameRegex(VFABD_MASKED, "vfabd_masked"); } + public static final String VFMSB_MASKED = PREFIX + "VFMSB_MASKED" + POSTFIX; + static { + machOnlyNameRegex(VFMSB_MASKED, "vfmsb_masked"); + } + + public 
static final String VFNMAD_MASKED = PREFIX + "VFNMAD_MASKED" + POSTFIX; + static { + machOnlyNameRegex(VFNMAD_MASKED, "vfnmad_masked"); + } + + public static final String VFNMSB_MASKED = PREFIX + "VFNMSB_MASKED" + POSTFIX; + static { + machOnlyNameRegex(VFNMSB_MASKED, "vfnmsb_masked"); + } + + public static final String VMASK_AND_NOT_L = PREFIX + "VMASK_AND_NOT_L" + POSTFIX; + static { + machOnlyNameRegex(VMASK_AND_NOT_L, "vmask_and_notL"); + } + public static final String VMLA = PREFIX + "VMLA" + POSTFIX; static { machOnlyNameRegex(VMLA, "vmla"); } + public static final String VMLA_MASKED = PREFIX + "VMLA_MASKED" + POSTFIX; + static { + machOnlyNameRegex(VMLA_MASKED, "vmla_masked"); + } + public static final String VMLS = PREFIX + "VMLS" + POSTFIX; static { machOnlyNameRegex(VMLS, "vmls"); } + public static final String VMLS_MASKED = PREFIX + "VMLS_MASKED" + POSTFIX; + static { + machOnlyNameRegex(VMLS_MASKED, "vmls_masked"); + } + + public static final String VNOT_I_MASKED = PREFIX + "VNOT_I_MASKED" + POSTFIX; + static { + machOnlyNameRegex(VNOT_I_MASKED, "vnotI_masked"); + } + + public static final String VNOT_L_MASKED = PREFIX + "VNOT_L_MASKED" + POSTFIX; + static { + machOnlyNameRegex(VNOT_L_MASKED, "vnotL_masked"); + } + public static final String XOR = PREFIX + "XOR" + POSTFIX; static { beforeMatchingNameRegex(XOR, "Xor(I|L)"); diff --git a/test/hotspot/jtreg/compiler/lib/ir_framework/driver/FlagVMProcess.java b/test/hotspot/jtreg/compiler/lib/ir_framework/driver/FlagVMProcess.java index 2a58d038e4a..f15d4bec398 100644 --- a/test/hotspot/jtreg/compiler/lib/ir_framework/driver/FlagVMProcess.java +++ b/test/hotspot/jtreg/compiler/lib/ir_framework/driver/FlagVMProcess.java @@ -1,5 +1,5 @@ /* - * Copyright (c) 2021, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2021, 2022, Oracle and/or its affiliates. All rights reserved. * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER. 
* * This code is free software; you can redistribute it and/or modify it diff --git a/test/hotspot/jtreg/compiler/vectorapi/AllBitsSetVectorMatchRuleTest.java b/test/hotspot/jtreg/compiler/vectorapi/AllBitsSetVectorMatchRuleTest.java index fd790e55975..e490174e380 100644 --- a/test/hotspot/jtreg/compiler/vectorapi/AllBitsSetVectorMatchRuleTest.java +++ b/test/hotspot/jtreg/compiler/vectorapi/AllBitsSetVectorMatchRuleTest.java @@ -98,7 +98,8 @@ public class AllBitsSetVectorMatchRuleTest { @Test @Warmup(10000) - @IR(counts = {IRNode.VAND_NOT_L, " >= 1" }) + @IR(counts = { IRNode.VAND_NOT_L, " >= 1" }, applyIf = {"UseSVE", "0"}) + @IR(counts = { IRNode.VMASK_AND_NOT_L, " >= 1" }, applyIf = {"UseSVE", "> 0"}) public static void testAllBitsSetMask() { VectorMask avm = VectorMask.fromArray(L_SPECIES, ma, 0); VectorMask bvm = VectorMask.fromArray(L_SPECIES, mb, 0); diff --git a/test/hotspot/jtreg/compiler/vectorapi/VectorFusedMultiplyAddSubTest.java b/test/hotspot/jtreg/compiler/vectorapi/VectorFusedMultiplyAddSubTest.java index e8966bd18d3..ece446bd197 100644 --- a/test/hotspot/jtreg/compiler/vectorapi/VectorFusedMultiplyAddSubTest.java +++ b/test/hotspot/jtreg/compiler/vectorapi/VectorFusedMultiplyAddSubTest.java @@ -224,7 +224,7 @@ public class VectorFusedMultiplyAddSubTest { } @Test - @IR(counts = { "vmla_masked", ">= 1" }) + @IR(counts = { IRNode.VMLA_MASKED, ">= 1" }) public static void testByteMultiplyAddMasked() { VectorMask mask = VectorMask.fromArray(B_SPECIES, m, 0); for (int i = 0; i < LENGTH; i += B_SPECIES.length()) { @@ -237,7 +237,7 @@ public class VectorFusedMultiplyAddSubTest { } @Test - @IR(counts = { "vmls_masked", ">= 1" }) + @IR(counts = { IRNode.VMLS_MASKED, ">= 1" }) public static void testByteMultiplySubMasked() { VectorMask mask = VectorMask.fromArray(B_SPECIES, m, 0); for (int i = 0; i < LENGTH; i += B_SPECIES.length()) { @@ -250,7 +250,7 @@ public class VectorFusedMultiplyAddSubTest { } @Test - @IR(counts = { "vmla_masked", ">= 1" }) + 
@IR(counts = { IRNode.VMLA_MASKED, ">= 1" }) public static void testShortMultiplyAddMasked() { VectorMask mask = VectorMask.fromArray(S_SPECIES, m, 0); for (int i = 0; i < LENGTH; i += S_SPECIES.length()) { @@ -263,7 +263,7 @@ public class VectorFusedMultiplyAddSubTest { } @Test - @IR(counts = { "vmls_masked", ">= 1" }) + @IR(counts = { IRNode.VMLS_MASKED, ">= 1" }) public static void testShortMultiplySubMasked() { VectorMask mask = VectorMask.fromArray(S_SPECIES, m, 0); for (int i = 0; i < LENGTH; i += S_SPECIES.length()) { @@ -276,7 +276,7 @@ public class VectorFusedMultiplyAddSubTest { } @Test - @IR(counts = { "vmla_masked", ">= 1" }) + @IR(counts = { IRNode.VMLA_MASKED, ">= 1" }) public static void testIntMultiplyAddMasked() { VectorMask mask = VectorMask.fromArray(I_SPECIES, m, 0); for (int i = 0; i < LENGTH; i += I_SPECIES.length()) { @@ -289,7 +289,7 @@ public class VectorFusedMultiplyAddSubTest { } @Test - @IR(counts = { "vmls_masked", ">= 1" }) + @IR(counts = { IRNode.VMLS_MASKED, ">= 1" }) public static void testIntMultiplySubMasked() { VectorMask mask = VectorMask.fromArray(I_SPECIES, m, 0); for (int i = 0; i < LENGTH; i += I_SPECIES.length()) { @@ -302,7 +302,7 @@ public class VectorFusedMultiplyAddSubTest { } @Test - @IR(counts = { "vmla_masked", ">= 1" }) + @IR(counts = { IRNode.VMLA_MASKED, ">= 1" }) public static void testLongMultiplyAddMasked() { VectorMask mask = VectorMask.fromArray(L_SPECIES, m, 0); for (int i = 0; i < LENGTH; i += L_SPECIES.length()) { @@ -315,7 +315,7 @@ public class VectorFusedMultiplyAddSubTest { } @Test - @IR(counts = { "vmls_masked", ">= 1" }) + @IR(counts = { IRNode.VMLS_MASKED, ">= 1" }) public static void testLongMultiplySubMasked() { VectorMask mask = VectorMask.fromArray(L_SPECIES, m, 0); for (int i = 0; i < LENGTH; i += L_SPECIES.length()) { @@ -328,7 +328,7 @@ public class VectorFusedMultiplyAddSubTest { } @Test - @IR(counts = { "vfmsb_masked", ">= 1" }) + @IR(counts = { IRNode.VFMSB_MASKED, ">= 1" }) public static 
void testFloatMultiplySubMasked() { VectorMask mask = VectorMask.fromArray(F_SPECIES, m, 0); for (int i = 0; i < LENGTH; i += F_SPECIES.length()) { @@ -341,7 +341,7 @@ public class VectorFusedMultiplyAddSubTest { } @Test - @IR(counts = { "vfnmad_masked", ">= 1" }) + @IR(counts = { IRNode.VFNMAD_MASKED, ">= 1" }) public static void testFloatNegatedMultiplyAddMasked() { VectorMask mask = VectorMask.fromArray(F_SPECIES, m, 0); for (int i = 0; i < LENGTH; i += F_SPECIES.length()) { @@ -354,7 +354,7 @@ public class VectorFusedMultiplyAddSubTest { } @Test - @IR(counts = { "vfnmsb_masked", ">= 1" }) + @IR(counts = { IRNode.VFNMSB_MASKED, ">= 1" }) public static void testFloatNegatedMultiplySubMasked() { VectorMask mask = VectorMask.fromArray(F_SPECIES, m, 0); for (int i = 0; i < LENGTH; i += F_SPECIES.length()) { @@ -367,7 +367,7 @@ public class VectorFusedMultiplyAddSubTest { } @Test - @IR(counts = { "vfmsb_masked", ">= 1" }) + @IR(counts = { IRNode.VFMSB_MASKED, ">= 1" }) public static void testDoubleMultiplySubMasked() { VectorMask mask = VectorMask.fromArray(D_SPECIES, m, 0); for (int i = 0; i < LENGTH; i += D_SPECIES.length()) { @@ -380,7 +380,7 @@ public class VectorFusedMultiplyAddSubTest { } @Test - @IR(counts = { "vfnmad_masked", ">= 1" }) + @IR(counts = { IRNode.VFNMAD_MASKED, ">= 1" }) public static void testDoubleNegatedMultiplyAddMasked() { VectorMask mask = VectorMask.fromArray(D_SPECIES, m, 0); for (int i = 0; i < LENGTH; i += D_SPECIES.length()) { @@ -393,7 +393,7 @@ public class VectorFusedMultiplyAddSubTest { } @Test - @IR(counts = { "vfnmsb_masked", ">= 1" }) + @IR(counts = { IRNode.VFNMSB_MASKED, ">= 1" }) public static void testDoubleNegatedMultiplySubMasked() { VectorMask mask = VectorMask.fromArray(D_SPECIES, m, 0); for (int i = 0; i < LENGTH; i += D_SPECIES.length()) { diff --git a/test/hotspot/jtreg/compiler/vectorapi/VectorGatherScatterTest.java b/test/hotspot/jtreg/compiler/vectorapi/VectorGatherScatterTest.java index fe2becc6d7a..fe626e5246d 
100644 --- a/test/hotspot/jtreg/compiler/vectorapi/VectorGatherScatterTest.java +++ b/test/hotspot/jtreg/compiler/vectorapi/VectorGatherScatterTest.java @@ -85,7 +85,7 @@ public class VectorGatherScatterTest { @Test @Warmup(10000) - @IR(counts = { "LoadVectorGather", ">= 1" }) + @IR(counts = { IRNode.LOAD_VECTOR_GATHER, ">= 1" }) public static void testLoadGather() { LongVector av = LongVector.fromArray(L_SPECIES, la, 0, ia, 0); av.intoArray(lr, 0); @@ -99,7 +99,7 @@ public class VectorGatherScatterTest { @Test @Warmup(10000) - @IR(counts = { "LoadVectorGatherMasked", ">= 1" }) + @IR(counts = { IRNode.LOAD_VECTOR_GATHER_MASKED, ">= 1" }) public static void testLoadGatherMasked() { VectorMask mask = VectorMask.fromArray(L_SPECIES, m, 0); LongVector av = LongVector.fromArray(L_SPECIES, la, 0, ia, 0, mask); @@ -114,7 +114,7 @@ public class VectorGatherScatterTest { @Test @Warmup(10000) - @IR(counts = { "StoreVectorScatter", ">= 1" }) + @IR(counts = { IRNode.STORE_VECTOR_SCATTER, ">= 1" }) public static void testStoreScatter() { DoubleVector av = DoubleVector.fromArray(D_SPECIES, da, 0); av.intoArray(dr, 0, ia, 0); @@ -128,7 +128,7 @@ public class VectorGatherScatterTest { @Test @Warmup(10000) - @IR(counts = { "StoreVectorScatterMasked", ">= 1" }) + @IR(counts = { IRNode.STORE_VECTOR_SCATTER_MASKED, ">= 1" }) public static void testStoreScatterMasked() { VectorMask mask = VectorMask.fromArray(D_SPECIES, m, 0); DoubleVector av = DoubleVector.fromArray(D_SPECIES, da, 0); diff --git a/test/hotspot/jtreg/compiler/vectorapi/VectorMaskedNotTest.java b/test/hotspot/jtreg/compiler/vectorapi/VectorMaskedNotTest.java index 62ec52d87b9..62b1062d31e 100644 --- a/test/hotspot/jtreg/compiler/vectorapi/VectorMaskedNotTest.java +++ b/test/hotspot/jtreg/compiler/vectorapi/VectorMaskedNotTest.java @@ -77,7 +77,7 @@ public class VectorMaskedNotTest { @Test @Warmup(10000) - @IR(counts = { "vnotI_masked", ">= 1" }) + @IR(counts = { IRNode.VNOT_I_MASKED, ">= 1" }) public static void 
testIntNotMasked() { VectorMask mask = VectorMask.fromArray(I_SPECIES, m, 0); IntVector av = IntVector.fromArray(I_SPECIES, ia, 0); @@ -95,7 +95,7 @@ public class VectorMaskedNotTest { @Test @Warmup(10000) - @IR(counts = { "vnotL_masked", ">= 1" }) + @IR(counts = { IRNode.VNOT_L_MASKED, ">= 1" }) public static void testLongNotMasked() { VectorMask mask = VectorMask.fromArray(L_SPECIES, m, 0); LongVector av = LongVector.fromArray(L_SPECIES, la, 0); ------------- PR: https://git.openjdk.org/jdk/pull/10695 From aph at openjdk.org Mon Oct 17 08:22:57 2022 From: aph at openjdk.org (Andrew Haley) Date: Mon, 17 Oct 2022 08:22:57 GMT Subject: RFR: 8294186: AArch64: VectorMaskToLong failed on SVE2 machine with -XX:UseSVE=1 [v2] In-Reply-To: <8DwuwmReKGKRgl34NleQUXepGoosyWlRxEsQxtj_vbE=.ca094a2e-8e40-41cf-918f-828e15a72799@github.com> References: <8DwuwmReKGKRgl34NleQUXepGoosyWlRxEsQxtj_vbE=.ca094a2e-8e40-41cf-918f-828e15a72799@github.com> Message-ID: On Wed, 28 Sep 2022 14:31:21 GMT, Eric Liu wrote: >> C2_MacroAssembler::sve_vmask_tolong would fail on a BITPERM-supported SVE2 machine with "-XX:UseSVE=1". >> >> `BITPERM` is an optional feature in SVE2. With this feature, VectorMaskToLong has a more efficient implementation. For other cases, it should generate SVE1 code. >> >> [TEST] >> jdk/incubator/vector, hotspot/compiler/vectorapi passed on a BITPERM-supported SVE2 machine, with option -XX:UseSVE=(0, 1, 2). > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > Refine comment > > Change-Id: I785817a0068098e9c48221cb391ef776186ef5de Marked as reviewed by aph (Reviewer).
------------- PR: https://git.openjdk.org/jdk/pull/10443 From tholenstein at openjdk.org Mon Oct 17 09:14:06 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 17 Oct 2022 09:14:06 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v19] In-Reply-To: References: Message-ID: > Cleanup of the code in IGV without changing the functionality. > > - removed dead code (unused classes, functions, variables) from the IGV code base > - merged (and removed) redundant functions > - added explicit position arguments to `layer.xml` - This avoids the position warning during building of IGV > - ordered the inputs alphabetically, and used wildcards if >= 5 imports of a particular package > - made class variables `final` whenever possible > - removed `this.` in `this.function()` function calls when it was not needed > - used lambdas instead of anonymous classes if possible > - fixed whitespace issues (e.g. double whitespace) > - removed the unneeded copy of `RangeSliderModel tempModel` in `RangeSliderModel.java` > - changed `EditorTopComponent` to take `InputGraph` as argument in constructor instead of `Diagram` and moved the creation of the `Diagram` to `DiagramViewModel.java` to increase encapsulation/modularity Tobias Holenstein has updated the pull request incrementally with three additional commits since the last revision: - remove whitespace Co-authored-by: Andrey Turbanov - add whitespace Co-authored-by: Andrey Turbanov - Update src/utils/IdealGraphVisualizer/Data/src/main/java/com/sun/hotspot/igv/data/InputEdge.java add whitespace Co-authored-by: Andrey Turbanov ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10197/files - new: https://git.openjdk.org/jdk/pull/10197/files/891adb1b..d5044588 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=18 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=17-18 Stats: 3 lines in 2 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/10197.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/10197/head:pull/10197 PR: https://git.openjdk.org/jdk/pull/10197 From ihse at openjdk.org Mon Oct 17 09:20:15 2022 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Mon, 17 Oct 2022 09:20:15 GMT Subject: RFR: 8295262: Build binutils out of source tree [v2] In-Reply-To: References: Message-ID: <_hJi-YdLKH3Sh2dZFiNoym6rGmrRdCGsCETpfbynh1I=.fb3f5139-3606-4ffc-82cc-2e9d1a11c2da@github.com> On Thu, 13 Oct 2022 08:41:35 GMT, Ludovic Henry wrote: >> Currently, when passing --with-binutils-src, binutils is built in the source tree. That leads to conflicting targets when compiling for different architectures (ex: amd64 on the host, and riscv64 or aarch64 for the target) from the same jdk source tree. >> >> The simplest solution is to build binutils out-of-tree and into the build//binutils folder. These out-of-tree builds are already supported by binutils and only require some changes in the way we invoke the binutils/configure command. > > Ludovic Henry has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Remove unrelated change > - Merge branch 'master' of github.com:openjdk/jdk into dev/ludovic/upstream-hsdis-cross-compile > - 8295262: Build binutils out of source tree > > Currently, when passing --with-binutils-src, binutils is built in the source tree. That leads to conflicting targets when compiling for different architectures (ex: amd64 on the host, and riscv64 or aarch64 for the target) from the same jdk source tree. > > The simplest solution is to build binutils out-of-tree and into the build//binutils folder. These out-of-tree builds are already supported by binutils and only require some changes in the way we invoke the binutils/configure command. Other than that, it looks good. 
make/autoconf/lib-hsdis.m4 line 137: > 135: UTIL_FIXUP_PATH(BINUTILS_SRC) > 136: > 137: BINUTILS_DIR="${OUTPUTDIR}/binutils" Suggestion: BINUTILS_DIR="$CONFIGURESUPPORT_OUTPUTDIR/binutils" Rationale: "configure-support" includes things that are created by the configure script. We try to avoid cluttering the top level build directory. ------------- Marked as reviewed by ihse (Reviewer). PR: https://git.openjdk.org/jdk/pull/10689 From luhenry at openjdk.org Mon Oct 17 09:37:08 2022 From: luhenry at openjdk.org (Ludovic Henry) Date: Mon, 17 Oct 2022 09:37:08 GMT Subject: RFR: 8295262: Build binutils out of source tree [v3] In-Reply-To: References: Message-ID: > Currently, when passing --with-binutils-src, binutils is built in the source tree. That leads to conflicting targets when compiling for different architectures (ex: amd64 on the host, and riscv64 or aarch64 for the target) from the same jdk source tree. > > The simplest solution is to build binutils out-of-tree and into the build//binutils folder. These out-of-tree builds are already supported by binutils and only require some changes in the way we invoke the binutils/configure command. 
Ludovic Henry has updated the pull request incrementally with one additional commit since the last revision: Update make/autoconf/lib-hsdis.m4 Co-authored-by: Magnus Ihse Bursie ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10689/files - new: https://git.openjdk.org/jdk/pull/10689/files/8faf5083..c78d2f02 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10689&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10689&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10689.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10689/head:pull/10689 PR: https://git.openjdk.org/jdk/pull/10689 From luhenry at openjdk.org Mon Oct 17 09:37:10 2022 From: luhenry at openjdk.org (Ludovic Henry) Date: Mon, 17 Oct 2022 09:37:10 GMT Subject: RFR: 8295262: Build binutils out of source tree [v2] In-Reply-To: <_hJi-YdLKH3Sh2dZFiNoym6rGmrRdCGsCETpfbynh1I=.fb3f5139-3606-4ffc-82cc-2e9d1a11c2da@github.com> References: <_hJi-YdLKH3Sh2dZFiNoym6rGmrRdCGsCETpfbynh1I=.fb3f5139-3606-4ffc-82cc-2e9d1a11c2da@github.com> Message-ID: <7ir20KAH1sJ8GnXNVSdB4tyqlj0tommVEnYdPsaBgZM=.008b3839-a21c-4295-a249-6ee3a086942d@github.com> On Mon, 17 Oct 2022 09:17:09 GMT, Magnus Ihse Bursie wrote: >> Ludovic Henry has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Remove unrelated change >> - Merge branch 'master' of github.com:openjdk/jdk into dev/ludovic/upstream-hsdis-cross-compile >> - 8295262: Build binutils out of source tree >> >> Currently, when passing --with-binutils-src, binutils is built in the source tree. That leads to conflicting targets when compiling for different architectures (ex: amd64 on the host, and riscv64 or aarch64 for the target) from the same jdk source tree. 
>> >> The simplest solution is to build binutils out-of-tree and into the build//binutils folder. These out-of-tree builds are already supported by binutils and only require some changes in the way we invoke the binutils/configure command. > > Other than that, it looks good. @magicus thanks! I committed your change. ------------- PR: https://git.openjdk.org/jdk/pull/10689 From ihse at openjdk.org Mon Oct 17 12:24:35 2022 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Mon, 17 Oct 2022 12:24:35 GMT Subject: RFR: 8295262: Build binutils out of source tree [v3] In-Reply-To: References: Message-ID: On Mon, 17 Oct 2022 09:37:08 GMT, Ludovic Henry wrote: >> Currently, when passing --with-binutils-src, binutils is built in the source tree. That leads to conflicting targets when compiling for different architectures (ex: amd64 on the host, and riscv64 or aarch64 for the target) from the same jdk source tree. >> >> The simplest solution is to build binutils out-of-tree and into the build//binutils folder. These out-of-tree builds are already supported by binutils and only require some changes in the way we invoke the binutils/configure command. > > Ludovic Henry has updated the pull request incrementally with one additional commit since the last revision: > > Update make/autoconf/lib-hsdis.m4 > > Co-authored-by: Magnus Ihse Bursie Marked as reviewed by ihse (Reviewer). If you redo `/integrate`, I can sponsor. 
------------- PR: https://git.openjdk.org/jdk/pull/10689 From tholenstein at openjdk.org Mon Oct 17 12:34:50 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 17 Oct 2022 12:34:50 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v18] In-Reply-To: <27ETKLx8aWB7ixXbmfDXnGFFcFQNNnOLx9WHvPDQPNQ=.17cf231d-c617-4a15-ac96-077995b0e9ce@github.com> References: <27ETKLx8aWB7ixXbmfDXnGFFcFQNNnOLx9WHvPDQPNQ=.17cf231d-c617-4a15-ac96-077995b0e9ce@github.com> Message-ID: <-MYsNg2UbiUMkdAAv_tqkB0B9VZpyrI2-Ub3w1FkMpk=.4889d1bb-c1fb-4d6b-9c4d-ecd2bf05728c@github.com> On Tue, 4 Oct 2022 16:57:56 GMT, Roberto Castañeda Lozano wrote: > select Hi @robcasloz , this looks like https://bugs.openjdk.org/browse/JDK-8294565 ? If so, it was already present before this change-set ------------- PR: https://git.openjdk.org/jdk/pull/10197 From shade at openjdk.org Mon Oct 17 13:24:48 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 17 Oct 2022 13:24:48 GMT Subject: RFR: 8295268: Optimized builds are broken due to incorrect assert_is_rfp shortcuts In-Reply-To: References: Message-ID: <7lxFBVNoY-DB-PyZo_Cto3arAJoLN3Dw959Wut90u7w=.1b7e8c72-fa5a-466e-96c0-647ceae47fed@github.com> On Thu, 13 Oct 2022 12:03:16 GMT, Aleksey Shipilev wrote: > Fails on many platforms, for example arm32: Any takers? Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10696 From eliu at openjdk.org Mon Oct 17 13:32:54 2022 From: eliu at openjdk.org (Eric Liu) Date: Mon, 17 Oct 2022 13:32:54 GMT Subject: Integrated: 8294186: AArch64: VectorMaskToLong failed on SVE2 machine with -XX:UseSVE=1 In-Reply-To: References: Message-ID: <9N5E8qgFssPIPjGp4Z_UnwUJjEoJqD1JbjuEKnlgMQo=.c3c8aa62-6e30-4493-8277-4a690d2e9571@github.com> On Tue, 27 Sep 2022 09:48:24 GMT, Eric Liu wrote: > C2_MacroAssembler::sve_vmask_tolong would fail on a BITPERM-supported SVE2 machine with "-XX:UseSVE=1". > > `BITPERM` is an optional feature in SVE2.
With this feature, VectorMaskToLong has a more efficient implementation. For other cases, it should generate SVE1 code. > > [TEST] > jdk/incubator/vector, hotspot/compiler/vectorapi passed on a BITPERM-supported SVE2 machine, with option -XX:UseSVE=(0, 1, 2). This pull request has now been integrated.
Changeset: 4d37ef2d Author: Ludovic Henry Committer: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/4d37ef2d545c016e6c3ad52171ea961d4406726f Stats: 24 lines in 1 file changed: 12 ins; 2 del; 10 mod 8295262: Build binutils out of source tree Reviewed-by: shade, erikj, ihse ------------- PR: https://git.openjdk.org/jdk/pull/10689 From kvn at openjdk.org Mon Oct 17 14:34:50 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 17 Oct 2022 14:34:50 GMT Subject: RFR: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov [v4] In-Reply-To: References: Message-ID: On Mon, 17 Oct 2022 06:47:21 GMT, Fei Gao wrote: >> After JDK-8139340, JDK-8192846 and JDK-8289422, we can vectorize >> the case below by enabling -XX:+UseCMoveUnconditionally and >> -XX:+UseVectorCmov: >> >> // double[] a, double[] b, double[] c; >> for (int i = 0; i < a.length; i++) { >> c[i] = (a[i] > b[i]) ? a[i] : b[i]; >> } >> >> >> But we don't support a case like: >> >> // double[] a; >> // int seed; >> for (int i = 0; i < a.length; i++) { >> a[i] = (i % 2 == 0) ? seed + i : seed - i; >> } >> >> because the IR nodes for the CMoveD in the loop are: >> >> AddI AndI AddD SubD >> \ / / / >> CmpI / / >> \ / / >> Bool / / >> \ / / >> CMoveD >> >> >> and it is not our target pattern, which requires that the inputs >> of the Cmp node must be the same as the inputs of the CMove node, >> as commented in CMoveKit::make_cmovevd_pack(). Because >> we can't vectorize the CMoveD pack, we shouldn't vectorize >> its inputs, AddD and SubD. But the current function >> CMoveKit::make_cmovevd_pack() doesn't clear the unqualified >> CMoveD pack from the packset. In this way, superword wrongly >> vectorizes AddD and SubD. Finally, we get a scalar CMoveD >> node with two vector inputs, AddVD and SubVD, which has >> wrong mixing types, then the assertion fails. >> >> To fix it, we need to remove the unvectorized CMoveD pack >> from the packset and clear related map info.
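The two loop shapes quoted above can be reproduced in plain Java. This is a minimal scalar sketch of their semantics only — the class and method names are illustrative, not part of the patch — and it shows why the second loop's compare (`i % 2 == 0`) has different inputs than its conditional move (`seed + i` / `seed - i`):

```java
// Scalar sketch of the two loop shapes discussed in JDK-8293833.
// The first (lane-wise max) matches the CMoveVD target pattern: the Cmp
// inputs (a[i], b[i]) are the same as the CMove inputs. The second does
// not: the Cmp tests the induction variable while the CMove selects
// between seed + i and seed - i (the AddD/SubD inputs).
public class CMoveShapes {
    // Vectorizable shape: Cmp inputs == CMove inputs.
    static void maxLoop(double[] a, double[] b, double[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = (a[i] > b[i]) ? a[i] : b[i];
        }
    }

    // Non-vectorizable shape: Cmp and CMove inputs differ.
    static void parityLoop(double[] a, int seed) {
        for (int i = 0; i < a.length; i++) {
            a[i] = (i % 2 == 0) ? seed + i : seed - i;
        }
    }

    public static void main(String[] args) {
        double[] a = {1.0, 5.0, 3.0};
        double[] b = {2.0, 4.0, 6.0};
        double[] c = new double[3];
        maxLoop(a, b, c);
        System.out.println(java.util.Arrays.toString(c)); // [2.0, 5.0, 6.0]

        double[] d = new double[4];
        parityLoop(d, 10);
        System.out.println(java.util.Arrays.toString(d)); // [10.0, 9.0, 12.0, 7.0]
    }
}
```

Whether or not superword vectorizes either loop, both must compute exactly these results, which is what the regression test checks across the flag combinations.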
> Fei Gao has updated the pull request incrementally with one additional commit since the last revision: > > Update IR framework testcase Thank you for updating the test. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10627 From tholenstein at openjdk.org Mon Oct 17 15:40:59 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 17 Oct 2022 15:40:59 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v20] In-Reply-To: References: Message-ID: <0FbySLoyt8bfmd94aMZ2cHKwjWEqZqjeWY6OGZPx6x8=.f34a5aa9-6941-44db-86b0-09f11cb31267@github.com> > Cleanup of the code in IGV without changing the functionality. > > - removed dead code (unused classes, functions, variables) from the IGV code base > - merged (and removed) redundant functions > - added explicit position arguments to `layer.xml` - This avoids the position warning during building of IGV > - ordered the inputs alphabetically, and used wildcards if >= 5 imports of a particular package > - made class variables `final` whenever possible > - removed `this.` in `this.function()` function calls when it was not needed > - used lambdas instead of anonymous classes if possible > - fixed whitespace issues (e.g.
double whitespace) > - removed the unneeded copy of `RangeSliderModel tempModel` in `RangeSliderModel.java` > - changed `EditorTopComponent` to take `InputGraph` as argument in constructor instead of `Diagram` and moved the creation of the `Diagram` to `DiagramViewModel.java` to increase encapsulation/modularity Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: re-add ConnectionSet class is needed to highlight hovered edges ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10197/files - new: https://git.openjdk.org/jdk/pull/10197/files/d5044588..96425deb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=19 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=18-19 Stats: 22 lines in 1 file changed: 14 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/10197.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10197/head:pull/10197 PR: https://git.openjdk.org/jdk/pull/10197 From shade at openjdk.org Mon Oct 17 18:21:48 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 17 Oct 2022 18:21:48 GMT Subject: RFR: 8294467: Fix sequence-point warnings in Hotspot [v2] In-Reply-To: References: Message-ID: > There seems to be only one place in Hotspot where this warning fires, yet the warning is disabled wholesale for Hotspot. This is not good. > > I can trace the addition of the sequence-point exclusion to [JDK-8211029](https://bugs.openjdk.org/browse/JDK-8211029) (Sep 2018), yet the only place where it triggers was introduced by [JDK-8259609](https://bugs.openjdk.org/browse/JDK-8259609) (Oct 2021). It seems other places were fixed in the meantime. > > I believe the fixed place is just a simple leftover. Right, @rwestrel?
> > Additional testing: > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_64 fastdebug `tier2` > - [x] The build matrix of: > - GCC 10 > - {i686, x86_64, aarch64, powerpc64le, s390x, armhf, riscv64} > - {server} > - {release, fastdebug} Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: - Merge branch 'master' into JDK-8294467-warning-sequence-point - Fix ------------- Changes: https://git.openjdk.org/jdk/pull/10454/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10454&range=01 Stats: 2 lines in 2 files changed: 0 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10454.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10454/head:pull/10454 PR: https://git.openjdk.org/jdk/pull/10454 From jiefu at openjdk.org Tue Oct 18 01:22:44 2022 From: jiefu at openjdk.org (Jie Fu) Date: Tue, 18 Oct 2022 01:22:44 GMT Subject: RFR: 8295268: Optimized builds are broken due to incorrect assert_is_rfp shortcuts In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 12:03:16 GMT, Aleksey Shipilev wrote: > Fails on many platforms, for example arm32: LGTM ------------- Marked as reviewed by jiefu (Reviewer). PR: https://git.openjdk.org/jdk/pull/10696 From fgao at openjdk.org Tue Oct 18 01:27:02 2022 From: fgao at openjdk.org (Fei Gao) Date: Tue, 18 Oct 2022 01:27:02 GMT Subject: RFR: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov [v4] In-Reply-To: References: Message-ID: On Mon, 17 Oct 2022 06:47:21 GMT, Fei Gao wrote: >> After JDK-8139340, JDK-8192846 and JDK-8289422, we can vectorize >> the case below by enabling -XX:+UseCMoveUnconditionally and >> -XX:+UseVectorCmov: >> >> // double[] a, double[] b, double[] c; >> for (int i = 0; i < a.length; i++) { >> c[i] = (a[i] > b[i]) ? 
a[i] : b[i]; >> } >> >> >> But we don't support a case like: >> >> // double[] a; >> // int seed; >> for (int i = 0; i < a.length; i++) { >> a[i] = (i % 2 == 0) ? seed + i : seed - i; >> } >> >> because the IR nodes for the CMoveD in the loop are: >> >> AddI AndI AddD SubD >> \ / / / >> CmpI / / >> \ / / >> Bool / / >> \ / / >> CMoveD >> >> >> and it is not our target pattern, which requires that the inputs >> of the Cmp node must be the same as the inputs of the CMove node, >> as commented in CMoveKit::make_cmovevd_pack(). Because >> we can't vectorize the CMoveD pack, we shouldn't vectorize >> its inputs, AddD and SubD. But the current function >> CMoveKit::make_cmovevd_pack() doesn't clear the unqualified >> CMoveD pack from the packset. In this way, superword wrongly >> vectorizes AddD and SubD. Finally, we get a scalar CMoveD >> node with two vector inputs, AddVD and SubVD, which has >> wrongly mixed types, and the assertion fails. >> >> To fix it, we need to remove the unvectorized CMoveD pack >> from the packset and clear the related map info. > Fei Gao has updated the pull request incrementally with one additional commit since the last revision: > > Update IR framework testcase Thanks all for your review and comments. I'll integrate it. ------------- PR: https://git.openjdk.org/jdk/pull/10627 From xgong at openjdk.org Tue Oct 18 01:44:21 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 18 Oct 2022 01:44:21 GMT Subject: RFR: 8293409: [vectorapi] Intrinsify VectorSupport.indexVector [v2] In-Reply-To: References: Message-ID: > "`VectorSupport.indexVector()`" is used to compute a vector that contains the index values based on a given vector and a scale value (`i.e. index = vec + iota * scale`). This function is widely used in other APIs like "`VectorMask.indexInRange`", which is useful for tail loop vectorization. And it can be easily implemented with vector instructions.
The steps are: > > 1) Load the const "iota" vector. > > We extend the "`vector_iota_indices`" stubs from byte to other integral types. For floating point vectors, it needs an additional vector cast to get the right iota values. > > 2) Compute indexes with "`vec + iota * scale`" > > Here is the performance result to the new added micro benchmark on ARM NEON: > > Benchmark Gain > IndexVectorBenchmark.byteIndexVector 1.477 > IndexVectorBenchmark.doubleIndexVector 5.031 > IndexVectorBenchmark.floatIndexVector 5.342 > IndexVectorBenchmark.intIndexVector 5.529 > IndexVectorBenchmark.longIndexVector 3.177 > IndexVectorBenchmark.shortIndexVector 5.841 > > > Please help to review and share the feedback! Thanks in advance! Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Add the floating point support for VectorLoadConst and remove the VectorCast - Merge branch 'master' into JDK-8293409 - 8293409: [vectorapi] Intrinsify VectorSupport.indexVector ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10332/files - new: https://git.openjdk.org/jdk/pull/10332/files/2ad157b6..53f042d3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10332&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10332&range=00-01 Stats: 50675 lines in 1239 files changed: 30581 ins; 14395 del; 5699 mod Patch: https://git.openjdk.org/jdk/pull/10332.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10332/head:pull/10332 PR: https://git.openjdk.org/jdk/pull/10332 From xgong at openjdk.org Tue Oct 18 01:44:21 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 18 Oct 2022 01:44:21 GMT Subject: RFR: 8293409: [vectorapi] Intrinsify VectorSupport.indexVector In-Reply-To: References: Message-ID: 
<0yv4LhxY5GqaiuhoxdB7tmmJlik-m9B_2BYWkdDCSTU=.0c97a482-164d-4d14-8a3e-8a6b2c3a34c6@github.com> On Mon, 19 Sep 2022 08:51:24 GMT, Xiaohong Gong wrote: > "`VectorSupport.indexVector()`" is used to compute a vector that contains the index values based on a given vector and a scale value (`i.e. index = vec + iota * scale`). This function is widely used in other APIs like "`VectorMask.indexInRange`" which is useful to the tail loop vectorization. And it can be easily implemented with the vector instructions. > > This patch adds the vector intrinsic implementation of it. The steps are: > > 1) Load the const "iota" vector. > > We extend the "`vector_iota_indices`" stubs from byte to other integral types. For floating point vectors, it needs an additional vector cast to get the right iota values. > > 2) Compute indexes with "`vec + iota * scale`" > > Here is the performance result to the new added micro benchmark on ARM NEON: > > Benchmark Gain > IndexVectorBenchmark.byteIndexVector 1.477 > IndexVectorBenchmark.doubleIndexVector 5.031 > IndexVectorBenchmark.floatIndexVector 5.342 > IndexVectorBenchmark.intIndexVector 5.529 > IndexVectorBenchmark.longIndexVector 3.177 > IndexVectorBenchmark.shortIndexVector 5.841 > > > Please help to review and share the feedback! Thanks in advance! Hi @jatin-bhateja , all your comments have been addressed. Please help to look at the changes again! Thanks in advance! 
------------- PR: https://git.openjdk.org/jdk/pull/10332 From xgong at openjdk.org Tue Oct 18 01:44:21 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 18 Oct 2022 01:44:21 GMT Subject: RFR: 8293409: [vectorapi] Intrinsify VectorSupport.indexVector [v2] In-Reply-To: <_wyFWAET_qXwwj-9Iq9AsPAGbT3AXIwN6HujmwZVRPw=.9c652886-4255-4c03-89d9-e3c74f9f319a@github.com> References: <_wyFWAET_qXwwj-9Iq9AsPAGbT3AXIwN6HujmwZVRPw=.9c652886-4255-4c03-89d9-e3c74f9f319a@github.com> Message-ID: On Thu, 13 Oct 2022 07:04:25 GMT, Jatin Bhateja wrote: >> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Add the floating point support for VectorLoadConst and remove the VectorCast >> - Merge branch 'master' into JDK-8293409 >> - 8293409: [vectorapi] Intrinsify VectorSupport.indexVector > > src/hotspot/share/opto/vectorIntrinsics.cpp line 2949: > >> 2947: } else if (elem_bt == T_DOUBLE) { >> 2948: iota = gvn().transform(new VectorCastL2XNode(iota, vt)); >> 2949: } > > Since we are loading constants from stub-initialized memory locations, defining new stubs for floating point iota indices may eliminate the need for costly conversion instructions. Especially on x86, conversion between Long and Double is only supported by AVX512DQ targets, and intrinsification may fail for legacy targets. Makes sense to me! I've changed the code based on the suggestion in the latest commit. Please take another look! Thanks a lot!
------------- PR: https://git.openjdk.org/jdk/pull/10332 From njian at openjdk.org Tue Oct 18 02:01:07 2022 From: njian at openjdk.org (Ningsheng Jian) Date: Tue, 18 Oct 2022 02:01:07 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v2] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 10:12:42 GMT, Bhavana Kilambi wrote: >> Arm ISA v8.2A and v9.0A include the SHA3 feature extensions, and one of those SHA3 instructions, "eor3", performs an exclusive OR of three vectors. This is helpful in applications that have multiple consecutive "eor" operations, which can be reduced by combining them into fewer operations using the "eor3" instruction. For example - >> >> eor a, a, b >> eor a, a, c >> >> can be optimized to a single instruction - `eor3 a, b, c` >> >> This patch adds backend rules for the Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports the Neon, SVE2 and SHA3 features - >> >> >> Benchmark gain >> TestEor3.test1Int 10.87% >> TestEor3.test1Long 8.84% >> TestEor3.test2Int 21.68% >> TestEor3.test2Long 21.04% >> >> >> The numbers shown are the performance gains from using the Neon eor3 instruction over the master branch, which uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well, since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits, which makes the SVE2 code generation very similar to the Neon one. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Modified JTREG test to include feature constraints lgtm ------------- Marked as reviewed by njian (Committer).
PR: https://git.openjdk.org/jdk/pull/10407 From fgao at openjdk.org Tue Oct 18 02:05:04 2022 From: fgao at openjdk.org (Fei Gao) Date: Tue, 18 Oct 2022 02:05:04 GMT Subject: Integrated: 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov In-Reply-To: References: Message-ID: On Mon, 10 Oct 2022 06:12:11 GMT, Fei Gao wrote: > After JDK-8139340, JDK-8192846 and JDK-8289422, we can vectorize > the case below by enabling -XX:+UseCMoveUnconditionally and > -XX:+UseVectorCmov: > > // double[] a, double[] b, double[] c; > for (int i = 0; i < a.length; i++) { > c[i] = (a[i] > b[i]) ? a[i] : b[i]; > } > > > But we don't support the case like: > > // double[] a; > // int seed; > for (int i = 0; i < a.length; i++) { > a[i] = (i % 2 == 0) ? seed + i : seed - i; > } > > because the IR nodes for the CMoveD in the loop is: > > AddI AndI AddD SubD > \ / / / > CmpI / / > \ / / > Bool / / > \ / / > CMoveD > > > and it is not our target pattern, which requires that the inputs > of Cmp node must be the same as the inputs of CMove node > as commented in CMoveKit::make_cmovevd_pack(). Because > we can't vectorize the CMoveD pack, we shouldn't vectorize > its inputs, AddD and SubD. But the current function > CMoveKit::make_cmovevd_pack() doesn't clear the unqualified > CMoveD pack from the packset. In this way, superword wrongly > vectorizes AddD and SubD. Finally, we get a scalar CMoveD > node with two vector inputs, AddVD and SubVD, which has > wrong mixing types, then the assertion fails. > > To fix it, we need to remove the unvectorized CMoveD pack > from the packset and clear related map info. This pull request has now been integrated. 
Changeset: 490fcd0c Author: Fei Gao Committer: Ningsheng Jian URL: https://git.openjdk.org/jdk/commit/490fcd0c2547cb4e564363f0cd121c777c3acc02 Stats: 134 lines in 5 files changed: 76 ins; 6 del; 52 mod 8293833: Error mixing types with -XX:+UseCMoveUnconditionally -XX:+UseVectorCmov Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/10627 From shade at openjdk.org Tue Oct 18 08:15:49 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 18 Oct 2022 08:15:49 GMT Subject: RFR: 8295268: Optimized builds are broken due to incorrect assert_is_rfp shortcuts In-Reply-To: References: Message-ID: On Tue, 18 Oct 2022 01:19:20 GMT, Jie Fu wrote: > LGTM Thank you! Trivial, or? ------------- PR: https://git.openjdk.org/jdk/pull/10696 From jiefu at openjdk.org Tue Oct 18 08:22:17 2022 From: jiefu at openjdk.org (Jie Fu) Date: Tue, 18 Oct 2022 08:22:17 GMT Subject: RFR: 8295268: Optimized builds are broken due to incorrect assert_is_rfp shortcuts In-Reply-To: References: Message-ID: On Tue, 18 Oct 2022 08:12:11 GMT, Aleksey Shipilev wrote: > > LGTM > > Thank you! Trivial, or? Yes, I think it's trivial. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10696 From shade at openjdk.org Tue Oct 18 10:03:04 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 18 Oct 2022 10:03:04 GMT Subject: RFR: 8295268: Optimized builds are broken due to incorrect assert_is_rfp shortcuts In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 12:03:16 GMT, Aleksey Shipilev wrote: > Fails on many platforms, for example arm32: Thanks! 
------------- PR: https://git.openjdk.org/jdk/pull/10696 From shade at openjdk.org Tue Oct 18 10:03:05 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 18 Oct 2022 10:03:05 GMT Subject: Integrated: 8295268: Optimized builds are broken due to incorrect assert_is_rfp shortcuts In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 12:03:16 GMT, Aleksey Shipilev wrote: > Fails on many platforms, for example arm32: This pull request has now been integrated. Changeset: e7a964b4 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/e7a964b4dbbdd21eba87dc94eb3680e9553f5039 Stats: 5 lines in 5 files changed: 0 ins; 0 del; 5 mod 8295268: Optimized builds are broken due to incorrect assert_is_rfp shortcuts Reviewed-by: jiefu ------------- PR: https://git.openjdk.org/jdk/pull/10696 From shade at openjdk.org Tue Oct 18 12:08:42 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 18 Oct 2022 12:08:42 GMT Subject: RFR: 8295469: S390X: Optimized builds are broken Message-ID: * For target hotspot_variant-server_libjvm_objs_BUILD_LIBJVM_link: ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/10743/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10743&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295469 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10743.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10743/head:pull/10743 PR: https://git.openjdk.org/jdk/pull/10743 From stuefe at openjdk.org Tue Oct 18 12:46:49 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Tue, 18 Oct 2022 12:46:49 GMT Subject: RFR: 8295469: S390X: Optimized builds are broken In-Reply-To: References: Message-ID: On Tue, 18 Oct 2022 11:53:20 GMT, Aleksey Shipilev wrote: > * For target hotspot_variant-server_libjvm_objs_BUILD_LIBJVM_link: +1, trivial. Thanks for fixing. ------------- Marked as reviewed by stuefe (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10743 From chagedorn at openjdk.org Tue Oct 18 14:42:08 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 18 Oct 2022 14:42:08 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases In-Reply-To: <-sdsXTMvVcizkj8iRuAroLF3rTCTX_TC0kqjrbC1AhQ=.5121a3a4-07ef-41ec-9c5d-cd7144df8c9f@github.com> References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> <-sdsXTMvVcizkj8iRuAroLF3rTCTX_TC0kqjrbC1AhQ=.5121a3a4-07ef-41ec-9c5d-cd7144df8c9f@github.com> Message-ID: On Fri, 14 Oct 2022 07:26:58 GMT, Roberto Castañeda Lozano wrote: >> This patch extends the IR framework with the capability to perform IR matching not only on the `PrintIdeal` and `PrintOptoAssembly` flag outputs but also on the `PrintIdeal` output of the different compile phases defined by the `COMPILER_PHASES` macro in `phasetype.hpp`: >> >> https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/opto/phasetype.hpp#L28-L29 >> >> The IR framework uses the same compile phases with the same names (it only drops those which are part of a normal execution like `PHASE_DEBUG` or `PHASE_FAILURE`). Matching on the `PrintIdeal` and `PrintOptoAssembly` flags is still possible. The IR framework creates its own `CompilePhase` enum entry for them to simplify the implementation. >> >> ## How does it work?
>> >> ### Basic idea >> There is a new `phase` attribute for `@IR` annotations that allows the user to specify a list of compile phases on which the IR rule should be applied on: >> >> >> int iFld; >> >> @Test >> @IR(counts = {IRNode.STORE_I, "1"}, >> phase = {CompilePhase.AFTER_PARSING, // Fails >> CompilePhase.ITER_GVN1}) // Works >> public void optimizeStores() { >> iFld = 42; >> iFld = 42 + iFld; // Removed in first IGVN iteration and replaced by iFld = 84 >> } >> >> In this example, we apply the IR rule on the compile phases `AFTER_PARSING` and `ITER_GVN1`. Since the first store to `iFld` is only removed in the first IGVN iteration, the IR rule fails for `AFTER_PARSING` while it passes for `ITER_GVN1`: >> >> 1) Method "public void ir_framework.examples.IRExample.optimizeStores()" - [Failed IR rules: 1]: >> * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING, ITER_GVN1}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#STORE_I#_", "1"}, failOn={}, applyIfAnd={}, applyIfOr={}, applyIfNot={})" >> > Phase "After Parsing": >> - counts: Graph contains wrong number of nodes: >> * Constraint 1: "(\d+(\s){2}(StoreI.*)+(\s){2}===.*)" >> - Failed comparison: [found] 2 = 1 [given] >> - Matched nodes (2): >> * 25 StoreI === 5 7 24 21 [[ 31 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:3 (line 123) >> * 31 StoreI === 5 25 24 29 [[ 16 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:14 (line 124) >> >> >> More examples are shown in `IRExample.java` and also in the new test `TestPhaseIRMatching.java`. >> >> ### CompilePhase.DEFAULT - default compile phase >> The existing IR tests either match on the `PrintIdeal` and/or `PrintOptoAssembly` flag. 
Looking closer at the individual `@IR` rules, we can see that a single regex only matches **either** on `PrintIdeal` **or** on `PrintOptoAssembly` but never on both. Most of these regexes are taken from the class `IRNode` which currently provides default regexes for various C2 IR nodes (a default regex either matches on `PrintIdeal` or `PrintOptoAssembly`). >> >> Not only the existing IR tests but also the majority of future test do not need this new flexibility - they simply want to default match on the `PrintIdeal` flag. To avoid having to specify `phase = CompilePhase.PRINT_IDEAL` for each new rule (and also to avoid updating the large number of existing IR tests), we introduce a new compile phase `CompilePhase.DEFAULT` which is used by default if the user does not specify the `phase` attribute. >> >> Each entry in class `IRNode` now needs to define a default phase such that the IR framework knows which compile phase an `IRNode` should be matched on if `CompilePhase.DEFAULT` is selected. >> >> ### Different regexes for the same IRNode entry >> A regex for an IR node might look different for certain compile phases. For example, we can directly match the node name `Allocate` before macro expansion for an `Allocate` node. But we are only able to match allocations again on the `PrintOptoAssembly` output (the current option) which requires a different regex. Therefore, the `IRNode` class is redesigned in the following way: >> >> - `IRNode` entries which currently represent direct regexes are replaced by "IR node placeholder strings". These strings are just encodings recognized by the IR framework as IR nodes: >> >> public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; // Normal IR node >> public static final String ALLOC_OF = COMPOSITE_PREFIX + "ALLOC_OF" + POSTFIX; // Composite IR node >> >> - For each _IR node placeholder string_, we need to define which regex should be used for which compile phase. 
Additionally, we need to specify the default compile phase for this IR node. This is done in a static block immediately following the IR node placeholder string where we add a mapping to `IRNode.IR_NODE_MAPPINGS` (by using helper methods such as `allocNodes()`): >> >> public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; >> static { >> String idealIndependentRegex = START + "Allocate" + MID + END; >> String optoRegex = "(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java" + END; >> allocNodes(ALLOC, idealIndependentRegex, optoRegex); >> } >> >> **Thus, adding a new IR node now requires two steps: Defining an IR node placeholder string and an associated regex-compile-phase-mapping in the static block immediately following the IR node placeholder string. This mapping should reflect in which compile phases the new IR node can be found.** >> >> ### Using the IRNode entries correctly >> The IR framework enforces the correct usage of the new IR node placeholder strings. It reports format violations if: >> - An `IRNode` entry is used for a compile phase which is not supported (e.g. trying to match a `LoopNode` on the output of `CompilePhase.AFTER_PARSING`). >> - An IR node entry is used without specifying a mapping in a static block (e.g. when creating a new entry and forgetting to set up the mapping). >> - Using a user-defined regex in `@IR` without specifying a non-default compile phase in `phase`. `CompilePhase.DEFAULT` only works with IR node placeholder strings. In all other cases, the user must explicitly specify one or more non-default compile phases. >> >> ## General Changes >> The patch became larger than I intended it to be. I've tried to split it into smaller patches but that was quite difficult and I eventually gave up. I therefore try to summarize the changes to make reviewing this simpler: >> >> - Added more packages to better group related classes together.
>> - Removed unused IGV phases in `phasetype.hpp` and sorted them in the order how they are used throughout a compilation. >> - Updated existing, now failing, IR tests that use user-defined regexes. These now require a non-default phase. I've fixed that by replacing the user-defined regexes with `IRNode` entries (required some new `IRNode` entries for missing IR nodes). >> - Introduced a new interface `Matchable.java` (for all classes representing a part on which IR matching can be done such as `IRMethod` or `IRRule`) to use it throughout the code instead of concrete class references. This allows better code sharing together with the already added `MatchResult.java` interface. Using interface types also simplifies testing as everything is substitutable. Each `Matchable`/`MatchResult` object now just contains a list of `Matchable`/`MatchResult` objects (e.g. an `IRMethod`/`IRMethodResult` contains a list of `IRRule`/`IRRuleMatchResult` objects to represent all IR rules/IR rule match results of this method etc.) >> - Cleaned up and refactored a lot of code to use this new design. >> - Using visitor pattern to visit the `MatchResult` objects. This simplifies the failure message and compilation output printing of the different compile phases. It allows us to perform specific operations at each level (i.e. on a method level, an IR rule level etc.). Visitors are also used in testing. >> - New compile phase related classes in package `phase`. The idea is that each `IRRule`/`IRRuleMatchResult` class contains a list of `CompilePhaseIRRule`/`CompilePhaseIRRuleMatchResult` objects for each compile phase. When `CompilePhase.DEFAULT` is used, we create or re-use `CompilePhaseIRRule` objects which represent the specified default phases of each `IRNode` entry. >> - The `FlagVM` now collects all the needed compile phases and creates a compiler directives file for the `TestVM` accordingly. We now only print the output which is also used for IR matching. 
Before this change, we always emitted `PrintIdeal` and `PrintOptoAssembly`. Even when not using the new matching on compile phases, we still get a performance benefit when only using `IRNode` entries which match on either `PrintIdeal` or `PrintOptoAssembly`. >> - The failure message and compilation output are sorted alphabetically by method names and by the enum definition order of `CompilePhase`. >> - Replaced implementation inheritance by interfaces. >> - Improved encapsulation of object data. >> - Updated README and many comments/class descriptions to reflect this new feature. >> - Added new IR framework tests. >> >> ## Testing >> - Normal tier testing. >> - Applying the patch to Valhalla to perform tier testing. >> - Testing with a ZGC branch by @robcasloz which does IR matching on mach graph compile phases. This requirement was the original motivation for this RFE. Thanks Roberto for helping to test this! >> >> Thanks, >> Christian > > test/hotspot/jtreg/compiler/lib/ir_framework/CompilePhase.java line 79: > >> 77: BEFORE_MATCHING("Before matching"), >> 78: MATCHING("After matching", RegexType.MACH), >> 79: GLOBAL_CODE_MOTION("Global code motion", RegexType.MACH), > > `MACHANALYSIS` is missing here, is this intentional? Good catch! That's a mistake. I missed adding that when merging in master at some point > test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/mapping/MultiPhaseRangeEntry.java line 51: > >> 49: >> 50: /** >> 51: * Checks that there is no compile phase overlap of > > Incomplete comment? Indeed, I've completed that now. > test/hotspot/jtreg/testlibrary_tests/ir_framework/examples/IRExample.java line 104: > >> 102: * @see Test >> 103: * @see TestFramework >> 104: */ > > A large part of this documentation is duplicated from `test/hotspot/jtreg/compiler/lib/ir_framework/README.md`. For better maintainability I suggest to remove the duplicated text here and add a reference to `README.md`. You're right.
I originally only had the text in the `IRExample.java` file but then I thought it might be better to have it in the `README` file as well - but then we can indeed get rid of the text in `IRExample.java`. I've cleaned that up. ------------- PR: https://git.openjdk.org/jdk/pull/10695 From chagedorn at openjdk.org Tue Oct 18 14:44:50 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 18 Oct 2022 14:44:50 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases [v2] In-Reply-To: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> Message-ID: > This patch extends the IR framework with the capability to perform IR matching not only on the `PrintIdeal` and `PrintOptoAssembly` flag outputs but also on the `PrintIdeal` output of the different compile phases defined by the `COMPILER_PHASES` macro in `phasetype.hpp`: > > https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/opto/phasetype.hpp#L28-L29 > > The IR framework uses the same compile phases with the same names (it only drops those which are part of a normal execution like `PHASE_DEBUG` or `PHASE_FAILURE`). A matching on the `PrintIdeal` and `PrintOptoAssembly` flags is still possible. The IR framework creates an own `CompilePhase` enum entry for them to simplify the implementation. > > ## How does it work? 
> > ### Basic idea > There is a new `phase` attribute for `@IR` annotations that allows the user to specify a list of compile phases on which the IR rule should be applied on: > > > int iFld; > > @Test > @IR(counts = {IRNode.STORE_I, "1"}, > phase = {CompilePhase.AFTER_PARSING, // Fails > CompilePhase.ITER_GVN1}) // Works > public void optimizeStores() { > iFld = 42; > iFld = 42 + iFld; // Removed in first IGVN iteration and replaced by iFld = 84 > } > > In this example, we apply the IR rule on the compile phases `AFTER_PARSING` and `ITER_GVN1`. Since the first store to `iFld` is only removed in the first IGVN iteration, the IR rule fails for `AFTER_PARSING` while it passes for `ITER_GVN1`: > > 1) Method "public void ir_framework.examples.IRExample.optimizeStores()" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING, ITER_GVN1}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#STORE_I#_", "1"}, failOn={}, applyIfAnd={}, applyIfOr={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\d+(\s){2}(StoreI.*)+(\s){2}===.*)" > - Failed comparison: [found] 2 = 1 [given] > - Matched nodes (2): > * 25 StoreI === 5 7 24 21 [[ 31 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:3 (line 123) > * 31 StoreI === 5 25 24 29 [[ 16 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:14 (line 124) > > > More examples are shown in `IRExample.java` and also in the new test `TestPhaseIRMatching.java`. > > ### CompilePhase.DEFAULT - default compile phase > The existing IR tests either match on the `PrintIdeal` and/or `PrintOptoAssembly` flag. 
Looking closer at the individual `@IR` rules, we can see that a single regex only matches **either** on `PrintIdeal` **or** on `PrintOptoAssembly` but never on both. Most of these regexes are taken from the class `IRNode` which currently provides default regexes for various C2 IR nodes (a default regex either matches on `PrintIdeal` or `PrintOptoAssembly`). > > Not only the existing IR tests but also the majority of future test do not need this new flexibility - they simply want to default match on the `PrintIdeal` flag. To avoid having to specify `phase = CompilePhase.PRINT_IDEAL` for each new rule (and also to avoid updating the large number of existing IR tests), we introduce a new compile phase `CompilePhase.DEFAULT` which is used by default if the user does not specify the `phase` attribute. > > Each entry in class `IRNode` now needs to define a default phase such that the IR framework knows which compile phase an `IRNode` should be matched on if `CompilePhase.DEFAULT` is selected. > > ### Different regexes for the same IRNode entry > A regex for an IR node might look different for certain compile phases. For example, we can directly match the node name `Allocate` before macro expansion for an `Allocate` node. But we are only able to match allocations again on the `PrintOptoAssembly` output (the current option) which requires a different regex. Therefore, the `IRNode` class is redesigned in the following way: > > - `IRNode` entries which currently represent direct regexes are replaced by "IR node placeholder strings". These strings are just encodings recognized by the IR framework as IR nodes: > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; // Normal IR node > public static final String ALLOC_OF = COMPOSITE_PREFIX + "ALLOC_OF" + POSTFIX; // Composite IR node > > - For each _IR node placeholder string_, we need to define which regex should be used for which compile phase. 
Additionally, we need to specify the default compile phase for this IR node. This is done in a static block immediately following the IR node placeholder string where we add a mapping to `IRNode.IR_NODE_MAPPINGS` (by using helper methods such as `allocNodes()`): > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; > static { > String idealIndependentRegex = START + "Allocate" + MID + END; > String optoRegex = "(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java" + END; > allocNodes(ALLOC, idealIndependentRegex, optoRegex); > } > > **Thus, adding a new IR node now requires two steps: Defining an IR node placeholder string and an associated regex-compile-phase-mapping in the static block immediately following the IR node placeholder string. This mapping should reflect in which compile phases the new IR node can be found.** > > ### Using the IRNode entries correctly > The IR framework enforces the correct usage of the new IR node placeholder strings. It reports format violations if: > - An `IRNode` entry is used for a compile phase which is not supported (e.g. trying to match a `LoopNode` on the output of `CompilePhase.AFTER_PARSING`). > - An IR node entry is used without specifying a mapping in a static block (e.g. when creating a new entry and forgetting to set up the mapping). > - Using a user-defined regex in `@IR` without specifying a non-default compile phase in `phase`. `CompilePhase.DEFAULT` only works with IR node placeholder strings. In all other cases, the user must explicitly specify one or more non-default compile phases. > > ## General Changes > The patch became larger than I intended it to be. I've tried to split it into smaller patches but that was quite difficult and I eventually gave up. I therefore try to summarize the changes to make reviewing this simpler: > > - Added more packages to better group related classes together. 
> - Removed unused IGV phases in `phasetype.hpp` and sorted them in the order in which they are used throughout a compilation. > - Updated existing, now failing, IR tests that use user-defined regexes. These now require a non-default phase. I've fixed that by replacing the user-defined regexes with `IRNode` entries (required some new `IRNode` entries for missing IR nodes). > - Introduced a new interface `Matchable.java` (for all classes representing a part on which IR matching can be done such as `IRMethod` or `IRRule`) to use it throughout the code instead of concrete class references. This allows better code sharing together with the already added `MatchResult.java` interface. Using interface types also simplifies testing as everything is substitutable. Each `Matchable`/`MatchResult` object now just contains a list of `Matchable`/`MatchResult` objects (e.g. an `IRMethod`/`IRMethodResult` contains a list of `IRRule`/`IRRuleMatchResult` objects to represent all IR rules/IR rule match results of this method etc.) > - Cleaned up and refactored a lot of code to use this new design. > - Using the visitor pattern to visit the `MatchResult` objects. This simplifies the failure message and compilation output printing of the different compile phases. It allows us to perform specific operations at each level (i.e. on a method level, an IR rule level etc.). Visitors are also used in testing. > - New compile phase related classes in package `phase`. The idea is that each `IRRule`/`IRRuleMatchResult` class contains a list of `CompilePhaseIRRule`/`CompilePhaseIRRuleMatchResult` objects for each compile phase. When `CompilePhase.DEFAULT` is used, we create or re-use `CompilePhaseIRRule` objects which represent the specified default phases of each `IRNode` entry. > - The `FlagVM` now collects all the needed compile phases and creates a compiler directives file for the `TestVM` accordingly. We now only print the output which is also used for IR matching. 
Before this change, we always emitted `PrintIdeal` and `PrintOptoAssembly`. Even when not using the new matching on compile phases, we still get a performance benefit when only using `IRNode` entries which match on either `PrintIdeal` or `PrintOptoAssembly`. > - The failure message and compilation output are sorted alphabetically by method names and by the enum definition order of `CompilePhase`. > - Replaced implementation inheritance by interfaces. > - Improved encapsulation of object data. > - Updated README and many comments/class descriptions to reflect this new feature. > - Added new IR framework tests > > ## Testing > - Normal tier testing. > - Applying the patch to Valhalla to perform tier testing. > - Testing with a ZGC branch by @robcasloz which does IR matching on mach graph compile phases. This requirement was the original motivation for this RFE. Thanks Roberto for helping to test this! > > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with three additional commits since the last revision: - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/NonIRTestClass.java Co-authored-by: Roberto Castañeda Lozano - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/constraint/raw/RawConstraint.java Co-authored-by: Roberto Castañeda Lozano - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/phase/CompilePhaseIRRuleBuilder.java Co-authored-by: Roberto Castañeda Lozano ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10695/files - new: https://git.openjdk.org/jdk/pull/10695/files/347f26e1..c29f5d48 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10695&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10695&range=00-01 Stats: 3 lines in 3 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/10695.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10695/head:pull/10695 PR: 
https://git.openjdk.org/jdk/pull/10695 From chagedorn at openjdk.org Tue Oct 18 14:50:31 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 18 Oct 2022 14:50:31 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases [v3] In-Reply-To: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> Message-ID: Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Roberto's review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10695/files - new: https://git.openjdk.org/jdk/pull/10695/files/c29f5d48..ba43c6b7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10695&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10695&range=01-02 Stats: 67 lines in 5 files changed: 1 ins; 59 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/10695.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10695/head:pull/10695 PR: https://git.openjdk.org/jdk/pull/10695 From chagedorn at openjdk.org Tue Oct 18 14:50:31 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 18 Oct 2022 14:50:31 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases [v3] In-Reply-To: <-sdsXTMvVcizkj8iRuAroLF3rTCTX_TC0kqjrbC1AhQ=.5121a3a4-07ef-41ec-9c5d-cd7144df8c9f@github.com> References: 
<55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> <-sdsXTMvVcizkj8iRuAroLF3rTCTX_TC0kqjrbC1AhQ=.5121a3a4-07ef-41ec-9c5d-cd7144df8c9f@github.com> Message-ID: On Fri, 14 Oct 2022 09:02:29 GMT, Roberto Castañeda Lozano wrote: > Thanks for implementing this useful feature, Christian! I have tried it out again for my use case ([barrier elision tests for generational ZGC](https://github.com/robcasloz/zgc/tree/barrier-elision-tests)) and it works fine. I have also tested different combinations of phases and IR nodes and the tests pass/fail as expected. Nice that you added extensive tests for the IR framework itself (`test/hotspot/jtreg/testlibrary_tests/ir_framework`). I only have some minor comments and suggestions. Thanks a lot Roberto for your review and re-testing the current state again with your ZGC branch and playing around with different compile phases! I've pushed updates with your review comments. ------------- PR: https://git.openjdk.org/jdk/pull/10695 From chagedorn at openjdk.org Tue Oct 18 15:18:39 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 18 Oct 2022 15:18:39 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases [v4] In-Reply-To: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> Message-ID: <79jNFfFco0pTukz0QO8XY12CUgP9HsXkDmgOCaW3BLU=.81e1dcf1-7560-4ea3-94d6-287ead53a54d@github.com> Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Hao's patch to address review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10695/files - new: https://git.openjdk.org/jdk/pull/10695/files/ba43c6b7..bdbd6917 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10695&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10695&range=02-03 Stats: 84 lines in 7 files changed: 61 ins; 0 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/10695.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10695/head:pull/10695 PR: https://git.openjdk.org/jdk/pull/10695 From chagedorn at openjdk.org Tue Oct 18 15:18:42 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 18 Oct 2022 15:18:42 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases In-Reply-To: References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> Message-ID: On Mon, 17 Oct 2022 08:15:11 GMT, Hao Sun wrote: > Hi, > > 1. For the following two files, the copyright year should be updated to 2022. > > ``` > src/hotspot/share/opto/phasetype.hpp > test/hotspot/jtreg/compiler/lib/ir_framework/driver/FlagVMProcess.java > ``` Good catch! I've updated them. > 2. I tested this PR on one SVE supporting machine and found the IR verification failed for the following 4 cases. Mainly because the IR nodes are not created for these SVE related rules. > > ``` > test/hotspot/jtreg/compiler/vectorapi/AllBitsSetVectorMatchRuleTest.java > test/hotspot/jtreg/compiler/vectorapi/VectorFusedMultiplyAddSubTest.java > test/hotspot/jtreg/compiler/vectorapi/VectorGatherScatterTest.java > test/hotspot/jtreg/compiler/vectorapi/VectorMaskedNotTest.java > ``` Thanks a lot Hao for running additional testing to catch these missing test updates which our CI testing did not cover. 
The patch looks good, thanks for providing the required updates! I've pushed the changes. ------------- PR: https://git.openjdk.org/jdk/pull/10695 From chagedorn at openjdk.org Tue Oct 18 15:18:42 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 18 Oct 2022 15:18:42 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases In-Reply-To: References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> Message-ID: On Tue, 18 Oct 2022 15:15:01 GMT, Christian Hagedorn wrote: > Good. Thanks Vladimir for your review! ------------- PR: https://git.openjdk.org/jdk/pull/10695 From shade at openjdk.org Tue Oct 18 15:28:00 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 18 Oct 2022 15:28:00 GMT Subject: RFR: 8295469: S390X: Optimized builds are broken In-Reply-To: References: Message-ID: On Tue, 18 Oct 2022 11:53:20 GMT, Aleksey Shipilev wrote: > * For target hotspot_variant-server_libjvm_objs_BUILD_LIBJVM_link: Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/10743 From shade at openjdk.org Tue Oct 18 15:30:47 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 18 Oct 2022 15:30:47 GMT Subject: Integrated: 8295469: S390X: Optimized builds are broken In-Reply-To: References: Message-ID: On Tue, 18 Oct 2022 11:53:20 GMT, Aleksey Shipilev wrote: > * For target hotspot_variant-server_libjvm_objs_BUILD_LIBJVM_link: This pull request has now been integrated. 
Changeset: 7b2e83b3 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/7b2e83b3955c034208325ea5477afd3c5e1da41a Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod 8295469: S390X: Optimized builds are broken Reviewed-by: stuefe ------------- PR: https://git.openjdk.org/jdk/pull/10743 From chagedorn at openjdk.org Tue Oct 18 15:40:57 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 18 Oct 2022 15:40:57 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v20] In-Reply-To: References: Message-ID: On Mon, 19 Sep 2022 07:53:04 GMT, Tobias Holenstein wrote: >> src/utils/IdealGraphVisualizer/Data/src/main/java/com/sun/hotspot/igv/data/InputNode.java line 34: >> >>> 32: public class InputNode extends Properties.Entity { >>> 33: >>> 34: private int id; >> >> While cleaning this class up anyways: Feels like a node id should probably not change anymore once it's set. Can this be turned into a `final` field? Looks like `setId()` is only called from this class and once from another class when creating a new input node anyways. > > I think it is called in `Difference.java` as well: `n2.setId(curIndex);` , right? Yes, that's true. You could change that code to pass the index to the `InputNode` constructor: // Find new ID for node of b, does not change the id property while (graph.getNode(curIndex) != null) { curIndex++; } InputNode n2 = new InputNode(n, curIndex); But I leave it up to you if you also want to refactor that or not - it's not directly related to your changes :-) ------------- PR: https://git.openjdk.org/jdk/pull/10197 From dlong at openjdk.org Tue Oct 18 19:00:53 2022 From: dlong at openjdk.org (Dean Long) Date: Tue, 18 Oct 2022 19:00:53 GMT Subject: RFR: 8295414: [Aarch64] C2: assert(false) failed: bad AD file Message-ID: The "bad AD file" error is because PartialSubtypeCheck produces an iRegP_R5 result, which cannot be matched as an input where iRegP is expected. 
Updated the test to reproduce the crash and updated the iRegP rule to match iRegP_R5. The fact that this went so long without being noticed makes me wonder how much test coverage PartialSubtypeCheck has received on aarch64. ------------- Commit messages: - update test - match iRegP_R5 as iRegP Changes: https://git.openjdk.org/jdk/pull/10749/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10749&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295414 Stats: 5 lines in 2 files changed: 4 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10749.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10749/head:pull/10749 PR: https://git.openjdk.org/jdk/pull/10749 From kvn at openjdk.org Tue Oct 18 19:10:57 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Oct 2022 19:10:57 GMT Subject: RFR: 8295414: [Aarch64] C2: assert(false) failed: bad AD file In-Reply-To: References: Message-ID: On Tue, 18 Oct 2022 18:52:43 GMT, Dean Long wrote: > The "bad AD file" error is because PartialSubtypeCheck produces an iRegP_R5 result, which cannot be matched as an input where iRegP is expected. > Update the test to reproduce the crash and updated iRegP rule to match iRegP_R5. > The fact that this went so long without being noticed makes me wonder how much test coverage PartialSubtypeCheck has received on aarch64. Good ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10749 From aturbanov at openjdk.org Tue Oct 18 20:28:08 2022 From: aturbanov at openjdk.org (Andrey Turbanov) Date: Tue, 18 Oct 2022 20:28:08 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v2] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 10:12:42 GMT, Bhavana Kilambi wrote: >> Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors.
This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by combining them into fewer operations using the "eor3" instruction. For example - >> >> eor a, a, b >> eor a, a, c >> >> can be optimized to a single instruction - `eor3 a, b, c` >> >> This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - >> >> >> Benchmark gain >> TestEor3.test1Int 10.87% >> TestEor3.test1Long 8.84% >> TestEor3.test2Int 21.68% >> TestEor3.test2Long 21.04% >> >> >> The numbers shown are performance gains from using the Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon.
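To make the targeted pattern concrete, here is a minimal Java sketch (hypothetical code, not part of the patch; the class and method names are invented). The loop body XORs three arrays; C2 can auto-vectorize this into pairs of vector eor operations, which a backend rule like the one above may then fuse into a single eor3 on SHA3-capable hardware.

```java
public class Eor3Demo {
    // Three-way XOR kernel: each iteration computes a[i] ^ b[i] ^ c[i].
    // After auto-vectorization this becomes two vector XORs per iteration,
    // the shape that an eor3 backend rule can combine into one instruction.
    static void xor3(int[] r, int[] a, int[] b, int[] c) {
        for (int i = 0; i < r.length; i++) {
            r[i] = a[i] ^ b[i] ^ c[i];
        }
    }

    public static void main(String[] args) {
        int[] r = new int[2];
        xor3(r, new int[]{1, 2}, new int[]{3, 4}, new int[]{5, 6});
        System.out.println(r[0] + "," + r[1]); // 1^3^5 = 7, 2^4^6 = 0
    }
}
```

Note that the transformation only applies when the intermediate XOR result has no other use, matching the destructive `eor a, a, b; eor a, a, c` sequence shown above.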
> > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Modified JTREG test to include feature constraints test/hotspot/jtreg/compiler/vectorization/TestEor3AArch64.java line 44: > 42: public class TestEor3AArch64 { > 43: > 44: private final static int LENGTH = 2048; Suggestion: private static final int LENGTH = 2048; test/hotspot/jtreg/compiler/vectorization/TestEor3AArch64.java line 45: > 43: > 44: private final static int LENGTH = 2048; > 45: private final static Random RD = Utils.getRandomInstance(); let's use blessed modifiers order Suggestion: private static final Random RD = Utils.getRandomInstance(); ------------- PR: https://git.openjdk.org/jdk/pull/10407 From sviswanathan at openjdk.org Tue Oct 18 21:35:03 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 18 Oct 2022 21:35:03 GMT Subject: RFR: 8292761: x86: Clone nodes to match complex rules [v2] In-Reply-To: References: Message-ID: On Sat, 17 Sep 2022 12:23:35 GMT, Quan Anh Mai wrote: >> Please include the benchmark in the patch. Could you show the generated code before/after? Thanks! > > Thank @TobiHartmann @chhagedorn for your comments, I have updated the PR to address those. @merykitty Could you please add benchmark for lea n, [a + b + i]? ------------- PR: https://git.openjdk.org/jdk/pull/9977 From sviswanathan at openjdk.org Tue Oct 18 22:40:09 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 18 Oct 2022 22:40:09 GMT Subject: RFR: 8292761: x86: Clone nodes to match complex rules [v4] In-Reply-To: <7lozKkn8Du15iEOwGnNL9uk9atr8L3RcvSvgeAYooVA=.639c1d53-9323-4aba-884d-0df278ee07f8@github.com> References: <7lozKkn8Du15iEOwGnNL9uk9atr8L3RcvSvgeAYooVA=.639c1d53-9323-4aba-884d-0df278ee07f8@github.com> Message-ID: On Sat, 17 Sep 2022 09:42:41 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch tries to clone a node if it can be matched as a part of a BMI and lea pattern. 
This may reduce the live range of a local or remove that local completely. >> >> Please take a look and review. Thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > address comments src/hotspot/cpu/x86/x86.ad line 2504: > 2502: Node* other = n->in((n->in(1) == m) ? 2 : 1); > 2503: m = other; > 2504: } The caller pd_clone_node is checking if m needs to be cloned, but we are changing m here locally before doing the pattern match? The original code was not doing this. Would it now result in cloning the wrong node? ------------- PR: https://git.openjdk.org/jdk/pull/9977 From sviswanathan at openjdk.org Tue Oct 18 23:25:10 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 18 Oct 2022 23:25:10 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions In-Reply-To: References: Message-ID: <523ASDMlZe7mAZaBQe3ipxBLaLum7_XZqLLUUgsCJi0=.db28f521-c957-4fb2-8dcc-7c09d46189e3@github.com> On Wed, 5 Oct 2022 21:28:26 GMT, vpaprotsk wrote: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, I left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ± 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ± 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ±
14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ± 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ± 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ± 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ± 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ± 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ± 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ± 56.147 ops/s src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 262: > 260: private static void processMultipleBlocks(byte[] input, int offset, int length, byte[] aBytes, byte[] rBytes) { > 261: MutableIntegerModuloP A = ipl1305.getElement(aBytes).mutable(); > 262: MutableIntegerModuloP R = ipl1305.getElement(rBytes).mutable(); R doesn't need to be mutable. src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 286: > 284: * numeric values. > 285: */ > 286: private void setRSVals() { //throws InvalidKeyException { The R and S check for invalid key (all bytes zero) could be submitted as a separate PR. It is not related to the Poly1305 acceleration. test/jdk/com/sun/crypto/provider/Cipher/ChaCha20/unittest/java.base/com/sun/crypto/provider/Poly1305IntrinsicFuzzTest.java line 39: > 37: public static void main(String[] args) throws Exception { > 38: //Note: it might be useful to increase this number during development of new Poly1305 intrinsics > 39: final int repeat = 100; Should we increase this repeat count for the c2 compiler to kick in for compiling engineUpdate() and have the call to stub in place from there?
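On the repeat-count question: with default tiered compilation, C2 typically compiles a method only after on the order of ten thousand invocations, so a repeat count of 100 is likely to exercise only the interpreter or C1 path. A minimal sketch of the warm-up idea (all names here are invented stand-ins, not the actual test code):

```java
public class WarmupDemo {
    static long acc;

    // Stand-in for the engineUpdate()/runSingleTest path under discussion.
    static void engineUpdate(byte[] data) {
        for (byte b : data) {
            acc += b;
        }
    }

    public static void main(String[] args) {
        byte[] msg = new byte[64];
        msg[0] = 1;
        // Loop well past the default C2 compile threshold so that later
        // iterations run the compiled (and, where available, intrinsified)
        // code path rather than the interpreter.
        for (int i = 0; i < 20_000; i++) {
            engineUpdate(msg);
        }
        System.out.println(acc);
    }
}
```

Alternatively, a jtreg test can force the issue with `-Xbatch -XX:-TieredCompilation` or an explicit `-XX:CompileCommand` rather than relying on iteration counts alone.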
test/jdk/com/sun/crypto/provider/Cipher/ChaCha20/unittest/java.base/com/sun/crypto/provider/Poly1305KAT.java line 133: > 131: System.out.println("*** Test " + ++testNumber + ": " + > 132: test.testName); > 133: if (runSingleTest(test)) { runSingleTest may need to be called a sufficient number of times for engineUpdate to be compiled by C2. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From haosun at openjdk.org Wed Oct 19 01:57:54 2022 From: haosun at openjdk.org (Hao Sun) Date: Wed, 19 Oct 2022 01:57:54 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases [v4] In-Reply-To: <79jNFfFco0pTukz0QO8XY12CUgP9HsXkDmgOCaW3BLU=.81e1dcf1-7560-4ea3-94d6-287ead53a54d@github.com> References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> <79jNFfFco0pTukz0QO8XY12CUgP9HsXkDmgOCaW3BLU=.81e1dcf1-7560-4ea3-94d6-287ead53a54d@github.com> Message-ID: On Tue, 18 Oct 2022 15:18:39 GMT, Christian Hagedorn wrote:
>> >> ### Basic idea >> There is a new `phase` attribute for `@IR` annotations that allows the user to specify a list of compile phases on which the IR rule should be applied on: >> >> >> int iFld; >> >> @Test >> @IR(counts = {IRNode.STORE_I, "1"}, >> phase = {CompilePhase.AFTER_PARSING, // Fails >> CompilePhase.ITER_GVN1}) // Works >> public void optimizeStores() { >> iFld = 42; >> iFld = 42 + iFld; // Removed in first IGVN iteration and replaced by iFld = 84 >> } >> >> In this example, we apply the IR rule on the compile phases `AFTER_PARSING` and `ITER_GVN1`. Since the first store to `iFld` is only removed in the first IGVN iteration, the IR rule fails for `AFTER_PARSING` while it passes for `ITER_GVN1`: >> >> 1) Method "public void ir_framework.examples.IRExample.optimizeStores()" - [Failed IR rules: 1]: >> * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING, ITER_GVN1}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#STORE_I#_", "1"}, failOn={}, applyIfAnd={}, applyIfOr={}, applyIfNot={})" >> > Phase "After Parsing": >> - counts: Graph contains wrong number of nodes: >> * Constraint 1: "(\d+(\s){2}(StoreI.*)+(\s){2}===.*)" >> - Failed comparison: [found] 2 = 1 [given] >> - Matched nodes (2): >> * 25 StoreI === 5 7 24 21 [[ 31 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:3 (line 123) >> * 31 StoreI === 5 25 24 29 [[ 16 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:14 (line 124) >> >> >> More examples are shown in `IRExample.java` and also in the new test `TestPhaseIRMatching.java`. >> >> ### CompilePhase.DEFAULT - default compile phase >> The existing IR tests either match on the `PrintIdeal` and/or `PrintOptoAssembly` flag. 
Looking closer at the individual `@IR` rules, we can see that a single regex only matches **either** on `PrintIdeal` **or** on `PrintOptoAssembly` but never on both. Most of these regexes are taken from the class `IRNode` which currently provides default regexes for various C2 IR nodes (a default regex either matches on `PrintIdeal` or `PrintOptoAssembly`). >> >> Not only the existing IR tests but also the majority of future tests do not need this new flexibility - they simply want to default match on the `PrintIdeal` flag. To avoid having to specify `phase = CompilePhase.PRINT_IDEAL` for each new rule (and also to avoid updating the large number of existing IR tests), we introduce a new compile phase `CompilePhase.DEFAULT` which is used by default if the user does not specify the `phase` attribute. >> >> Each entry in class `IRNode` now needs to define a default phase such that the IR framework knows which compile phase an `IRNode` should be matched on if `CompilePhase.DEFAULT` is selected. >> >> ### Different regexes for the same IRNode entry >> A regex for an IR node might look different for certain compile phases. For example, we can directly match the node name `Allocate` before macro expansion for an `Allocate` node. But we are only able to match allocations again on the `PrintOptoAssembly` output (the current option) which requires a different regex. Therefore, the `IRNode` class is redesigned in the following way: >> >> - `IRNode` entries which currently represent direct regexes are replaced by "IR node placeholder strings". These strings are just encodings recognized by the IR framework as IR nodes: >> >> public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; // Normal IR node >> public static final String ALLOC_OF = COMPOSITE_PREFIX + "ALLOC_OF" + POSTFIX; // Composite IR node >> >> - For each _IR node placeholder string_, we need to define which regex should be used for which compile phase.
Additionally, we need to specify the default compile phase for this IR node. This is done in a static block immediately following the IR node placeholder string where we add a mapping to `IRNode.IR_NODE_MAPPINGS` (by using helper methods such as `allocNodes()`): >> >> public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; >> static { >> String idealIndependentRegex = START + "Allocate" + MID + END; >> String optoRegex = "(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java" + END; >> allocNodes(ALLOC, idealIndependentRegex, optoRegex); >> } >> >> **Thus, adding a new IR node now requires two steps: Defining an IR node placeholder string and an associated regex-compile-phase-mapping in the static block immediately following the IR node placeholder string. This mapping should reflect in which compile phases the new IR node can be found.** >> >> ### Using the IRNode entries correctly >> The IR framework enforces the correct usage of the new IR node placeholder strings. It reports format violations if: >> - An `IRNode` entry is used for a compile phase which is not supported (e.g. trying to match a `LoopNode` on the output of `CompilePhase.AFTER_PARSING`). >> - An IR node entry is used without specifying a mapping in a static block (e.g. when creating a new entry and forgetting to set up the mapping). >> - Using a user-defined regex in `@IR` without specifying a non-default compile phase in `phase`. `CompilePhase.DEFAULT` only works with IR node placeholder strings. In all other cases, the user must explicitly specify one or more non-default compile phases. >> >> ## General Changes >> The patch became larger than I intended it to be. I've tried to split it into smaller patches but that was quite difficult and I eventually gave up. I therefore try to summarize the changes to make reviewing this simpler: >> >> - Added more packages to better group related classes together.
>> - Removed unused IGV phases in `phasetype.hpp` and sorted them in the order how they are used throughout a compilation. >> - Updated existing, now failing, IR tests that use user-defined regexes. These now require a non-default phase. I've fixed that by replacing the user-defined regexes with `IRNode` entries (required some new `IRNode` entries for missing IR nodes). >> - Introduced a new interface `Matchable.java` (for all classes representing a part on which IR matching can be done such as `IRMethod` or `IRRule`) to use it throughout the code instead of concrete class references. This allows better code sharing together with the already added `MatchResult.java` interface. Using interface types also simplifies testing as everything is substitutable. Each `Matchable`/`MatchResult` object now just contains a list of `Matchable`/`MatchResult` objects (e.g. an `IRMethod`/`IRMethodResult` contains a list of `IRRule`/`IRRuleMatchResult` objects to represent all IR rules/IR rule match results of this method etc.) >> - Cleaned up and refactored a lot of code to use this new design. >> - Using visitor pattern to visit the `MatchResult` objects. This simplifies the failure message and compilation output printing of the different compile phases. It allows us to perform specific operations at each level (i.e. on a method level, an IR rule level etc.). Visitors are also used in testing. >> - New compile phase related classes in package `phase`. The idea is that each `IRRule`/`IRRuleMatchResult` class contains a list of `CompilePhaseIRRule`/`CompilePhaseIRRuleMatchResult` objects for each compile phase. When `CompilePhase.DEFAULT` is used, we create or re-use `CompilePhaseIRRule` objects which represent the specified default phases of each `IRNode` entry. >> - The `FlagVM` now collects all the needed compile phases and creates a compiler directives file for the `TestVM` accordingly. We now only print the output which is also used for IR matching. 
Before this change, we always emitted `PrintIdeal` and `PrintOptoAssembly`. Even when not using the new matching on compile phases, we still get a performance benefit when only using `IRNode` entries which match on either `PrintIdeal` or `PrintOptoAssembly`. >> - The failure message and compilation output are sorted alphabetically by method names and by the enum definition order of `CompilerPhase`. >> - Replaced implementation inheritance by interfaces. >> - Improved encapsulation of object data. >> - Updated README and many comments/class descriptions to reflect this new feature. >> - Added new IR framework tests >> >> ## Testing >> - Normal tier testing. >> - Applying the patch to Valhalla to perform tier testing. >> - Testing with a ZGC branch by @robcasloz which does IR matching on mach graph compile phases. This requirement was the original motivation for this RFE. Thanks Roberto for helping to test this! >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Hao's patch to address review comments Hi, I tested this PR based on the latest code in the master branch, and found the following IR verification error on an AArch64 machine. test/hotspot/jtreg/compiler/c2/irTests/TestVectorConditionalMove.java:178: error: cannot find symbol @IR(failOn = {IRNode.CMOVEVD}) ^ symbol: variable CMOVEVD location: class IRNode This is because `TestVectorConditionalMove.java` case was updated recently, but it's not included/updated accordingly in this PR. See https://github.com/openjdk/jdk/commit/490fcd0c2547cb. Hence, I suggest rebasing this patch and rerunning the testing. Note that `TestVectorConditionalMove.java` is the only new failure in my local test (i.e. tier1~3) on both 1) AArch64 machine and 2) SVE machine.
------------- PR: https://git.openjdk.org/jdk/pull/10695 From chagedorn at openjdk.org Wed Oct 19 06:27:49 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 19 Oct 2022 06:27:49 GMT Subject: RFR: 8295414: [Aarch64] C2: assert(false) failed: bad AD file In-Reply-To: References: Message-ID: On Tue, 18 Oct 2022 18:52:43 GMT, Dean Long wrote: > The "bad AD file" error is because PartialSubtypeCheck produces an iRegP_R5 result, which cannot be matched as an input where iRegP is expected. > Update the test to reproduce the crash and updated iRegP rule to match iRegP_R5. > The fact that this went so long without being noticed makes me wonder how much test coverage PartialSubtypeCheck has received on aarch64. Otherwise, looks good! test/hotspot/jtreg/compiler/types/TestSubTypeCheckMacroTrichotomy.java line 33: > 31: * > 32: * @run main/othervm -XX:-BackgroundCompilation TestSubTypeCheckMacroTrichotomy > 33: * @run main/othervm -XX:-BackgroundCompilation -XX:+StressReflectiveCode -XX:+ExpandSubTypeCheckAtParseTime You should additionally add `-XX:+UnlockDiagnosticVMOptions` since `ExpandSubTypeCheckAtParseTime` is a diagnostic VM flag. ------------- Changes requested by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10749 From dlong at openjdk.org Wed Oct 19 07:36:01 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Oct 2022 07:36:01 GMT Subject: RFR: 8295414: [Aarch64] C2: assert(false) failed: bad AD file [v2] In-Reply-To: References: Message-ID: > The "bad AD file" error is because PartialSubtypeCheck produces an iRegP_R5 result, which cannot be matched as an input where iRegP is expected. > Update the test to reproduce the crash and updated iRegP rule to match iRegP_R5. > The fact that this went so long without being noticed makes me wonder how much test coverage PartialSubtypeCheck has received on aarch64. 
Dean Long has updated the pull request incrementally with one additional commit since the last revision: allow test to run with release builds ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10749/files - new: https://git.openjdk.org/jdk/pull/10749/files/87445222..e472a20e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10749&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10749&range=00-01 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10749.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10749/head:pull/10749 PR: https://git.openjdk.org/jdk/pull/10749 From dlong at openjdk.org Wed Oct 19 07:36:03 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Oct 2022 07:36:03 GMT Subject: RFR: 8295414: [Aarch64] C2: assert(false) failed: bad AD file [v2] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 06:22:52 GMT, Christian Hagedorn wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> allow test to run with release builds > > test/hotspot/jtreg/compiler/types/TestSubTypeCheckMacroTrichotomy.java line 33: > >> 31: * >> 32: * @run main/othervm -XX:-BackgroundCompilation TestSubTypeCheckMacroTrichotomy >> 33: * @run main/othervm -XX:-BackgroundCompilation -XX:+StressReflectiveCode -XX:+ExpandSubTypeCheckAtParseTime > > You should additionally add `-XX:+UnlockDiagnosticVMOptions` since `ExpandSubTypeCheckAtParseTime` is a diagnostic VM flag. Thanks for catching that. Unfortunately, it also requires a debug build, but if I add @requires vm.debug, then the tests won't run with release builds. I went with the next best option: -XX:+IgnoreUnrecognizedVMOptions. 
------------- PR: https://git.openjdk.org/jdk/pull/10749 From jbhateja at openjdk.org Wed Oct 19 07:46:01 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 19 Oct 2022 07:46:01 GMT Subject: RFR: 8293409: [vectorapi] Intrinsify VectorSupport.indexVector [v2] In-Reply-To: References: Message-ID: <2QQplnsxNj7THlLJteQ3EfmBcqNnSqH4he4vso9PjLk=.2be4a7e8-3165-479b-8dce-57a312e54ae1@github.com> On Tue, 18 Oct 2022 01:44:21 GMT, Xiaohong Gong wrote: >> "`VectorSupport.indexVector()`" is used to compute a vector that contains the index values based on a given vector and a scale value (`i.e. index = vec + iota * scale`). This function is widely used in other APIs like "`VectorMask.indexInRange`" which is useful for tail loop vectorization. It can be easily implemented with vector instructions. >> >> This patch adds the vector intrinsic implementation of it. The steps are: >> >> 1) Load the const "iota" vector. >> >> We extend the "`vector_iota_indices`" stubs from byte to other integral types. For floating point vectors, it needs an additional vector cast to get the right iota values. >> >> 2) Compute indexes with "`vec + iota * scale`" >> >> Here are the performance results of the newly added micro benchmark on ARM NEON: >> >> Benchmark Gain >> IndexVectorBenchmark.byteIndexVector 1.477 >> IndexVectorBenchmark.doubleIndexVector 5.031 >> IndexVectorBenchmark.floatIndexVector 5.342 >> IndexVectorBenchmark.intIndexVector 5.529 >> IndexVectorBenchmark.longIndexVector 3.177 >> IndexVectorBenchmark.shortIndexVector 5.841 >> >> >> Please help to review and share your feedback! Thanks in advance! > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: > > - Add the floating point support for VectorLoadConst and remove the VectorCast > - Merge branch 'master' into JDK-8293409 > - 8293409: [vectorapi] Intrinsify VectorSupport.indexVector Hi @XiaohongGong , patch now shows significant gains on both AVX512 and legacy X86 targets. X86 and common IR changes LGTM, thanks! ------------- Marked as reviewed by jbhateja (Reviewer). PR: https://git.openjdk.org/jdk/pull/10332 From xgong at openjdk.org Wed Oct 19 07:50:13 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 19 Oct 2022 07:50:13 GMT Subject: RFR: 8293409: [vectorapi] Intrinsify VectorSupport.indexVector [v2] In-Reply-To: <2QQplnsxNj7THlLJteQ3EfmBcqNnSqH4he4vso9PjLk=.2be4a7e8-3165-479b-8dce-57a312e54ae1@github.com> References: <2QQplnsxNj7THlLJteQ3EfmBcqNnSqH4he4vso9PjLk=.2be4a7e8-3165-479b-8dce-57a312e54ae1@github.com> Message-ID: <5DXiOa_G0UGgdQwT3hY3SJtrV48jKRtjtXdjO7nLNL8=.d9497a92-ae5d-43da-bf39-1c53106e74f1@github.com> On Wed, 19 Oct 2022 07:43:33 GMT, Jatin Bhateja wrote: >> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Add the floating point support for VectorLoadConst and remove the VectorCast >> - Merge branch 'master' into JDK-8293409 >> - 8293409: [vectorapi] Intrinsify VectorSupport.indexVector > > Hi @XiaohongGong , patch now shows significant gains on both AVX512 and legacy X86 targets. > > X86 and common IR changes LGTM, thanks! Thanks for the review @jatin-bhateja @theRealELiu ! 
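For reference, the `index = vec + iota * scale` semantics discussed in this thread can be modeled by a small scalar sketch (a hypothetical helper, not the actual `VectorSupport` code), where `iota[i] == i`:

```java
public class IndexVectorDemo {
    // Scalar model of indexVector: out[i] = vec[i] + iota[i] * scale,
    // with the constant iota vector being {0, 1, 2, ...}.
    static int[] indexVector(int[] vec, int scale) {
        int[] out = new int[vec.length];
        for (int i = 0; i < vec.length; i++) {
            out[i] = vec[i] + i * scale;
        }
        return out;
    }

    public static void main(String[] args) {
        // vec = {10, 10, 10, 10}, scale = 2 -> {10, 12, 14, 16}
        System.out.println(java.util.Arrays.toString(
                indexVector(new int[]{10, 10, 10, 10}, 2)));
    }
}
```

The intrinsic performs the same per-lane computation with a single load of the pre-built iota constant plus a multiply-add, which is where the reported gains come from.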
------------- PR: https://git.openjdk.org/jdk/pull/10332 From chagedorn at openjdk.org Wed Oct 19 08:19:16 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 19 Oct 2022 08:19:16 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases [v5] In-Reply-To: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> Message-ID: > This patch extends the IR framework with the capability to perform IR matching not only on the `PrintIdeal` and `PrintOptoAssembly` flag outputs but also on the `PrintIdeal` output of the different compile phases defined by the `COMPILER_PHASES` macro in `phasetype.hpp`: > > https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/opto/phasetype.hpp#L28-L29 > > The IR framework uses the same compile phases with the same names (it only drops those which are part of a normal execution like `PHASE_DEBUG` or `PHASE_FAILURE`). A matching on the `PrintIdeal` and `PrintOptoAssembly` flags is still possible. The IR framework creates an own `CompilePhase` enum entry for them to simplify the implementation. > > ## How does it work? > > ### Basic idea > There is a new `phase` attribute for `@IR` annotations that allows the user to specify a list of compile phases on which the IR rule should be applied on: > > > int iFld; > > @Test > @IR(counts = {IRNode.STORE_I, "1"}, > phase = {CompilePhase.AFTER_PARSING, // Fails > CompilePhase.ITER_GVN1}) // Works > public void optimizeStores() { > iFld = 42; > iFld = 42 + iFld; // Removed in first IGVN iteration and replaced by iFld = 84 > } > > In this example, we apply the IR rule on the compile phases `AFTER_PARSING` and `ITER_GVN1`. 
Since the first store to `iFld` is only removed in the first IGVN iteration, the IR rule fails for `AFTER_PARSING` while it passes for `ITER_GVN1`: > > 1) Method "public void ir_framework.examples.IRExample.optimizeStores()" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING, ITER_GVN1}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#STORE_I#_", "1"}, failOn={}, applyIfAnd={}, applyIfOr={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\d+(\s){2}(StoreI.*)+(\s){2}===.*)" > - Failed comparison: [found] 2 = 1 [given] > - Matched nodes (2): > * 25 StoreI === 5 7 24 21 [[ 31 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:3 (line 123) > * 31 StoreI === 5 25 24 29 [[ 16 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:14 (line 124) > > > More examples are shown in `IRExample.java` and also in the new test `TestPhaseIRMatching.java`. > > ### CompilePhase.DEFAULT - default compile phase > The existing IR tests either match on the `PrintIdeal` and/or `PrintOptoAssembly` flag. Looking closer at the individual `@IR` rules, we can see that a single regex only matches **either** on `PrintIdeal` **or** on `PrintOptoAssembly` but never on both. Most of these regexes are taken from the class `IRNode` which currently provides default regexes for various C2 IR nodes (a default regex either matches on `PrintIdeal` or `PrintOptoAssembly`). > > Not only the existing IR tests but also the majority of future test do not need this new flexibility - they simply want to default match on the `PrintIdeal` flag. 
To avoid having to specify `phase = CompilePhase.PRINT_IDEAL` for each new rule (and also to avoid updating the large number of existing IR tests), we introduce a new compile phase `CompilePhase.DEFAULT` which is used by default if the user does not specify the `phase` attribute. > > Each entry in class `IRNode` now needs to define a default phase such that the IR framework knows which compile phase an `IRNode` should be matched on if `CompilePhase.DEFAULT` is selected. > > ### Different regexes for the same IRNode entry > A regex for an IR node might look different for certain compile phases. For example, we can directly match the node name `Allocate` before macro expansion for an `Allocate` node. But we are only able to match allocations again on the `PrintOptoAssembly` output (the current option) which requires a different regex. Therefore, the `IRNode` class is redesigned in the following way: > > - `IRNode` entries which currently represent direct regexes are replaced by "IR node placeholder strings". These strings are just encodings recognized by the IR framework as IR nodes: > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; // Normal IR node > public static final String ALLOC_OF = COMPOSITE_PREFIX + "ALLOC_OF" + POSTFIX; // Composite IR node > > - For each _IR node placeholder string_, we need to define which regex should be used for which compile phase. Additionally, we need to specify the default compile phase for this IR node. 
This is done in a static block immediately following the IR node placeholder string where we add a mapping to `IRNode.IR_NODE_MAPPINGS` (by using helper methods such as `allocNodes()`): > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; > static { > String idealIndependentRegex = START + "Allocate" + MID + END; > String optoRegex = "(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java" + END; > allocNodes(ALLOC, idealIndependentRegex, optoRegex); > } > > **Thus, adding a new IR node now requires two steps: Defining an IR node placeholder string and an associated regex-compile-phase-mapping in the static block immediately following the IR node placeholder string. This mapping should reflect in which compile phases the new IR node can be found.** > > ### Using the IRNode entries correctly > The IR framework enforces the correct usage of the new IR node placeholder strings. It reports format violations if: > - An `IRNode` entry is used for a compile phase which is not supported (e.g. trying to match a `LoopNode` on the output of `CompilePhase.AFTER_PARSING`). > - An IR node entry is used without specifying a mapping in a static block (e.g. when creating a new entry and forgetting to set up the mapping). > - A user-defined regex is used in `@IR` without specifying a non-default compile phase in `phase`. `CompilePhase.DEFAULT` only works with IR node placeholder strings. In all other cases, the user must explicitly specify one or more non-default compile phases. > > ## General Changes > The patch became larger than I intended it to be. I've tried to split it into smaller patches but that was quite difficult and I eventually gave up. I therefore try to summarize the changes to make reviewing this simpler: > > - Added more packages to better group related classes together. 
> - Removed unused IGV phases in `phasetype.hpp` and sorted them in the order in which they are used throughout a compilation. > - Updated existing, now failing, IR tests that use user-defined regexes. These now require a non-default phase. I've fixed that by replacing the user-defined regexes with `IRNode` entries (required some new `IRNode` entries for missing IR nodes). > - Introduced a new interface `Matchable.java` (for all classes representing a part on which IR matching can be done such as `IRMethod` or `IRRule`) to use it throughout the code instead of concrete class references. This allows better code sharing together with the already added `MatchResult.java` interface. Using interface types also simplifies testing as everything is substitutable. Each `Matchable`/`MatchResult` object now just contains a list of `Matchable`/`MatchResult` objects (e.g. an `IRMethod`/`IRMethodResult` contains a list of `IRRule`/`IRRuleMatchResult` objects to represent all IR rules/IR rule match results of this method etc.) > - Cleaned up and refactored a lot of code to use this new design. > - Using the visitor pattern to visit the `MatchResult` objects. This simplifies the failure message and compilation output printing of the different compile phases. It allows us to perform specific operations at each level (i.e. on a method level, an IR rule level etc.). Visitors are also used in testing. > - New compile phase related classes in package `phase`. The idea is that each `IRRule`/`IRRuleMatchResult` class contains a list of `CompilePhaseIRRule`/`CompilePhaseIRRuleMatchResult` objects for each compile phase. When `CompilePhase.DEFAULT` is used, we create or re-use `CompilePhaseIRRule` objects which represent the specified default phases of each `IRNode` entry. > - The `FlagVM` now collects all the needed compile phases and creates a compiler directives file for the `TestVM` accordingly. We now only print the output which is also used for IR matching. 
Before this change, we always emitted `PrintIdeal` and `PrintOptoAssembly`. Even when not using the new matching on compile phases, we still get a performance benefit when only using `IRNode` entries which match on either `PrintIdeal` or `PrintOptoAssembly`. > - The failure message and compilation output are sorted alphabetically by method names and by the enum definition order of `CompilerPhase`. > - Replaced implementation inheritance by interfaces. > - Improved encapsulation of object data. > - Updated README and many comments/class descriptions to reflect this new feature. > - Added new IR framework tests > > ## Testing > - Normal tier testing. > - Applying the patch to Valhalla to perform tier testing. > - Testing with a ZGC branch by @robcasloz which does IR matching on mach graph compile phases. This requirement was the original motivation for this RFE. Thanks Roberto for helping to test this! > > Thanks, > Christian Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 83 commits: - Fix TestVectorConditionalMove - Merge branch 'master' into JDK-8280378 - Hao's patch to address review comments - Roberto's review comments - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/NonIRTestClass.java Co-authored-by: Roberto Castañeda Lozano - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/constraint/raw/RawConstraint.java Co-authored-by: Roberto Castañeda Lozano - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/phase/CompilePhaseIRRuleBuilder.java Co-authored-by: Roberto Castañeda Lozano - Merge branch 'master' into JDK-8280378 - Fix missing counts indentation in failure messages - Update comments - ... 
and 73 more: https://git.openjdk.org/jdk/compare/f502ab85...ae7190c4 ------------- Changes: https://git.openjdk.org/jdk/pull/10695/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10695&range=04 Stats: 9484 lines in 154 files changed: 7149 ins; 1598 del; 737 mod Patch: https://git.openjdk.org/jdk/pull/10695.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10695/head:pull/10695 PR: https://git.openjdk.org/jdk/pull/10695 From chagedorn at openjdk.org Wed Oct 19 08:26:14 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 19 Oct 2022 08:26:14 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases [v5] In-Reply-To: References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> Message-ID: <6Uf_te66EeixH3RjKRaYDOXIgrIB3czPfyt2V3A7FVU=.68f8baf0-60f1-4cc5-8ed6-efe4d8b4c825@github.com> On Wed, 19 Oct 2022 08:19:16 GMT, Christian Hagedorn wrote: >> This patch extends the IR framework with the capability to perform IR matching not only on the `PrintIdeal` and `PrintOptoAssembly` flag outputs but also on the `PrintIdeal` output of the different compile phases defined by the `COMPILER_PHASES` macro in `phasetype.hpp`: >> >> https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/opto/phasetype.hpp#L28-L29 >> >> The IR framework uses the same compile phases with the same names (it only drops those which are part of a normal execution like `PHASE_DEBUG` or `PHASE_FAILURE`). Matching on the `PrintIdeal` and `PrintOptoAssembly` flags is still possible. The IR framework creates its own `CompilePhase` enum entry for them to simplify the implementation. >> >> ## How does it work? 
>> >> ### Basic idea >> There is a new `phase` attribute for `@IR` annotations that allows the user to specify a list of compile phases on which the IR rule should be applied on: >> >> >> int iFld; >> >> @Test >> @IR(counts = {IRNode.STORE_I, "1"}, >> phase = {CompilePhase.AFTER_PARSING, // Fails >> CompilePhase.ITER_GVN1}) // Works >> public void optimizeStores() { >> iFld = 42; >> iFld = 42 + iFld; // Removed in first IGVN iteration and replaced by iFld = 84 >> } >> >> In this example, we apply the IR rule on the compile phases `AFTER_PARSING` and `ITER_GVN1`. Since the first store to `iFld` is only removed in the first IGVN iteration, the IR rule fails for `AFTER_PARSING` while it passes for `ITER_GVN1`: >> >> 1) Method "public void ir_framework.examples.IRExample.optimizeStores()" - [Failed IR rules: 1]: >> * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING, ITER_GVN1}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#STORE_I#_", "1"}, failOn={}, applyIfAnd={}, applyIfOr={}, applyIfNot={})" >> > Phase "After Parsing": >> - counts: Graph contains wrong number of nodes: >> * Constraint 1: "(\d+(\s){2}(StoreI.*)+(\s){2}===.*)" >> - Failed comparison: [found] 2 = 1 [given] >> - Matched nodes (2): >> * 25 StoreI === 5 7 24 21 [[ 31 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:3 (line 123) >> * 31 StoreI === 5 25 24 29 [[ 16 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:14 (line 124) >> >> >> More examples are shown in `IRExample.java` and also in the new test `TestPhaseIRMatching.java`. >> >> ### CompilePhase.DEFAULT - default compile phase >> The existing IR tests either match on the `PrintIdeal` and/or `PrintOptoAssembly` flag. 
Looking closer at the individual `@IR` rules, we can see that a single regex only matches **either** on `PrintIdeal` **or** on `PrintOptoAssembly` but never on both. Most of these regexes are taken from the class `IRNode` which currently provides default regexes for various C2 IR nodes (a default regex either matches on `PrintIdeal` or `PrintOptoAssembly`). >> >> Not only the existing IR tests but also the majority of future tests do not need this new flexibility - they simply want to default match on the `PrintIdeal` flag. To avoid having to specify `phase = CompilePhase.PRINT_IDEAL` for each new rule (and also to avoid updating the large number of existing IR tests), we introduce a new compile phase `CompilePhase.DEFAULT` which is used by default if the user does not specify the `phase` attribute. >> >> Each entry in class `IRNode` now needs to define a default phase such that the IR framework knows which compile phase an `IRNode` should be matched on if `CompilePhase.DEFAULT` is selected. >> >> ### Different regexes for the same IRNode entry >> A regex for an IR node might look different for certain compile phases. For example, we can directly match the node name `Allocate` before macro expansion for an `Allocate` node. But we are only able to match allocations again on the `PrintOptoAssembly` output (the current option) which requires a different regex. Therefore, the `IRNode` class is redesigned in the following way: >> >> - `IRNode` entries which currently represent direct regexes are replaced by "IR node placeholder strings". These strings are just encodings recognized by the IR framework as IR nodes: >> >> public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; // Normal IR node >> public static final String ALLOC_OF = COMPOSITE_PREFIX + "ALLOC_OF" + POSTFIX; // Composite IR node >> >> - For each _IR node placeholder string_, we need to define which regex should be used for which compile phase. 
Additionally, we need to specify the default compile phase for this IR node. This is done in a static block immediately following the IR node placeholder string where we add a mapping to `IRNode.IR_NODE_MAPPINGS` (by using helper methods such as `allocNodes()`): >> >> public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; >> static { >> String idealIndependentRegex = START + "Allocate" + MID + END; >> String optoRegex = "(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java" + END; >> allocNodes(ALLOC, idealIndependentRegex, optoRegex); >> } >> >> **Thus, adding a new IR node now requires two steps: Defining an IR node placeholder string and an associated regex-compile-phase-mapping in the static block immediately following the IR node placeholder string. This mapping should reflect in which compile phases the new IR node can be found.** >> >> ### Using the IRNode entries correctly >> The IR framework enforces the correct usage of the new IR node placeholder strings. It reports format violations if: >> - An `IRNode` entry is used for a compile phase which is not supported (e.g. trying to match a `LoopNode` on the output of `CompilePhase.AFTER_PARSING`). >> - An IR node entry is used without specifying a mapping in a static block (e.g. when creating a new entry and forgetting to set up the mapping). >> - A user-defined regex is used in `@IR` without specifying a non-default compile phase in `phase`. `CompilePhase.DEFAULT` only works with IR node placeholder strings. In all other cases, the user must explicitly specify one or more non-default compile phases. >> >> ## General Changes >> The patch became larger than I intended it to be. I've tried to split it into smaller patches but that was quite difficult and I eventually gave up. I therefore try to summarize the changes to make reviewing this simpler: >> >> - Added more packages to better group related classes together. 
>> - Removed unused IGV phases in `phasetype.hpp` and sorted them in the order in which they are used throughout a compilation. >> - Updated existing, now failing, IR tests that use user-defined regexes. These now require a non-default phase. I've fixed that by replacing the user-defined regexes with `IRNode` entries (required some new `IRNode` entries for missing IR nodes). >> - Introduced a new interface `Matchable.java` (for all classes representing a part on which IR matching can be done such as `IRMethod` or `IRRule`) to use it throughout the code instead of concrete class references. This allows better code sharing together with the already added `MatchResult.java` interface. Using interface types also simplifies testing as everything is substitutable. Each `Matchable`/`MatchResult` object now just contains a list of `Matchable`/`MatchResult` objects (e.g. an `IRMethod`/`IRMethodResult` contains a list of `IRRule`/`IRRuleMatchResult` objects to represent all IR rules/IR rule match results of this method etc.) >> - Cleaned up and refactored a lot of code to use this new design. >> - Using the visitor pattern to visit the `MatchResult` objects. This simplifies the failure message and compilation output printing of the different compile phases. It allows us to perform specific operations at each level (i.e. on a method level, an IR rule level etc.). Visitors are also used in testing. >> - New compile phase related classes in package `phase`. The idea is that each `IRRule`/`IRRuleMatchResult` class contains a list of `CompilePhaseIRRule`/`CompilePhaseIRRuleMatchResult` objects for each compile phase. When `CompilePhase.DEFAULT` is used, we create or re-use `CompilePhaseIRRule` objects which represent the specified default phases of each `IRNode` entry. >> - The `FlagVM` now collects all the needed compile phases and creates a compiler directives file for the `TestVM` accordingly. We now only print the output which is also used for IR matching. 
Before this change, we always emitted `PrintIdeal` and `PrintOptoAssembly`. Even when not using the new matching on compile phases, we still get a performance benefit when only using `IRNode` entries which match on either `PrintIdeal` or `PrintOptoAssembly`. >> - The failure message and compilation output are sorted alphabetically by method names and by the enum definition order of `CompilerPhase`. >> - Replaced implementation inheritance by interfaces. >> - Improved encapsulation of object data. >> - Updated README and many comments/class descriptions to reflect this new feature. >> - Added new IR framework tests >> >> ## Testing >> - Normal tier testing. >> - Applying the patch to Valhalla to perform tier testing. >> - Testing with a ZGC branch by @robcasloz which does IR matching on mach graph compile phases. This requirement was the original motivation for this RFE. Thanks Roberto for helping to test this! >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 83 commits: > > - Fix TestVectorConditionalMove > - Merge branch 'master' into JDK-8280378 > - Hao's patch to address review comments > - Roberto's review comments > - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/NonIRTestClass.java > > Co-authored-by: Roberto Castañeda Lozano > - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/constraint/raw/RawConstraint.java > > Co-authored-by: Roberto Castañeda Lozano > - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/phase/CompilePhaseIRRuleBuilder.java > > Co-authored-by: Roberto Castañeda Lozano > - Merge branch 'master' into JDK-8280378 > - Fix missing counts indentation in failure messages > - Update comments > - ... and 73 more: https://git.openjdk.org/jdk/compare/f502ab85...ae7190c4 Thanks for re-running testing again! I've merged latest master and fixed the test. 
Once reviews are completed, I will merge master one more time before integration and re-submit some testing as well. ------------- PR: https://git.openjdk.org/jdk/pull/10695 From chagedorn at openjdk.org Wed Oct 19 08:34:14 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 19 Oct 2022 08:34:14 GMT Subject: RFR: 8295414: [Aarch64] C2: assert(false) failed: bad AD file [v2] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 07:36:01 GMT, Dean Long wrote: >> The "bad AD file" error is because PartialSubtypeCheck produces an iRegP_R5 result, which cannot be matched as an input where iRegP is expected. >> Update the test to reproduce the crash and updated iRegP rule to match iRegP_R5. >> The fact that this went so long without being noticed makes me wonder how much test coverage PartialSubtypeCheck has received on aarch64. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > allow test to run with release builds Marked as reviewed by chagedorn (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10749 From chagedorn at openjdk.org Wed Oct 19 08:34:20 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 19 Oct 2022 08:34:20 GMT Subject: RFR: 8295414: [Aarch64] C2: assert(false) failed: bad AD file [v2] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 07:30:36 GMT, Dean Long wrote: >> test/hotspot/jtreg/compiler/types/TestSubTypeCheckMacroTrichotomy.java line 33: >> >>> 31: * >>> 32: * @run main/othervm -XX:-BackgroundCompilation TestSubTypeCheckMacroTrichotomy >>> 33: * @run main/othervm -XX:-BackgroundCompilation -XX:+StressReflectiveCode -XX:+ExpandSubTypeCheckAtParseTime >> >> You should additionally add `-XX:+UnlockDiagnosticVMOptions` since `ExpandSubTypeCheckAtParseTime` is a diagnostic VM flag. > > Thanks for catching that. Unfortunately, it also requires a debug build, but if I add @requires vm.debug, then the tests won't run with release builds. 
I went with the next best option: -XX:+IgnoreUnrecognizedVMOptions. Yes, you're right. I agree that this is the best solution in that case. ------------- PR: https://git.openjdk.org/jdk/pull/10749 From xgong at openjdk.org Wed Oct 19 09:28:04 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 19 Oct 2022 09:28:04 GMT Subject: Integrated: 8293409: [vectorapi] Intrinsify VectorSupport.indexVector In-Reply-To: References: Message-ID: On Mon, 19 Sep 2022 08:51:24 GMT, Xiaohong Gong wrote: > "`VectorSupport.indexVector()`" is used to compute a vector that contains the index values based on a given vector and a scale value (`i.e. index = vec + iota * scale`). This function is widely used in other APIs like "`VectorMask.indexInRange`" which is useful to the tail loop vectorization. And it can be easily implemented with the vector instructions. > > This patch adds the vector intrinsic implementation of it. The steps are: > > 1) Load the const "iota" vector. > > We extend the "`vector_iota_indices`" stubs from byte to other integral types. For floating point vectors, it needs an additional vector cast to get the right iota values. > > 2) Compute indexes with "`vec + iota * scale`" > > Here is the performance result to the new added micro benchmark on ARM NEON: > > Benchmark Gain > IndexVectorBenchmark.byteIndexVector 1.477 > IndexVectorBenchmark.doubleIndexVector 5.031 > IndexVectorBenchmark.floatIndexVector 5.342 > IndexVectorBenchmark.intIndexVector 5.529 > IndexVectorBenchmark.longIndexVector 3.177 > IndexVectorBenchmark.shortIndexVector 5.841 > > > Please help to review and share the feedback! Thanks in advance! This pull request has now been integrated. 
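For readers following the indexVector thread above, the computation being intrinsified (`index = vec + iota * scale`) can be modelled in plain scalar Java. This is only an illustrative sketch of the semantics; the class and method names here are hypothetical, and this is neither the Vector API implementation nor the intrinsic itself:

```java
// Scalar model of VectorSupport.indexVector: index = vec + iota * scale,
// where iota is the constant index vector {0, 1, 2, ...} loaded in step 1)
// of the patch, and the multiply-add of step 2) is applied lane-wise.
public class IndexVectorSketch {
    public static int[] indexVector(int[] vec, int scale) {
        int[] index = new int[vec.length];
        for (int lane = 0; lane < vec.length; lane++) {
            index[lane] = vec[lane] + lane * scale; // iota[lane] == lane
        }
        return index;
    }
}
```

A consumer such as `VectorMask.indexInRange` can then compare each resulting index against the loop bound to build the tail-loop mask.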
Changeset: 857b0f9b Author: Xiaohong Gong URL: https://git.openjdk.org/jdk/commit/857b0f9b05bc711f3282a0da85fcff131fffab91 Stats: 391 lines in 14 files changed: 361 ins; 9 del; 21 mod 8293409: [vectorapi] Intrinsify VectorSupport.indexVector Reviewed-by: eliu, jbhateja ------------- PR: https://git.openjdk.org/jdk/pull/10332 From chagedorn at openjdk.org Wed Oct 19 11:58:12 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 19 Oct 2022 11:58:12 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level [v2] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 20:34:11 GMT, Joshua Cao wrote: >> Example: >> >> >> [~/jdk/jdk]$ build/linux-x86_64-server-fastdebug/jdk/bin/java -XX:CompileCommand=PrintCompilation,java.lang.String::length Hello >> CompileCommand: PrintCompilation java/lang/String.length bool PrintCompilation = true >> 223 12 3 java.lang.String::length (11 bytes) >> 405 307 4 java.lang.String::length (11 bytes) >> hello world >> >> >> Running `java -XX:+PrintCompilation` still prints every method. This change also moves the declaration of `elapsedTimer`, but it should have insignificant impact on actual elapsed time. >> >> --- >> >> Additionally, I make a change to `test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java` so that it does not depend on PrintCompilation. The test was failing because it updates global `PrintCompilation` during the middle of the run, but this does not change the value of `PrintCompilationOption` for individual CompileTask directives. >> >> I modified the test so that it is similar to other [WhiteBox vm_flag test](https://github.com/openjdk/jdk/tree/master/test/lib-test/jdk/test/whitebox/vm_flags). It still tests `VmFlagTest.WHITE_BOX::get/setBooleanVMFlag`, without having to depend on the behavior on the specific flag. 
> > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > Set CompileTask directive on initialization, add tests for > PrintCompilation Otherwise, looks good! src/hotspot/share/compiler/compileTask.hpp line 130: > 128: bool is_blocking() const { return _is_blocking; } > 129: bool is_success() const { return _is_success; } > 130: void set_directive(DirectiveSet* directive) { _directive = directive; } This method is now unused and can be removed. src/hotspot/share/compiler/compileTask.hpp line 131: > 129: bool is_success() const { return _is_success; } > 130: void set_directive(DirectiveSet* directive) { _directive = directive; } > 131: DirectiveSet* directive() { return _directive; } The method can still be `const` since we are not changing `this`. Suggestion: DirectiveSet* directive() const { return _directive; } test/hotspot/jtreg/compiler/print/CompileCommandPrintCompilation.java line 30: > 28: * @library /test/lib > 29: * @modules java.base/jdk.internal.misc > 30: * java.management These modules are not used and can be removed. Same for the other test. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10668 From chagedorn at openjdk.org Wed Oct 19 12:36:22 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 19 Oct 2022 12:36:22 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v20] In-Reply-To: <0FbySLoyt8bfmd94aMZ2cHKwjWEqZqjeWY6OGZPx6x8=.f34a5aa9-6941-44db-86b0-09f11cb31267@github.com> References: <0FbySLoyt8bfmd94aMZ2cHKwjWEqZqjeWY6OGZPx6x8=.f34a5aa9-6941-44db-86b0-09f11cb31267@github.com> Message-ID: On Mon, 17 Oct 2022 15:40:59 GMT, Tobias Holenstein wrote: >> Cleanup of the code in IGV without changing the functionality. 
>> >> - removed dead code (unused classes, functions, variables) from the IGV code base >> - merged (and removed) redundant functions >> - added explicit position arguments to `layer.xml` - This avoids the position warning during building of IGV >> - ordered the imports alphabetically, and used wildcards if >= 5 imports of a particular package >> - made class variables `final` whenever possible >> - removed `this.` in `this.function()` function calls when it was not needed >> - used lambdas instead of anonymous classes if possible >> - fixed whitespace issues (e.g. double whitespace) >> - removed not needed copy of `RangeSliderModel tempModel` in `RangeSliderModel.java` >> - changed `EditorTopComponent` to take `InputGraph` as argument in constructor instead of `Diagram` and moved the creation of the `Diagram` to `DiagramViewModel.java` to increase encapsulation/modularity > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > re-add ConnectionSet class > > is needed to highlight hovered edges Otherwise, looks good! Thanks for doing the updates from my previous comments! 
src/utils/IdealGraphVisualizer/HierarchicalLayout/src/main/java/com/sun/hotspot/igv/hierarchicallayout/ClusterInputSlotNode.java line 105: > 103: public Dimension getSize() { > 104: int SIZE = 0; > 105: return new Dimension(SIZE, SIZE); You could directly inline `SIZE`: Suggestion: return new Dimension(0, 0); src/utils/IdealGraphVisualizer/HierarchicalLayout/src/main/java/com/sun/hotspot/igv/hierarchicallayout/ClusterOutputSlotNode.java line 106: > 104: public Dimension getSize() { > 105: int SIZE = 0; > 106: return new Dimension(SIZE, SIZE); Suggestion: return new Dimension(0, 0); src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramScene.java line 466: > 464: private JScrollPane getScrollPane() { > 465: return scrollPane; > 466: } You could directly replace the usages of `getScrollPane()` inside this class by a field access to `scrollPane`. src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramScene.java line 1245: > 1243: private boolean getUndoRedoEnabled() { > 1244: return undoRedoEnabled; > 1245: } You could directly modify the field instead of going over the private setter/getter. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10197 From tholenstein at openjdk.org Wed Oct 19 12:46:26 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 19 Oct 2022 12:46:26 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v21] In-Reply-To: References: Message-ID: > Cleanup of the code in IGV without changing the functionality. 
> > - removed dead code (unused classes, functions, variables) from the IGV code base > - merged (and removed) redundant functions > - added explicit position arguments to `layer.xml` - This avoids the position warning during building of IGV > - ordered the imports alphabetically, and used wildcards if >= 5 imports of a particular package > - made class variables `final` whenever possible > - removed `this.` in `this.function()` function calls when it was not needed > - used lambdas instead of anonymous classes if possible > - fixed whitespace issues (e.g. double whitespace) > - removed not needed copy of `RangeSliderModel tempModel` in `RangeSliderModel.java` > - changed `EditorTopComponent` to take `InputGraph` as argument in constructor instead of `Diagram` and moved the creation of the `Diagram` to `DiagramViewModel.java` to increase encapsulation/modularity Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: update Copyright year ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10197/files - new: https://git.openjdk.org/jdk/pull/10197/files/96425deb..6f2111e2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=20 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=19-20 Stats: 7 lines in 7 files changed: 0 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/10197.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10197/head:pull/10197 PR: https://git.openjdk.org/jdk/pull/10197 From chagedorn at openjdk.org Wed Oct 19 14:01:49 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 19 Oct 2022 14:01:49 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v22] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 13:05:56 GMT, Tobias Holenstein wrote: >> Cleanup of the code in IGV without changing the functionality. 
>> >> - removed dead code (unused classes, functions, variables) from the IGV code base >> - merged (and removed) redundant functions >> - added explicit position arguments to `layer.xml` - This avoids the position warning during building of IGV >> - ordered the imports alphabetically, and used wildcards if >= 5 imports of a particular package >> - made class variables `final` whenever possible >> - removed `this.` in `this.function()` function calls when it was not needed >> - used lambdas instead of anonymous classes if possible >> - fixed whitespace issues (e.g. double whitespace) >> - removed not needed copy of `RangeSliderModel tempModel` in `RangeSliderModel.java` >> - changed `EditorTopComponent` to take `InputGraph` as argument in constructor instead of `Diagram` and moved the creation of the `Diagram` to `DiagramViewModel.java` to increase encapsulation/modularity > > Tobias Holenstein has updated the pull request incrementally with five additional commits since the last revision: > > - Merge branch 'JDK-8290011' of github.com:tobiasholenstein/jdk into JDK-8290011 > - inline SIZE > > Co-authored-by: Christian Hagedorn > - remove setUndoRedoEnabled() and getUndoRedoEnabled() > - remove getScrollPane() > - inline SIZE > > Co-authored-by: Christian Hagedorn Updates look good, thanks! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10197 From tholenstein at openjdk.org Wed Oct 19 14:01:48 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 19 Oct 2022 14:01:48 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v22] In-Reply-To: References: Message-ID: > Cleanup of the code in IGV without changing the functionality. 
> > - removed dead code (unused classes, functions, variables) from the IGV code base > - merged (and removed) redundant functions > - added explicit position arguments to `layer.xml` - This avoids the position warning during building of IGV > - ordered the imports alphabetically, and used wildcards if >= 5 imports of a particular package > - made class variables `final` whenever possible > - removed `this.` in `this.function()` function calls when it was not needed > - used lambdas instead of anonymous classes if possible > - fixed whitespace issues (e.g. double whitespace) > - removed not needed copy of `RangeSliderModel tempModel` in `RangeSliderModel.java` > - changed `EditorTopComponent` to take `InputGraph` as argument in constructor instead of `Diagram` and moved the creation of the `Diagram` to `DiagramViewModel.java` to increase encapsulation/modularity Tobias Holenstein has updated the pull request incrementally with five additional commits since the last revision: - Merge branch 'JDK-8290011' of github.com:tobiasholenstein/jdk into JDK-8290011 - inline SIZE Co-authored-by: Christian Hagedorn - remove setUndoRedoEnabled() and getUndoRedoEnabled() - remove getScrollPane() - inline SIZE Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10197/files - new: https://git.openjdk.org/jdk/pull/10197/files/6f2111e2..d30d693b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=21 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10197&range=20-21 Stats: 53 lines in 3 files changed: 4 ins; 21 del; 28 mod Patch: https://git.openjdk.org/jdk/pull/10197.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10197/head:pull/10197 PR: https://git.openjdk.org/jdk/pull/10197 From tholenstein at openjdk.org Wed Oct 19 14:01:53 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 19 Oct 2022 14:01:53 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v20] In-Reply-To: References: 
<0FbySLoyt8bfmd94aMZ2cHKwjWEqZqjeWY6OGZPx6x8=.f34a5aa9-6941-44db-86b0-09f11cb31267@github.com> Message-ID: On Wed, 19 Oct 2022 12:17:35 GMT, Christian Hagedorn wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> re-add ConnectionSet class >> >> is needed to highlight hovered edges > > src/utils/IdealGraphVisualizer/HierarchicalLayout/src/main/java/com/sun/hotspot/igv/hierarchicallayout/ClusterOutputSlotNode.java line 106: > >> 104: public Dimension getSize() { >> 105: int SIZE = 0; >> 106: return new Dimension(SIZE, SIZE); > > Suggestion: > > return new Dimension(0, 0); done > src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramScene.java line 466: > >> 464: private JScrollPane getScrollPane() { >> 465: return scrollPane; >> 466: } > > You could directly replace the usages of `getScrollPane()` inside this class by a field access to `scrollPane`. done > src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramScene.java line 1245: > >> 1243: private boolean getUndoRedoEnabled() { >> 1244: return undoRedoEnabled; >> 1245: } > > You could directly modify the field instead of going over the private setter/getter. done ------------- PR: https://git.openjdk.org/jdk/pull/10197 From qamai at openjdk.org Wed Oct 19 14:15:33 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 19 Oct 2022 14:15:33 GMT Subject: RFR: 8295662: jdk/incubator/vector tests fail "assert(VM_Version::supports_avx512vlbw()) failed" Message-ID: This patch removes the incorrectly used `evmovdqub` and uses `evmovdqul` because the former requires AVX512BW, which is unavailable on KNL settings. We have `C2_MacroAssembler::load_vector` already so we can use it here instead. Thanks a lot. 
------------- Commit messages: - fix wrongly used evmovdqub Changes: https://git.openjdk.org/jdk/pull/10764/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10764&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295662 Stats: 12 lines in 1 file changed: 0 ins; 11 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10764.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10764/head:pull/10764 PR: https://git.openjdk.org/jdk/pull/10764 From dcubed at openjdk.org Wed Oct 19 14:20:14 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Wed, 19 Oct 2022 14:20:14 GMT Subject: RFR: 8295665: [BACKOUT] JDK-8293409 [vectorapi] Intrinsify VectorSupport.indexVector In-Reply-To: <0XQbvWCNFEyWgyG78LqW28xJyhR8mxvElIIjsKrcNUc=.aecf06e0-c0e6-4c0a-9d0b-e663f8677cea@github.com> References: <0XQbvWCNFEyWgyG78LqW28xJyhR8mxvElIIjsKrcNUc=.aecf06e0-c0e6-4c0a-9d0b-e663f8677cea@github.com> Message-ID: On Wed, 19 Oct 2022 14:12:16 GMT, Tobias Hartmann wrote: >> This reverts commit 857b0f9b05bc711f3282a0da85fcff131fffab91. >> >> This [BACKOUT] was created via: >> `git revert 857b0f9b05bc711f3282a0da85fcff131fffab91` >> and there were no conflicts detected. > > Looks good. @TobiHartmann and @chhagedorn - Thanks for the fast reviews! I've kicked off a sanity check Tier1... @merykitty - It's not clear what testing you've already done on https://github.com/openjdk/jdk/pull/10764 ------------- PR: https://git.openjdk.org/jdk/pull/10765 From bkilambi at openjdk.org Wed Oct 19 14:27:34 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 19 Oct 2022 14:27:34 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v3] In-Reply-To: References: Message-ID: > Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. 
This is helpful in applications that have multiple, consecutive "eor" operations, which can be reduced by combining them into fewer operations using the "eor3" instruction. For example - > > eor a, a, b > eor a, a, c > > can be optimized to a single instruction - `eor3 a, b, c` > > This patch adds backend rules for the Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports the Neon, SVE2 and SHA3 features - > > > Benchmark gain > TestEor3.test1Int 10.87% > TestEor3.test1Long 8.84% > TestEor3.test2Int 21.68% > TestEor3.test2Long 21.04% > > > The numbers shown are performance gains from using the Neon eor3 instruction over the master branch, which uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well, since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits, which makes the SVE2 code generation very similar to the Neon one. 
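For readers unfamiliar with the pattern being optimized, here is a hypothetical Java sketch in the spirit of the included TestEor3 micro benchmark (method and class names are illustrative, not the actual benchmark code): C2 auto-vectorizes the three-way XOR loop, and with the new backend rules the two vector "eor" operations per iteration can be emitted as a single "eor3" on SHA3-capable hardware.

```java
// Hypothetical sketch: a three-way XOR loop that C2 can auto-vectorize.
// With the new backend rules, the two vector XORs per iteration may be
// fused into one eor3 instruction on SHA3-capable aarch64 hardware.
public class Eor3Sketch {
    static int[] xor3(int[] a, int[] b, int[] c) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = a[i] ^ b[i] ^ c[i]; // eor + eor -> eor3
        }
        return r;
    }

    public static void main(String[] args) {
        int[] r = xor3(new int[]{1, 2}, new int[]{2, 4}, new int[]{4, 8});
        System.out.println(r[0] + " " + r[1]); // prints "7 14"
    }
}
```

The fusion itself is invisible at the source level; it only changes the machine code the JIT emits for loops of this shape.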
Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Changed the modifier order preference in JTREG test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10407/files - new: https://git.openjdk.org/jdk/pull/10407/files/6df4f014..449524ad Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10407&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10407&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10407.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10407/head:pull/10407 PR: https://git.openjdk.org/jdk/pull/10407 From bkilambi at openjdk.org Wed Oct 19 14:27:39 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 19 Oct 2022 14:27:39 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v2] In-Reply-To: References: Message-ID: On Tue, 18 Oct 2022 20:24:07 GMT, Andrey Turbanov wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Modified JTREG test to include feature constraints > > test/hotspot/jtreg/compiler/vectorization/TestEor3AArch64.java line 44: > >> 42: public class TestEor3AArch64 { >> 43: >> 44: private final static int LENGTH = 2048; > > Suggestion: > > private static final int LENGTH = 2048; Hello, thank you for your feedback. I have made the suggested changes and uploaded a new patch. Please review .. 
------------- PR: https://git.openjdk.org/jdk/pull/10407 From chagedorn at openjdk.org Wed Oct 19 14:30:05 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 19 Oct 2022 14:30:05 GMT Subject: RFR: 8295662: jdk/incubator/vector tests fail "assert(VM_Version::supports_avx512vlbw()) failed" In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 14:08:39 GMT, Quan Anh Mai wrote: > This patch removes the incorrectly used `evmovdqub` and uses `evmovdqul` because the former requires AVX512BW, which is unavailable on KNL settings. We have `C2_MacroAssembler::load_vector` already so we can use it here instead. > > Thanks a lot. Thanks for proposing a fix that quickly, looks reasonable! I've submitted some testing (see comments in #10765). ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10764 From chagedorn at openjdk.org Wed Oct 19 14:30:07 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 19 Oct 2022 14:30:07 GMT Subject: RFR: 8295665: [BACKOUT] JDK-8293409 [vectorapi] Intrinsify VectorSupport.indexVector In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 14:13:38 GMT, Quan Anh Mai wrote: >> This reverts commit 857b0f9b05bc711f3282a0da85fcff131fffab91. >> >> This [BACKOUT] was created via: >> `git revert 857b0f9b05bc711f3282a0da85fcff131fffab91` >> and there were no conflicts detected. > > Can you check if #10764 fixes the issue? Thanks a lot. @merykitty's fix looks reasonable. I've started tier3 testing where we've seen these failures. I've also started sanity tier1+2 testing. If that looks clean, we could go with @merykitty's fix instead. 
------------- PR: https://git.openjdk.org/jdk/pull/10765 From thartmann at openjdk.org Wed Oct 19 14:33:21 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 19 Oct 2022 14:33:21 GMT Subject: RFR: 8295662: jdk/incubator/vector tests fail "assert(VM_Version::supports_avx512vlbw()) failed" In-Reply-To: References: Message-ID: <7Yi0NO1BL_SqwIU_4WQNvUZL32wrg9rCcK2upoTkV30=.a0b684ec-b386-4de5-a371-ce72a857357c@github.com> On Wed, 19 Oct 2022 14:08:39 GMT, Quan Anh Mai wrote: > This patch removes the incorrectly used `evmovdqub` and uses `evmovdqul` because the former requires AVX512BW, which is unavailable on KNL settings. We have `C2_MacroAssembler::load_vector` already so we can use it here instead. > > Thanks a lot. Looks reasonable to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10764 From dcubed at openjdk.org Wed Oct 19 14:38:05 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Wed, 19 Oct 2022 14:38:05 GMT Subject: RFR: 8295665: [BACKOUT] JDK-8293409 [vectorapi] Intrinsify VectorSupport.indexVector In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 14:08:55 GMT, Daniel D. Daugherty wrote: > This reverts commit 857b0f9b05bc711f3282a0da85fcff131fffab91. > > This [BACKOUT] was created via: > `git revert 857b0f9b05bc711f3282a0da85fcff131fffab91` > and there were no conflicts detected. Sounds like a good plan to me. 
------------- PR: https://git.openjdk.org/jdk/pull/10765 From duke at openjdk.org Wed Oct 19 15:39:25 2022 From: duke at openjdk.org (Joshua Cao) Date: Wed, 19 Oct 2022 15:39:25 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level [v3] In-Reply-To: References: Message-ID: > Example: > > > [~/jdk/jdk]$ build/linux-x86_64-server-fastdebug/jdk/bin/java -XX:CompileCommand=PrintCompilation,java.lang.String::length Hello > CompileCommand: PrintCompilation java/lang/String.length bool PrintCompilation = true > 223 12 3 java.lang.String::length (11 bytes) > 405 307 4 java.lang.String::length (11 bytes) > hello world > > > Running `java -XX:+PrintCompilation` still prints every method. This change also moves the declaration of `elapsedTimer`, but it should have insignificant impact on actual elapsed time. > > --- > > Additionally, I make a change to `test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java` so that it does not depend on PrintCompilation. The test was failing because it updates global `PrintCompilation` during the middle of the run, but this does not change the value of `PrintCompilationOption` for individual CompileTask directives. > > I modified the test so that it is similar to other [WhiteBox vm_flag test](https://github.com/openjdk/jdk/tree/master/test/lib-test/jdk/test/whitebox/vm_flags). It still tests `VmFlagTest.WHITE_BOX::get/setBooleanVMFlag`, without having to depend on the behavior on the specific flag. 
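For context, a minimal sketch of the kind of `Hello` program the example command line above assumes (the actual test program is not shown in this thread): it only needs to call `String::length` often enough for the method to become hot and get JIT-compiled, so the per-method PrintCompilation lines appear.

```java
// Hypothetical Hello program matching the example command line above.
// Calling String.length() in a hot loop makes the method eligible for
// JIT compilation, which triggers the per-method PrintCompilation output.
public class Hello {
    public static void main(String[] args) {
        String s = "hello world";
        int total = 0;
        for (int i = 0; i < 1_000_000; i++) {
            total += s.length(); // drives String::length through the compile tiers
        }
        System.out.println(s); // prints "hello world"
    }
}
```

Any workload that exercises the filtered method would do; the loop is just the simplest way to get both a tier 3 and a tier 4 compilation of `String::length`.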
Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: Remove unused methods ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10668/files - new: https://git.openjdk.org/jdk/pull/10668/files/2eb4c9c8..1b4d6f0c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10668&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10668&range=01-02 Stats: 6 lines in 3 files changed: 0 ins; 5 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10668.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10668/head:pull/10668 PR: https://git.openjdk.org/jdk/pull/10668 From duke at openjdk.org Wed Oct 19 15:40:59 2022 From: duke at openjdk.org (Joshua Cao) Date: Wed, 19 Oct 2022 15:40:59 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level [v2] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 11:54:23 GMT, Christian Hagedorn wrote: > Otherwise, looks good! Thanks for the review. I've added all your suggestions. ------------- PR: https://git.openjdk.org/jdk/pull/10668 From dcubed at openjdk.org Wed Oct 19 15:44:01 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Wed, 19 Oct 2022 15:44:01 GMT Subject: RFR: 8295665: [BACKOUT] JDK-8293409 [vectorapi] Intrinsify VectorSupport.indexVector In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 14:08:55 GMT, Daniel D. Daugherty wrote: > This reverts commit 857b0f9b05bc711f3282a0da85fcff131fffab91. > > This [BACKOUT] was created via: > `git revert 857b0f9b05bc711f3282a0da85fcff131fffab91` > and there were no conflicts detected. My Tier1 testing for the [BACKOUT] has passed with no failures. I'm still in a holding pattern for integration of the [BACKOUT] for now. 
------------- PR: https://git.openjdk.org/jdk/pull/10765 From tholenstein at openjdk.org Wed Oct 19 15:50:08 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 19 Oct 2022 15:50:08 GMT Subject: RFR: JDK-8290011: IGV: Remove dead code and cleanup [v18] In-Reply-To: <27ETKLx8aWB7ixXbmfDXnGFFcFQNNnOLx9WHvPDQPNQ=.17cf231d-c617-4a15-ac96-077995b0e9ce@github.com> References: <27ETKLx8aWB7ixXbmfDXnGFFcFQNNnOLx9WHvPDQPNQ=.17cf231d-c617-4a15-ac96-077995b0e9ce@github.com> Message-ID: On Tue, 4 Oct 2022 16:57:56 GMT, Roberto Casta?eda Lozano wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> re-add hideDuplicates.png > > Thanks for addressing my comments, Tobias! > > I tested the changeset manually and hit the following exception when placing the mouse pointer on graph edges to show their tooltips: > > > [INFO] java.lang.ClassCastException: class com.sun.hotspot.igv.view.widgets.InputSlotWidget cannot be cast to class com.sun.hotspot.igv.graph.Figure (com.sun.hotspot.igv.view.widgets.InputSlotWidget is in unnamed module of loader org.netbeans.StandardModule$OneModuleClassLoader @17b716f7; com.sun.hotspot.igv.graph.Figure is in unnamed module of loader org.netbeans.StandardModule$OneModuleClassLoader @6fcf432a) > [INFO] at com.sun.hotspot.igv.view.widgets.LineWidget$1.select(LineWidget.java:142) > [INFO] at org.netbeans.modules.visual.action.SelectAction.mouseReleased(SelectAction.java:86) > [INFO] at org.netbeans.api.visual.widget.SceneComponent$Operator$3.operate(SceneComponent.java:535) > [INFO] at org.netbeans.api.visual.widget.SceneComponent.processLocationOperator(SceneComponent.java:250) > [INFO] at org.netbeans.api.visual.widget.SceneComponent.mouseReleased(SceneComponent.java:137) > [INFO] at java.desktop/java.awt.AWTEventMulticaster.mouseReleased(AWTEventMulticaster.java:297) > [INFO] at java.desktop/java.awt.Component.processMouseEvent(Component.java:6635) > [INFO] at 
java.desktop/javax.swing.JComponent.processMouseEvent(JComponent.java:3342) > [INFO] at java.desktop/java.awt.Component.processEvent(Component.java:6400) > [INFO] at java.desktop/java.awt.Container.processEvent(Container.java:2263) > [INFO] at java.desktop/java.awt.Component.dispatchEventImpl(Component.java:5011) > [INFO] at java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2321) > [INFO] at java.desktop/java.awt.Component.dispatchEvent(Component.java:4843) > [INFO] at java.desktop/java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4918) > [INFO] at java.desktop/java.awt.LightweightDispatcher.processMouseEvent(Container.java:4547) > [INFO] at java.desktop/java.awt.LightweightDispatcher.dispatchEvent(Container.java:4488) > [INFO] at java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2307) > [INFO] at java.desktop/java.awt.Window.dispatchEventImpl(Window.java:2772) > [INFO] at java.desktop/java.awt.Component.dispatchEvent(Component.java:4843) > [INFO] at java.desktop/java.awt.EventQueue.dispatchEventImpl(EventQueue.java:772) > [INFO] at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:721) > [INFO] at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:715) > [INFO] at java.base/java.security.AccessController.doPrivileged(Native Method) > [INFO] at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85) > [INFO] at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:95) > [INFO] at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:745) > [INFO] at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:743) > [INFO] at java.base/java.security.AccessController.doPrivileged(Native Method) > [INFO] at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85) > [INFO] at java.desktop/java.awt.EventQueue.dispatchEvent(EventQueue.java:742) > [INFO] 
at org.netbeans.core.TimableEventQueue.dispatchEvent(TimableEventQueue.java:136) > [INFO] [catch] at java.desktop/java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:203) > [INFO] at java.desktop/java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:124) > [INFO] at java.desktop/java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:113) > [INFO] at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:109) > [INFO] at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101) > [INFO] at java.desktop/java.awt.EventDispatchThread.run(EventDispatchThread.java:90) @robcasloz, @turbanoff and @chhagedorn thanks for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/10197 From tholenstein at openjdk.org Wed Oct 19 15:53:11 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 19 Oct 2022 15:53:11 GMT Subject: Integrated: JDK-8290011: IGV: Remove dead code and cleanup In-Reply-To: References: Message-ID: On Wed, 7 Sep 2022 11:45:45 GMT, Tobias Holenstein wrote: > Cleanup of the code in IGV without changing the functionality. > > - removed dead code (unused classes, functions, variables) from the IGV code base > - merged (and removed) redundant functions > - added explicit position arguments to `layer.xml` - This avoids the position warning during building of IGV > - ordered the inputs alphabetically, and used wildcards if >= 5 imports of a particular package > - made class variables `final` whenever possible > - removed `this.` in `this.function()` function calls when it was not needed > - used lambdas instead of anonymous classes if possible > - fixed whitespace issues (e.g. 
double whitespace) > - removed an unneeded copy of `RangeSliderModel tempModel` in `RangeSliderModel.java` > - changed `EditorTopComponent` to take `InputGraph` as argument in constructor instead of `Diagram` and moved the creation of the `Diagram` to `DiagramViewModel.java` to increase encapsulation/modularity This pull request has now been integrated. Changeset: e27bea0c Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/e27bea0c4db282e26d0d96611bb330e02c314d48 Stats: 4529 lines in 133 files changed: 340 ins; 3375 del; 814 mod 8290011: IGV: Remove dead code and cleanup Reviewed-by: chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/10197 From xliu at openjdk.org Wed Oct 19 15:59:07 2022 From: xliu at openjdk.org (Xin Liu) Date: Wed, 19 Oct 2022 15:59:07 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level [v3] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 15:39:25 GMT, Joshua Cao wrote: >> Example: >> >> >> [~/jdk/jdk]$ build/linux-x86_64-server-fastdebug/jdk/bin/java -XX:CompileCommand=PrintCompilation,java.lang.String::length Hello >> CompileCommand: PrintCompilation java/lang/String.length bool PrintCompilation = true >> 223 12 3 java.lang.String::length (11 bytes) >> 405 307 4 java.lang.String::length (11 bytes) >> hello world >> >> >> Running `java -XX:+PrintCompilation` still prints every method. This change also moves the declaration of `elapsedTimer`, but it should have insignificant impact on actual elapsed time. >> >> --- >> >> Additionally, I make a change to `test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java` so that it does not depend on PrintCompilation. The test was failing because it updates global `PrintCompilation` during the middle of the run, but this does not change the value of `PrintCompilationOption` for individual CompileTask directives. 
>> >> I modified the test so that it is similar to other [WhiteBox vm_flag test](https://github.com/openjdk/jdk/tree/master/test/lib-test/jdk/test/whitebox/vm_flags). It still tests `VmFlagTest.WHITE_BOX::get/setBooleanVMFlag`, without having to depend on the behavior on the specific flag. > > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > Remove unused methods still LGTM. ------------- Marked as reviewed by xliu (Committer). PR: https://git.openjdk.org/jdk/pull/10668 From jbhateja at openjdk.org Wed Oct 19 16:30:04 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 19 Oct 2022 16:30:04 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 21:28:26 GMT, vpaprotsk wrote: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 
1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s Some initial assembler level comments. src/hotspot/cpu/x86/assembler_x86.cpp line 5484: > 5482: > 5483: void Assembler::evpunpckhqdq(XMMRegister dst, KRegister mask, XMMRegister src1, XMMRegister src2, bool merge, int vector_len) { > 5484: assert(UseAVX > 2, "requires AVX512F"); Please replace flag with feature EVEX check. src/hotspot/cpu/x86/assembler_x86.cpp line 7831: > 7829: > 7830: void Assembler::vpandq(XMMRegister dst, XMMRegister nds, Address src, int vector_len) { > 7831: assert(VM_Version::supports_evex(), ""); Assertion should check existence of AVX512VL for non 512 but vectors. src/hotspot/cpu/x86/assembler_x86.cpp line 7958: > 7956: > 7957: void Assembler::vporq(XMMRegister dst, XMMRegister nds, Address src, int vector_len) { > 7958: assert(VM_Version::supports_evex(), ""); Same as above src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 1960: > 1958: address StubGenerator::generate_poly1305_masksCP() { > 1959: StubCodeMark mark(this, "StubRoutines", "generate_poly1305_masksCP"); > 1960: address start = __ pc(); You may use [align64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/stubGenerator_x86_64.cpp#L777) here, like ------------- PR: https://git.openjdk.org/jdk/pull/10582 From chagedorn at openjdk.org Wed Oct 19 16:33:00 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 19 Oct 2022 16:33:00 GMT Subject: RFR: 8295668: validate-source failure after JDK-8290011 In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 16:20:38 GMT, Daniel D. 
Daugherty wrote: > A trivial copyright fix for validate-source failure after JDK-8290011. Looks good and trivial! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10772 From dcubed at openjdk.org Wed Oct 19 16:33:01 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Wed, 19 Oct 2022 16:33:01 GMT Subject: RFR: 8295668: validate-source failure after JDK-8290011 In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 16:28:08 GMT, Christian Hagedorn wrote: >> A trivial copyright fix for validate-source failure after JDK-8290011. > > Looks good and trivial! @chhagedorn - Thanks for the fast review. ------------- PR: https://git.openjdk.org/jdk/pull/10772 From dcubed at openjdk.org Wed Oct 19 16:35:58 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Wed, 19 Oct 2022 16:35:58 GMT Subject: Integrated: 8295668: validate-source failure after JDK-8290011 In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 16:20:38 GMT, Daniel D. Daugherty wrote: > A trivial copyright fix for validate-source failure after JDK-8290011. This pull request has now been integrated. Changeset: 5eaf5686 Author: Daniel D. 
Daugherty URL: https://git.openjdk.org/jdk/commit/5eaf5686656a10ee27977de23ed5290a723b96a8 Stats: 10 lines in 9 files changed: 0 ins; 0 del; 10 mod 8295668: validate-source failure after JDK-8290011 Reviewed-by: chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/10772 From chagedorn at openjdk.org Wed Oct 19 16:41:00 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 19 Oct 2022 16:41:00 GMT Subject: RFR: 8295662: jdk/incubator/vector tests fail "assert(VM_Version::supports_avx512vlbw()) failed" In-Reply-To: References: Message-ID: <6dVRd0FvqV-Bf4LPac2wLsmyXoJ0t7hajOk2Ble1MdM=.0f380d78-4787-4a44-9745-06966ee2499e@github.com> On Wed, 19 Oct 2022 14:08:39 GMT, Quan Anh Mai wrote: > This patch removes the incorrectly used `evmovdqub` and uses `evmovdqul` because the former requires AVX512BW, which is unavailable on KNL settings. We have `C2_MacroAssembler::load_vector` already so we can use it here instead. > > Thanks a lot. Testing looked good! Tiers 1 and 2 are completed and tier 3 is almost completed, but the relevant tasks were successful. @merykitty you can proceed with the integration to reduce the noise in the JDK 20 CI. Thanks, Christian ------------- PR: https://git.openjdk.org/jdk/pull/10764 From chagedorn at openjdk.org Wed Oct 19 16:41:03 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 19 Oct 2022 16:41:03 GMT Subject: RFR: 8295665: [BACKOUT] JDK-8293409 [vectorapi] Intrinsify VectorSupport.indexVector In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 14:13:38 GMT, Quan Anh Mai wrote: >> This reverts commit 857b0f9b05bc711f3282a0da85fcff131fffab91. >> >> This [BACKOUT] was created via: >> `git revert 857b0f9b05bc711f3282a0da85fcff131fffab91` >> and there were no conflicts detected. > > Can you check if #10764 fixes the issue? Thanks a lot. Testing of #10764 looked good! Tiers 1 and 2 are completed and tier 3 is almost completed, but the relevant tasks were successful. 
@merykitty you can proceed with the integration of #10764. Thanks, Christian ------------- PR: https://git.openjdk.org/jdk/pull/10765 From qamai at openjdk.org Wed Oct 19 16:41:01 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 19 Oct 2022 16:41:01 GMT Subject: RFR: 8295662: jdk/incubator/vector tests fail "assert(VM_Version::supports_avx512vlbw()) failed" In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 14:08:39 GMT, Quan Anh Mai wrote: > This patch removes the incorrectly used `evmovdqub` and uses `evmovdqul` because the former requires AVX512BW, which is unavailable on KNL settings. We have `C2_MacroAssembler::load_vector` already so we can use it here instead. > > Thanks a lot. Thanks very much for your reviews. ------------- PR: https://git.openjdk.org/jdk/pull/10764 From qamai at openjdk.org Wed Oct 19 16:41:05 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 19 Oct 2022 16:41:05 GMT Subject: RFR: 8295665: [BACKOUT] JDK-8293409 [vectorapi] Intrinsify VectorSupport.indexVector In-Reply-To: References: Message-ID: <9LORcEXrNh6vMiZTYrPUWrNhNIjBJgVr66Enzrgb9Mk=.c8df746f-3a6a-406d-abfb-852f04be9ed7@github.com> On Wed, 19 Oct 2022 14:08:55 GMT, Daniel D. Daugherty wrote: > This reverts commit 857b0f9b05bc711f3282a0da85fcff131fffab91. > > This [BACKOUT] was created via: > `git revert 857b0f9b05bc711f3282a0da85fcff131fffab91` > and there were no conflicts detected. Thanks a lot for your testing, I will integrate now ------------- PR: https://git.openjdk.org/jdk/pull/10765 From dcubed at openjdk.org Wed Oct 19 16:44:21 2022 From: dcubed at openjdk.org (Daniel D. 
Daugherty) Date: Wed, 19 Oct 2022 16:44:21 GMT Subject: RFR: 8295662: jdk/incubator/vector tests fail "assert(VM_Version::supports_avx512vlbw()) failed" In-Reply-To: References: Message-ID: <83rzhXAKclZuA6RAyE7WSc4rskzwo44bARVUEB2JYQ0=.05a18c87-6945-4b19-a808-21082b716a2b@github.com> On Wed, 19 Oct 2022 16:37:13 GMT, Quan Anh Mai wrote: >> This patch removes the incorrectly used `evmovdqub` and uses `evmovdqul` because the former requires AVX512BW, which is unavailable on KNL settings. We have `C2_MacroAssembler::load_vector` already so we can use it here instead. >> >> Thanks a lot. > > Thanks very much for your reviews. @merykitty - Thanks for jumping in on these test failures quickly and for fixing the issue. I appreciate your efforts. ------------- PR: https://git.openjdk.org/jdk/pull/10764 From qamai at openjdk.org Wed Oct 19 16:44:22 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 19 Oct 2022 16:44:22 GMT Subject: Integrated: 8295662: jdk/incubator/vector tests fail "assert(VM_Version::supports_avx512vlbw()) failed" In-Reply-To: References: Message-ID: <3VuHnIIFK4crU935d6Vr-lYLH6GYWq2M7cACDRbWTrs=.97c982f3-2325-4b36-8f52-d9ade6506cb0@github.com> On Wed, 19 Oct 2022 14:08:39 GMT, Quan Anh Mai wrote: > This patch removes the incorrectly used `evmovdqub` and uses `evmovdqul` because the former requires AVX512BW, which is unavailable on KNL settings. We have `C2_MacroAssembler::load_vector` already so we can use it here instead. > > Thanks a lot. This pull request has now been integrated. 
Changeset: 7b1c6767 Author: Quan Anh Mai URL: https://git.openjdk.org/jdk/commit/7b1c6767fc5ea90630776e5bfa0fcc47ffc89aa6 Stats: 12 lines in 1 file changed: 0 ins; 11 del; 1 mod 8295662: jdk/incubator/vector tests fail "assert(VM_Version::supports_avx512vlbw()) failed" Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/10764 From chagedorn at openjdk.org Wed Oct 19 16:47:15 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 19 Oct 2022 16:47:15 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level [v3] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 15:39:25 GMT, Joshua Cao wrote: >> Example: >> >> >> [~/jdk/jdk]$ build/linux-x86_64-server-fastdebug/jdk/bin/java -XX:CompileCommand=PrintCompilation,java.lang.String::length Hello >> CompileCommand: PrintCompilation java/lang/String.length bool PrintCompilation = true >> 223 12 3 java.lang.String::length (11 bytes) >> 405 307 4 java.lang.String::length (11 bytes) >> hello world >> >> >> Running `java -XX:+PrintCompilation` still prints every method. This change also moves the declaration of `elapsedTimer`, but it should have insignificant impact on actual elapsed time. >> >> --- >> >> Additionally, I make a change to `test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java` so that it does not depend on PrintCompilation. The test was failing because it updates global `PrintCompilation` during the middle of the run, but this does not change the value of `PrintCompilationOption` for individual CompileTask directives. >> >> I modified the test so that it is similar to other [WhiteBox vm_flag test](https://github.com/openjdk/jdk/tree/master/test/lib-test/jdk/test/whitebox/vm_flags). It still tests `VmFlagTest.WHITE_BOX::get/setBooleanVMFlag`, without having to depend on the behavior on the specific flag. 
> > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > Remove unused methods Thanks for doing the updates! Looks good to me, too. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10668 From dcubed at openjdk.org Wed Oct 19 16:47:18 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Wed, 19 Oct 2022 16:47:18 GMT Subject: RFR: 8295665: [BACKOUT] JDK-8293409 [vectorapi] Intrinsify VectorSupport.indexVector In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 14:08:55 GMT, Daniel D. Daugherty wrote: > This reverts commit 857b0f9b05bc711f3282a0da85fcff131fffab91. > > This [BACKOUT] was created via: > `git revert 857b0f9b05bc711f3282a0da85fcff131fffab91` > and there were no conflicts detected. The fix for the following bug: [JDK-8295662](https://bugs.openjdk.org/browse/JDK-8295662) jdk/incubator/vector tests fail "assert(VM_Version::supports_avx512vlbw()) failed" has been reviewed and tested so we no longer need this [BACKOUT]. ------------- PR: https://git.openjdk.org/jdk/pull/10765 From dcubed at openjdk.org Wed Oct 19 16:47:18 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Wed, 19 Oct 2022 16:47:18 GMT Subject: Withdrawn: 8295665: [BACKOUT] JDK-8293409 [vectorapi] Intrinsify VectorSupport.indexVector In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 14:08:55 GMT, Daniel D. Daugherty wrote: > This reverts commit 857b0f9b05bc711f3282a0da85fcff131fffab91. > > This [BACKOUT] was created via: > `git revert 857b0f9b05bc711f3282a0da85fcff131fffab91` > and there were no conflicts detected. This pull request has been closed without being integrated. 
------------- PR: https://git.openjdk.org/jdk/pull/10765 From kvn at openjdk.org Wed Oct 19 18:16:07 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 19 Oct 2022 18:16:07 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level [v3] In-Reply-To: References: Message-ID: <04I7tznzEdV7ivRklx4B9LMmAm8RidUBSaJSrp6z2Qo=.037fc704-332b-48da-b0a3-b96396f72a3c@github.com> On Wed, 19 Oct 2022 15:39:25 GMT, Joshua Cao wrote: >> Example: >> >> >> [~/jdk/jdk]$ build/linux-x86_64-server-fastdebug/jdk/bin/java -XX:CompileCommand=PrintCompilation,java.lang.String::length Hello >> CompileCommand: PrintCompilation java/lang/String.length bool PrintCompilation = true >> 223 12 3 java.lang.String::length (11 bytes) >> 405 307 4 java.lang.String::length (11 bytes) >> hello world >> >> >> Running `java -XX:+PrintCompilation` still prints every method. This change also moves the declaration of `elapsedTimer`, but it should have insignificant impact on actual elapsed time. >> >> --- >> >> Additionally, I make a change to `test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java` so that it does not depend on PrintCompilation. The test was failing because it updates global `PrintCompilation` during the middle of the run, but this does not change the value of `PrintCompilationOption` for individual CompileTask directives. >> >> I modified the test so that it is similar to other [WhiteBox vm_flag test](https://github.com/openjdk/jdk/tree/master/test/lib-test/jdk/test/whitebox/vm_flags). It still tests `VmFlagTest.WHITE_BOX::get/setBooleanVMFlag`, without having to depend on the behavior on the specific flag. > > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > Remove unused methods Latest changes passed testing. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10668 From shade at openjdk.org Wed Oct 19 18:52:01 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 19 Oct 2022 18:52:01 GMT Subject: RFR: 8294467: Fix sequence-point warnings in Hotspot [v2] In-Reply-To: References: Message-ID: On Mon, 17 Oct 2022 18:21:48 GMT, Aleksey Shipilev wrote: >> There seem to be the only place in Hotspot where this warning fires, yet the warning is disabled wholesale for Hotspot. This is not good. >> >> I can trace the addition of sequence-point exclusion to [JDK-8211029](https://bugs.openjdk.org/browse/JDK-8211029) (Sep 2018), yet the only place where it triggers introduced by [JDK-8259609](https://bugs.openjdk.org/browse/JDK-8259609) (Oct 2021). It seems other places were fixed meanwhile. >> >> I believe the fixed place is just a simple leftover. Right, @rwestrel? >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `tier1` >> - [x] Linux x86_64 fastdebug `tier2` >> - [x] The build matrix of: >> - GCC 10 >> - {i686, x86_64, aarch64, powerpc64le, s390x, armhf, riscv64} >> - {server} >> - {release, fastdebug} > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: > > - Merge branch 'master' into JDK-8294467-warning-sequence-point > - Fix Thank you all! ------------- PR: https://git.openjdk.org/jdk/pull/10454 From shade at openjdk.org Wed Oct 19 18:55:29 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 19 Oct 2022 18:55:29 GMT Subject: Integrated: 8294467: Fix sequence-point warnings in Hotspot In-Reply-To: References: Message-ID: On Tue, 27 Sep 2022 16:34:17 GMT, Aleksey Shipilev wrote: > There seem to be the only place in Hotspot where this warning fires, yet the warning is disabled wholesale for Hotspot. This is not good. 
> > I can trace the addition of sequence-point exclusion to [JDK-8211029](https://bugs.openjdk.org/browse/JDK-8211029) (Sep 2018), yet the only place where it triggers introduced by [JDK-8259609](https://bugs.openjdk.org/browse/JDK-8259609) (Oct 2021). It seems other places were fixed meanwhile. > > I believe the fixed place is just a simple leftover. Right, @rwestrel? > > Additional testing: > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_64 fastdebug `tier2` > - [x] The build matrix of: > - GCC 10 > - {i686, x86_64, aarch64, powerpc64le, s390x, armhf, riscv64} > - {server} > - {release, fastdebug} This pull request has now been integrated. Changeset: 388a56e4 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/388a56e4c4278f2a3da31946b15a45f3aee25e58 Stats: 2 lines in 2 files changed: 0 ins; 1 del; 1 mod 8294467: Fix sequence-point warnings in Hotspot Reviewed-by: dholmes, thartmann, roland ------------- PR: https://git.openjdk.org/jdk/pull/10454 From dlong at openjdk.org Wed Oct 19 18:59:04 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Oct 2022 18:59:04 GMT Subject: RFR: 8295414: [Aarch64] C2: assert(false) failed: bad AD file [v2] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 07:36:01 GMT, Dean Long wrote: >> The "bad AD file" error is because PartialSubtypeCheck produces an iRegP_R5 result, which cannot be matched as an input where iRegP is expected. >> Update the test to reproduce the crash and updated iRegP rule to match iRegP_R5. >> The fact that this went so long without being noticed makes me wonder how much test coverage PartialSubtypeCheck has received on aarch64. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > allow test to run with release builds Thanks Christian and Vladimir. 
------------- PR: https://git.openjdk.org/jdk/pull/10749 From kvn at openjdk.org Wed Oct 19 19:14:08 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 19 Oct 2022 19:14:08 GMT Subject: RFR: 8255746: Make PrintCompilation available on a per method level [v2] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 15:38:24 GMT, Joshua Cao wrote: >> Otherwise, looks good! > >> Otherwise, looks good! > > Thanks for the review. I've added all your suggestions. @caojoshua you may need to issue integration command again. ------------- PR: https://git.openjdk.org/jdk/pull/10668 From eastigeevich at openjdk.org Wed Oct 19 19:41:03 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 19 Oct 2022 19:41:03 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v5] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 09:42:54 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > cleanup and rename src/hotspot/share/code/compressedStream.cpp line 182: > 180: return; > 181: } > 182: int bit0 = 0x80; // first byte upper bit is set to indicate a value is not zero This is untraditional and confusing. In a byte we are preparing it is bit 7. The same comment is about `bit1`. 
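The layout under review (bit 7 of the first byte as the non-zero flag, bit 6 as a continuation flag, and 7-bit payloads with a bit-7 continuation flag in later bytes) can be sketched as follows. This is an illustrative re-implementation reconstructed from the review comments and the encoding table posted later in the thread, not the actual compressedStream.cpp code:

```java
import java.util.Arrays;

public class UIntEncoding {
    // Encode an unsigned 32-bit value. First byte: bit 7 = non-zero flag,
    // bit 6 = more-bytes flag, low 6 bits = payload. Later bytes:
    // bit 7 = more-bytes flag, low 7 bits = payload. Zero is one 0x00 byte.
    static byte[] encode(int value) {
        if (value == 0) return new byte[] { 0 };
        byte[] buf = new byte[5];
        int n = 0;
        int first = 0x80 | (value & 0x3F);
        value >>>= 6;
        if (value != 0) first |= 0x40;
        buf[n++] = (byte) first;
        while (value != 0) {
            int b = value & 0x7F;
            value >>>= 7;
            if (value != 0) b |= 0x80;
            buf[n++] = (byte) b;
        }
        return Arrays.copyOf(buf, n);
    }

    static int decode(byte[] in) {
        int b = in[0] & 0xFF;
        if ((b & 0x80) == 0) return 0;     // non-zero flag clear: value is 0
        int value = b & 0x3F;
        int shift = 6;
        boolean more = (b & 0x40) != 0;
        for (int i = 1; more; i++) {
            b = in[i] & 0xFF;
            value |= (b & 0x7F) << shift;
            shift += 7;
            more = (b & 0x80) != 0;
        }
        return value;
    }

    public static void main(String[] args) {
        for (byte b : encode(64)) {
            System.out.printf("%02x ", b);   // prints: c0 01
        }
        System.out.println();
    }
}
```

With this layout, 1 encodes as the single byte 10000001, 64 spills into a second byte (11000000 00000001), and 0xFFFFFFFF needs all five bytes with only 5 payload bits used in the last one, which is the 6+7+7+7+5 split asked about in the review.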
------------- PR: https://git.openjdk.org/jdk/pull/10025 From duke at openjdk.org Wed Oct 19 19:43:34 2022 From: duke at openjdk.org (Joshua Cao) Date: Wed, 19 Oct 2022 19:43:34 GMT Subject: Integrated: 8255746: Make PrintCompilation available on a per method level In-Reply-To: References: Message-ID: On Wed, 12 Oct 2022 01:31:45 GMT, Joshua Cao wrote: > Example: > > > [~/jdk/jdk]$ build/linux-x86_64-server-fastdebug/jdk/bin/java -XX:CompileCommand=PrintCompilation,java.lang.String::length Hello > CompileCommand: PrintCompilation java/lang/String.length bool PrintCompilation = true > 223 12 3 java.lang.String::length (11 bytes) > 405 307 4 java.lang.String::length (11 bytes) > hello world > > > Running `java -XX:+PrintCompilation` still prints every method. This change also moves the declaration of `elapsedTimer`, but it should have insignificant impact on actual elapsed time. > > --- > > Additionally, I make a change to `test/lib-test/jdk/test/whitebox/vm_flags/BooleanTest.java` so that it does not depend on PrintCompilation. The test was failing because it updates global `PrintCompilation` during the middle of the run, but this does not change the value of `PrintCompilationOption` for individual CompileTask directives. > > I modified the test so that it is similar to other [WhiteBox vm_flag test](https://github.com/openjdk/jdk/tree/master/test/lib-test/jdk/test/whitebox/vm_flags). It still tests `VmFlagTest.WHITE_BOX::get/setBooleanVMFlag`, without having to depend on the behavior on the specific flag. This pull request has now been integrated. 
Changeset: f872467d Author: Joshua Cao Committer: Xin Liu URL: https://git.openjdk.org/jdk/commit/f872467d69a6d8442f8004609ce819641cab568b Stats: 224 lines in 8 files changed: 172 ins; 43 del; 9 mod 8255746: Make PrintCompilation available on a per method level Reviewed-by: chagedorn, kvn, xliu ------------- PR: https://git.openjdk.org/jdk/pull/10668 From dlong at openjdk.org Wed Oct 19 19:46:06 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Oct 2022 19:46:06 GMT Subject: RFR: 8295414: [Aarch64] C2: assert(false) failed: bad AD file [v2] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 07:36:01 GMT, Dean Long wrote: >> The "bad AD file" error is because PartialSubtypeCheck produces an iRegP_R5 result, which cannot be matched as an input where iRegP is expected. >> Update the test to reproduce the crash and updated iRegP rule to match iRegP_R5. >> The fact that this went so long without being noticed makes me wonder how much test coverage PartialSubtypeCheck has received on aarch64. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > allow test to run with release builds @theRealAph, do you agree with this fix? ------------- PR: https://git.openjdk.org/jdk/pull/10749 From eastigeevich at openjdk.org Wed Oct 19 20:36:07 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 19 Oct 2022 20:36:07 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v5] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 09:42:54 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. 
>> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > cleanup and rename src/hotspot/share/code/compressedStream.cpp line 190: > 188: write_byte_impl(bit1 | (next & 0x7f)); > 189: next >>= 7; > 190: } Could you please document the encoding/decoding schema in comments to the class? Could you also please include examples? Why is it chosen to split 32 bits into 6, 7, 7, 7 and 5? ------------- PR: https://git.openjdk.org/jdk/pull/10025 From xxinliu at amazon.com Thu Oct 20 00:27:36 2022 From: xxinliu at amazon.com (Liu, Xin) Date: Wed, 19 Oct 2022 17:27:36 -0700 Subject: RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: <1da07de9-90d2-d4ad-188e-d7d976009f52@oracle.com> References: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> <94A2737B-0FDE-4159-827F-15B6ACE1BC42@oracle.com> <114af950-f6b6-7e4a-8ac0-3da99bd40297@amazon.com> <2f29160c-7368-7c11-924e-a626e42c3aa2@amazon.com> <6d5c2aa5-c684-bc42-765d-ed116d3ef43c@oracle.com> <0bc75ee6-641f-1145-8fde-6d11e2ec887e@amazon.com> <1da07de9-90d2-d4ad-188e-d7d976009f52@oracle.com> Message-ID: <4768851c-2f3b-69be-ce28-070dae4792c7@amazon.com> Hi, I would like to update on this. I managed to get PEA to work in Vladimir Ivanov's testcase. I put the testcase, assembly and graphs here[1]. Even though it is quite a simple case, I think it demonstrates that the RFC is practical in C2. I proposed 3 major differences from Graal. 1. The algorithm runs in the parser instead of the optimizer. 2. Prefer a clone-and-eliminate strategy rather than virtualize-and-materialize. 3. Refrain from scalar replacement on-the-fly. The test exercises them all. I pasted 3 graphs here[2]. When we materialize an object, we just clone it with the right JVMState. It shows that C2 IterEA can automatically pick up the obsolete object and get rid of it, as we expected. It turns out cloning an object isn't as complex as I thought.
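As a concrete illustration of the clone-and-eliminate idea, here is the kind of shape being discussed. This is a made-up example (class and method names are hypothetical), not Vladimir Ivanov's actual testcase: the allocation escapes only on one branch, so the parser can clone the AllocateNode into that branch and let the usual C2 escape analysis eliminate the original on the hot path:

```java
public class PartialEscapeShape {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static volatile Object sink;        // a store to this field is a true escape

    static int hot(int x, int y, boolean rare) {
        Point p = new Point(x, y);      // p escapes only on the rare branch
        if (rare) {
            sink = p;                   // escape site: PEA would materialize
        }                               // (clone) the allocation right here
        return p.x + p.y;               // hot path: once the allocation is cloned
    }                                   // into the branch, no object is needed

    public static void main(String[] args) {
        int sum = 0;
        for (int i = 0; i < 100; i++) {
            sum += hot(i, i + 1, i % 50 == 0);
        }
        System.out.println(sum);
    }
}
```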
I mainly spent time on adjusting the JVMState for the cloned AllocateNode. Besides calling sync_jvms(), I also need to 1) kill dead locals, 2) clean the stack, and even avoid re-execution of that bci.

JVMState* jvms = parser->sync_jvms();
SafePointNode* map = jvms->map();
parser->kill_dead_locals();
parser->clean_stack(jvms->sp());
jvms->set_should_reexecute(false);

Clearly, the algorithm isn't complete yet. I am still working on the MergeProcessor, general class fields and loop constructs. I haven't figured out how to test PEA in a reliable way. It is not easy for the IR framework to capture node movement. If we measure allocation rate, it will be subject to CPU capability and also the sampling rate. I came up with an idea I call the 'Epsilon-Test'. We create a JVM with EpsilonGC and a fixed Java heap. Because EpsilonGC never replenishes the Java heap, we can count how many iterations a test can run before OOME. The less allocation a method makes, the more iterations of it HotSpot can execute. This isn't perfect either. I found that HotSpot can't guarantee to execute the finally-block in this case[3]. So far, I just measure execution time instead. I would appreciate your feedback, and please let me know if you spot any red flags. [1] https://gist.github.com/navyxliu/9c325d5c445899c02a0d115c6ca90a79 [2] https://gist.github.com/navyxliu/9c325d5c445899c02a0d115c6ca90a79?permalink_comment_id=4341838#gistcomment-4341838 [3] https://gist.github.com/navyxliu/9c325d5c445899c02a0d115c6ca90a79#file-example1-java-L43 thanks, --lx On 10/12/22 11:17 AM, Vladimir Kozlov wrote: > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > On 10/12/22 7:58 AM, Liu, Xin wrote: >> hi, Vladimir, >>> You should show that your implementation can rematirealize an object >> at any escape site. >> >> My understanding is I suppose to 'materialize' an object at any escape site.
> > Words ;^) > > Yes, I mistyped and misspelled. > > Vladimir K > >> >> 'rematerialize' refers to 'create an scalar-replaced object on heap' in >> deoptimization. It's for interpreter as if the object was created in the >> first place. It doesn't apply to an escaped object because it's marked >> 'GlobalEscaped' in C2 EA. >> >> >> Okay. I will try this idea! >> >> thanks, >> --lx >> >> >> >> >> On 10/11/22 3:12 PM, Vladimir Kozlov wrote: >>> Also in your test there should be no merge at safepoint2 because `obj` is "not alive" (not referenced) anymore. From xgong at openjdk.org Thu Oct 20 02:19:01 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 20 Oct 2022 02:19:01 GMT Subject: RFR: 8295662: jdk/incubator/vector tests fail "assert(VM_Version::supports_avx512vlbw()) failed" In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 16:37:13 GMT, Quan Anh Mai wrote: >> This patch removes the incorrectly used `evmovdqub` and uses `evmovdqul` because the former requires AVX512BW, which is unavailable on KNL settings. We have `C2_MacroAssembler::load_vector` already so we can use it here instead. >> >> Thanks a lot. > > Thanks very much for your reviews. Thanks for fixing the issue so quickly @merykitty !
------------- PR: https://git.openjdk.org/jdk/pull/10764 From bulasevich at openjdk.org Thu Oct 20 07:16:50 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 20 Oct 2022 07:16:50 GMT Subject: RFR: 8294460: CodeSection::alignment checks for CodeBuffer::SECT_STUBS incorrectly Message-ID: This is a fix for an apparent code bug:

inline int CodeSection::alignment(int section) {
  if (section == CodeBuffer::SECT_CONSTS) {
    return (int) sizeof(jdouble);
  }
  if (section == CodeBuffer::SECT_INSTS) {
    return (int) CodeEntryAlignment;
  }
  if (CodeBuffer::SECT_STUBS) { <--- here must be (section == CodeBuffer::SECT_STUBS) condition!
    // CodeBuffer installer expects sections to be HeapWordSize aligned
    return HeapWordSize;
  }
  ShouldNotReachHere();
  return 0;
}

Also, the section size initializer code is moved to initialize_misc() to fix the code path that works with uninitialized data. ------------- Commit messages: - 8294460: CodeSection::alignment checks for CodeBuffer::SECT_STUBS incorrectly Changes: https://git.openjdk.org/jdk/pull/10699/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10699&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8294460 Stats: 8 lines in 1 file changed: 4 ins; 3 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10699.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10699/head:pull/10699 PR: https://git.openjdk.org/jdk/pull/10699 From shade at openjdk.org Thu Oct 20 07:47:48 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 20 Oct 2022 07:47:48 GMT Subject: RFR: 8295709: Linux AArch64 builds broken after JDK-8294438 Message-ID: Currently fails with: ``` * For target hotspot_variant-server_libjvm_objs_assembler_aarch64.o: ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/10781/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10781&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295709 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch:
https://git.openjdk.org/jdk/pull/10781.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10781/head:pull/10781 PR: https://git.openjdk.org/jdk/pull/10781 From aph at openjdk.org Thu Oct 20 07:55:54 2022 From: aph at openjdk.org (Andrew Haley) Date: Thu, 20 Oct 2022 07:55:54 GMT Subject: RFR: 8295414: [Aarch64] C2: assert(false) failed: bad AD file [v2] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 07:36:01 GMT, Dean Long wrote: >> The "bad AD file" error is because PartialSubtypeCheck produces an iRegP_R5 result, which cannot be matched as an input where iRegP is expected. >> Update the test to reproduce the crash and updated iRegP rule to match iRegP_R5. >> The fact that this went so long without being noticed makes me wonder how much test coverage PartialSubtypeCheck has received on aarch64. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > allow test to run with release builds Marked as reviewed by aph (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10749 From dholmes at openjdk.org Thu Oct 20 07:56:19 2022 From: dholmes at openjdk.org (David Holmes) Date: Thu, 20 Oct 2022 07:56:19 GMT Subject: RFR: 8295709: Linux AArch64 builds broken after JDK-8294438 In-Reply-To: References: Message-ID: <7HIneDiZD1d9Vd2LP7EOPWf-gf_UnEL7CmupOJzX52g=.e67af897-41ae-473d-9bf5-df2b8822eda3@github.com> On Thu, 20 Oct 2022 07:40:21 GMT, Aleksey Shipilev wrote: > This seems to only manifest with GCC 11+: > > ``` > * For target hotspot_variant-server_libjvm_objs_assembler_aarch64.o: Looks good and trivial. Thanks for the quick fix. ------------- Marked as reviewed by dholmes (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10781 From aph at openjdk.org Thu Oct 20 07:56:19 2022 From: aph at openjdk.org (Andrew Haley) Date: Thu, 20 Oct 2022 07:56:19 GMT Subject: RFR: 8295709: Linux AArch64 builds broken after JDK-8294438 In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 07:40:21 GMT, Aleksey Shipilev wrote: > This seems to only manifest with GCC 11+: > > ``` > * For target hotspot_variant-server_libjvm_objs_assembler_aarch64.o: Yes, trivial. ------------- Marked as reviewed by aph (Reviewer). PR: https://git.openjdk.org/jdk/pull/10781 From shade at openjdk.org Thu Oct 20 07:59:31 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 20 Oct 2022 07:59:31 GMT Subject: RFR: 8295709: Linux AArch64 builds broken after JDK-8294438 In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 07:40:21 GMT, Aleksey Shipilev wrote: > This seems to only manifest with GCC 11+: > > ``` > * For target hotspot_variant-server_libjvm_objs_assembler_aarch64.o: Thanks and sorry for breakage. I am looking if there are more breakages on higher GCC versions for other arches too... ------------- PR: https://git.openjdk.org/jdk/pull/10781 From shade at openjdk.org Thu Oct 20 08:01:13 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 20 Oct 2022 08:01:13 GMT Subject: Integrated: 8295709: Linux AArch64 builds broken after JDK-8294438 In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 07:40:21 GMT, Aleksey Shipilev wrote: > This seems to only manifest with GCC 11+: > > ``` > * For target hotspot_variant-server_libjvm_objs_assembler_aarch64.o: This pull request has now been integrated. 
Changeset: 4f994c03 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/4f994c037023603bb1d1d94ad97aeb01ac604ebd Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8295709: Linux AArch64 builds broken after JDK-8294438 Reviewed-by: dholmes, aph ------------- PR: https://git.openjdk.org/jdk/pull/10781 From bulasevich at openjdk.org Thu Oct 20 12:04:32 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 20 Oct 2022 12:04:32 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: > The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. > > Testing: jtreg hotspot&jdk, Renaissance benchmarks Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: minor renaming. 
adding encoding examples table ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10025/files - new: https://git.openjdk.org/jdk/pull/10025/files/a461a10e..f365d780 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=04-05 Stats: 22 lines in 1 file changed: 17 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/10025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10025/head:pull/10025 PR: https://git.openjdk.org/jdk/pull/10025 From bulasevich at openjdk.org Thu Oct 20 12:04:36 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 20 Oct 2022 12:04:36 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v5] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 19:38:45 GMT, Evgeny Astigeevich wrote: >> Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: >> >> cleanup and rename > > src/hotspot/share/code/compressedStream.cpp line 182: > >> 180: return; >> 181: } >> 182: int bit0 = 0x80; // first byte upper bit is set to indicate a value is not zero > > This is untraditional and confusing. In a byte we are preparing it is bit 7. The same comment is about `bit1`. Ok > src/hotspot/share/code/compressedStream.cpp line 190: > >> 188: write_byte_impl(bit1 | (next & 0x7f)); >> 189: next >>= 7; >> 190: } > > Could you please document the encoding/decoding schema in comments to the class? Could you also please include examples? > Why is it chosen to split 32 bits into 6, 7, 7, 7 and 5? The first bit is occupied by a flag holding a non-zero state, additionally each byte has a last byte flag, leaving only 6/7/7/7/7 bits for the payload. 
I have added an examples table: // integer value encoded as a sequence of 1 to 5 bytes // - the most frequent case (0 < x < 64) is encoded in one byte // - the payload of the first byte is 6 bits, the payload of the following bytes is 7 bits // - the most significant bit in the first byte is occupied by a zero flag // - each byte has a bit indicating whether it is the last byte in the sequence // // value | byte0 | byte1 | byte2 | byte3 | byte4 // -----------+----------+----------+----------+----------+---------- // 0 | 0 | | | | // 1 | 10000001 | | | | // 2 | 10000010 | | | | // 63 | 10111111 | | | | // 64 | 11000000 | 00000001 | | | // 65 | 11000001 | 00000001 | | | // 8191 | 11111111 | 01111111 | | | // 8192 | 11000000 | 10000000 | 00000001 | | // 8193 | 11000001 | 10000000 | 00000001 | | // 1048575 | 11111111 | 11111111 | 01111111 | | // 1048576 | 11000000 | 10000000 | 10000000 | 00000001 | // 0xFFFFFFFF | 11111111 | 11111111 | 11111111 | 11111111 | 00011111 // ------------- PR: https://git.openjdk.org/jdk/pull/10025 From bkilambi at openjdk.org Thu Oct 20 14:39:58 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 20 Oct 2022 14:39:58 GMT Subject: RFR: 8295276: AArch64: Add backend support for half float conversion intrinsics Message-ID: <5AM4Pj8V60JHjfIHgbvE8FGx7BAyy2LmGnUkr3GWNMQ=.d138a971-fe0d-491a-887b-07c96fc03008@github.com> This patch adds aarch64 backend support for library intrinsics that implement conversions between half-precision and single-precision floats. 
Ran the following benchmarks to assess the performance with this patch -

org.openjdk.bench.java.math.Fp16ConversionBenchmark.floatToFloat16
org.openjdk.bench.java.math.Fp16ConversionBenchmark.float16ToFloat

The performance (ops/ms) gain with the patch on an ARM NEON machine is shown below -

Benchmark                                  Gain
Fp16ConversionBenchmark.float16ToFloat     3.42
Fp16ConversionBenchmark.floatToFloat16     5.85

------------- Commit messages: - 8295276: AArch64: Add backend support for half float conversion intrinsics Changes: https://git.openjdk.org/jdk/pull/10796/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10796&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295276 Stats: 658 lines in 4 files changed: 34 ins; 0 del; 624 mod Patch: https://git.openjdk.org/jdk/pull/10796.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10796/head:pull/10796 PR: https://git.openjdk.org/jdk/pull/10796 From phh at openjdk.org Thu Oct 20 17:38:54 2022 From: phh at openjdk.org (Paul Hohensee) Date: Thu, 20 Oct 2022 17:38:54 GMT Subject: RFR: 8294460: CodeSection::alignment checks for CodeBuffer::SECT_STUBS incorrectly In-Reply-To: References: Message-ID: <1ujR6IRVwKnF2qOUkvLZ03e4ck799QOzIIr0ornlJII=.067bbc63-1173-49d3-87ae-0491459fb4aa@github.com> On Thu, 13 Oct 2022 13:50:55 GMT, Boris Ulasevich wrote:

> This is a fix for an apparent code bug:
>
>
> inline int CodeSection::alignment(int section) {
> if (section == CodeBuffer::SECT_CONSTS) {
> return (int) sizeof(jdouble);
> }
> if (section == CodeBuffer::SECT_INSTS) {
> return (int) CodeEntryAlignment;
> }
> if (CodeBuffer::SECT_STUBS) { <--- here must be (section == CodeBuffer::SECT_STUBS) condition!
> // CodeBuffer installer expects sections to be HeapWordSize aligned
> return HeapWordSize;
> }
> ShouldNotReachHere();
> return 0;
> }
>
>
> Also, the section size initializer code is moved to initialize_misc() to fix the code path that works with uninitialized data.

Lgtm. ------------- Marked as reviewed by phh (Reviewer).
PR: https://git.openjdk.org/jdk/pull/10699 From kvn at openjdk.org Thu Oct 20 19:14:46 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 20 Oct 2022 19:14:46 GMT Subject: RFR: 8294460: CodeSection::alignment checks for CodeBuffer::SECT_STUBS incorrectly In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 13:50:55 GMT, Boris Ulasevich wrote: > This is a fix for an apparent code bug: > > > inline int CodeSection::alignment(int section) { > if (section == CodeBuffer::SECT_CONSTS) { > return (int) sizeof(jdouble); > } > if (section == CodeBuffer::SECT_INSTS) { > return (int) CodeEntryAlignment; > } > if (CodeBuffer::SECT_STUBS) { <--- here must be (section == CodeBuffer::SECT_STUBS) condition! > // CodeBuffer installer expects sections to be HeapWordSize aligned > return HeapWordSize; > } > ShouldNotReachHere(); > return 0; > } > > > Also, the section size initializer code is moved to initialize_misc() to fix the code path that works with uninitialized data. Good. I will test it. ------------- PR: https://git.openjdk.org/jdk/pull/10699 From rrich at openjdk.org Thu Oct 20 19:36:33 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 20 Oct 2022 19:36:33 GMT Subject: RFR: 8295413: com/sun/jdi/EATests.java fails with compiler flag -XX:+StressReflectiveCode Message-ID: With `StressReflectiveCode` C2 has inexact type information which can prevent ea based optimizations (see `ConnectionGraph::add_call_node()`) This pr changes the test jdk/com/sun/jdi/EATests.java to read the flag `StressReflectiveCode`. If enabled it shall neither expect ea based optimizations of allocations nor deoptimization of corresponding frames upon debugger access. Tested on the standard platforms with fastdebug and release builds. 
make test TEST=test/jdk/com/sun/jdi/EATests.java TEST_VM_OPTS="-XX:+IgnoreUnrecognizedVMOptions -XX:+StressReflectiveCode" ------------- Commit messages: - Adapt test to -XX:+StressReflectiveCode Changes: https://git.openjdk.org/jdk/pull/10769/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10769&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295413 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10769.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10769/head:pull/10769 PR: https://git.openjdk.org/jdk/pull/10769 From rrich at openjdk.org Thu Oct 20 19:36:34 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 20 Oct 2022 19:36:34 GMT Subject: RFR: 8295413: com/sun/jdi/EATests.java fails with compiler flag -XX:+StressReflectiveCode In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 15:43:17 GMT, Richard Reingruber wrote: > With `StressReflectiveCode` C2 has inexact type information which can prevent ea > based optimizations (see `ConnectionGraph::add_call_node()`) > > This pr changes the test jdk/com/sun/jdi/EATests.java to read the flag > `StressReflectiveCode`. If enabled it shall neither expect ea based optimizations > of allocations nor deoptimization of corresponding frames upon debugger access. > > Tested on the standard platforms with fastdebug and release builds. > > > make test TEST=test/jdk/com/sun/jdi/EATests.java TEST_VM_OPTS="-XX:+IgnoreUnrecognizedVMOptions -XX:+StressReflectiveCode" The tests haven't finished in 12h. Not sure if they ever will... Local testing (see above) was good though. 
------------- PR: https://git.openjdk.org/jdk/pull/10769 From lmesnik at openjdk.org Thu Oct 20 19:45:52 2022 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Thu, 20 Oct 2022 19:45:52 GMT Subject: RFR: 8295413: com/sun/jdi/EATests.java fails with compiler flag -XX:+StressReflectiveCode In-Reply-To: References: Message-ID: <3qvJzMyc66vLsJ9O8ulbcxq-MEtCS0JDtpbsz5hUUkk=.397a5bf6-45c7-4742-8287-ff801673aaf9@github.com> On Wed, 19 Oct 2022 15:43:17 GMT, Richard Reingruber wrote: > With `StressReflectiveCode` C2 has inexact type information which can prevent ea > based optimizations (see `ConnectionGraph::add_call_node()`) > > This pr changes the test jdk/com/sun/jdi/EATests.java to read the flag > `StressReflectiveCode`. If enabled it shall neither expect ea based optimizations > of allocations nor deoptimization of corresponding frames upon debugger access. > > Tested on the standard platforms with fastdebug and release builds. > > > make test TEST=test/jdk/com/sun/jdi/EATests.java TEST_VM_OPTS="-XX:+IgnoreUnrecognizedVMOptions -XX:+StressReflectiveCode" Marked as reviewed by lmesnik (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10769 From kvn at openjdk.org Thu Oct 20 19:54:52 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 20 Oct 2022 19:54:52 GMT Subject: RFR: 8295413: com/sun/jdi/EATests.java fails with compiler flag -XX:+StressReflectiveCode In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 15:43:17 GMT, Richard Reingruber wrote: > With `StressReflectiveCode` C2 has inexact type information which can prevent ea > based optimizations (see `ConnectionGraph::add_call_node()`) > > This pr changes the test jdk/com/sun/jdi/EATests.java to read the flag > `StressReflectiveCode`. If enabled it shall neither expect ea based optimizations > of allocations nor deoptimization of corresponding frames upon debugger access. > > Tested on the standard platforms with fastdebug and release builds. 
> > > make test TEST=test/jdk/com/sun/jdi/EATests.java TEST_VM_OPTS="-XX:+IgnoreUnrecognizedVMOptions -XX:+StressReflectiveCode" Should we add check for `StressReflectiveCode` to all `shouldSkip()` methods too? Similar to `DeoptimizeObjectsALot` flag. ------------- PR: https://git.openjdk.org/jdk/pull/10769 From rrich at openjdk.org Thu Oct 20 20:06:53 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 20 Oct 2022 20:06:53 GMT Subject: RFR: 8295413: com/sun/jdi/EATests.java fails with compiler flag -XX:+StressReflectiveCode In-Reply-To: References: Message-ID: <7q1G-wQFSnuFcRTY2xWikM3LQkj9kMJb5wAzrmBU6-c=.c6f5811b-9aad-4d86-8342-fc8aac7a5419@github.com> On Thu, 20 Oct 2022 19:52:39 GMT, Vladimir Kozlov wrote: > Should we add check for `StressReflectiveCode` to all `shouldSkip()` methods too? Similar to `DeoptimizeObjectsALot` flag. I'd be ok to skip all testcases (by checking for `StressReflectiveCode` in the base classes' `shouldSkip()`). The tests' purpose is to check if ea based optimizations are reverted appropriately. Doesn't help much to run them with `StressReflectiveCode` except for random fuzzing. Would you like me to skip all test cases if running with `StressReflectiveCode`? ------------- PR: https://git.openjdk.org/jdk/pull/10769 From kvn at openjdk.org Thu Oct 20 20:13:47 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 20 Oct 2022 20:13:47 GMT Subject: RFR: 8295413: com/sun/jdi/EATests.java fails with compiler flag -XX:+StressReflectiveCode In-Reply-To: <7q1G-wQFSnuFcRTY2xWikM3LQkj9kMJb5wAzrmBU6-c=.c6f5811b-9aad-4d86-8342-fc8aac7a5419@github.com> References: <7q1G-wQFSnuFcRTY2xWikM3LQkj9kMJb5wAzrmBU6-c=.c6f5811b-9aad-4d86-8342-fc8aac7a5419@github.com> Message-ID: On Thu, 20 Oct 2022 20:04:23 GMT, Richard Reingruber wrote: > > Should we add check for `StressReflectiveCode` to all `shouldSkip()` methods too? Similar to `DeoptimizeObjectsALot` flag. 
> > I'd be ok to skip all testcases (by checking for `StressReflectiveCode` in the base classes' `shouldSkip()`). The tests' purpose is to check if ea based optimizations are reverted appropriately. Doesn't help much to run them with `StressReflectiveCode` except for random fuzzing. > > Would you like me to skip all test cases if running with `StressReflectiveCode`? Yes. If this flag essentially disables EA I don't see a reason to run EA related tests. ------------- PR: https://git.openjdk.org/jdk/pull/10769 From rrich at openjdk.org Thu Oct 20 20:55:12 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 20 Oct 2022 20:55:12 GMT Subject: RFR: 8295413: com/sun/jdi/EATests.java fails with compiler flag -XX:+StressReflectiveCode [v2] In-Reply-To: References: Message-ID: > With `StressReflectiveCode` C2 has inexact type information which can prevent ea > based optimizations (see `ConnectionGraph::add_call_node()`) > > This pr changes the test jdk/com/sun/jdi/EATests.java to read the flag > `StressReflectiveCode`. If enabled it shall neither expect ea based optimizations > of allocations nor deoptimization of corresponding frames upon debugger access. > > Tested on the standard platforms with fastdebug and release builds. 
> > > make test TEST=test/jdk/com/sun/jdi/EATests.java TEST_VM_OPTS="-XX:+IgnoreUnrecognizedVMOptions -XX:+StressReflectiveCode" Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: Skip all test cases if StressReflectiveCode is enabled ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10769/files - new: https://git.openjdk.org/jdk/pull/10769/files/9071e5b4..c8c346f8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10769&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10769&range=00-01 Stats: 16 lines in 1 file changed: 15 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10769.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10769/head:pull/10769 PR: https://git.openjdk.org/jdk/pull/10769 From dlong at openjdk.org Thu Oct 20 21:24:46 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 20 Oct 2022 21:24:46 GMT Subject: RFR: 8295414: [Aarch64] C2: assert(false) failed: bad AD file [v2] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 07:36:01 GMT, Dean Long wrote: >> The "bad AD file" error is because PartialSubtypeCheck produces an iRegP_R5 result, which cannot be matched as an input where iRegP is expected. >> Update the test to reproduce the crash and updated iRegP rule to match iRegP_R5. >> The fact that this went so long without being noticed makes me wonder how much test coverage PartialSubtypeCheck has received on aarch64. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > allow test to run with release builds Thanks Andrew. 
------------- PR: https://git.openjdk.org/jdk/pull/10749 From dlong at openjdk.org Thu Oct 20 21:28:50 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 20 Oct 2022 21:28:50 GMT Subject: Integrated: 8295414: [Aarch64] C2: assert(false) failed: bad AD file In-Reply-To: References: Message-ID: On Tue, 18 Oct 2022 18:52:43 GMT, Dean Long wrote: > The "bad AD file" error is because PartialSubtypeCheck produces an iRegP_R5 result, which cannot be matched as an input where iRegP is expected. > Update the test to reproduce the crash and updated iRegP rule to match iRegP_R5. > The fact that this went so long without being noticed makes me wonder how much test coverage PartialSubtypeCheck has received on aarch64. This pull request has now been integrated. Changeset: d3eba859 Author: Dean Long URL: https://git.openjdk.org/jdk/commit/d3eba859f9c87465a8f1c0dfd6dd5aef368d5853 Stats: 7 lines in 2 files changed: 6 ins; 0 del; 1 mod 8295414: [Aarch64] C2: assert(false) failed: bad AD file Reviewed-by: kvn, chagedorn, aph ------------- PR: https://git.openjdk.org/jdk/pull/10749 From kvn at openjdk.org Thu Oct 20 22:06:52 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 20 Oct 2022 22:06:52 GMT Subject: RFR: 8295413: com/sun/jdi/EATests.java fails with compiler flag -XX:+StressReflectiveCode [v2] In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 20:55:12 GMT, Richard Reingruber wrote: >> With `StressReflectiveCode` C2 has inexact type information which can prevent ea >> based optimizations (see `ConnectionGraph::add_call_node()`) >> >> This pr changes the test jdk/com/sun/jdi/EATests.java to read the flag >> `StressReflectiveCode`. If enabled it shall neither expect ea based optimizations >> of allocations nor deoptimization of corresponding frames upon debugger access. >> >> Tested on the standard platforms with fastdebug and release builds. 
>> >> >> make test TEST=test/jdk/com/sun/jdi/EATests.java TEST_VM_OPTS="-XX:+IgnoreUnrecognizedVMOptions -XX:+StressReflectiveCode" > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Skip all test cases if StressReflectiveCode is enabled Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10769 From vladimir.x.ivanov at oracle.com Fri Oct 21 00:26:56 2022 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 20 Oct 2022 17:26:56 -0700 Subject: [EXTERNAL][EXTERNAL]RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: <4768851c-2f3b-69be-ce28-070dae4792c7@amazon.com> References: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> <94A2737B-0FDE-4159-827F-15B6ACE1BC42@oracle.com> <114af950-f6b6-7e4a-8ac0-3da99bd40297@amazon.com> <2f29160c-7368-7c11-924e-a626e42c3aa2@amazon.com> <6d5c2aa5-c684-bc42-765d-ed116d3ef43c@oracle.com> <0bc75ee6-641f-1145-8fde-6d11e2ec887e@amazon.com> <1da07de9-90d2-d4ad-188e-d7d976009f52@oracle.com> <4768851c-2f3b-69be-ce28-070dae4792c7@amazon.com> Message-ID: <4127d57d-ca6f-0cda-13a8-efbdd2ef0501@oracle.com> Hi, > I would like to update on this. I manage to get PEA work in Vladimir > Ivanov's testcase. I put the testcase, assembly and graphs here[1]. > > Even though it is a quite simple case, I think it demonstrates that the > RFC is practical in C2. I proposed 3 major differences from Graal. Nice! Also, a very similar (but a much more popular case) should be escape sites in catch blocks (as reported by JDK-8267532 [1]). > 1. The algorithm runs in parser instead of optimizer. > 2. Prefer clone-and-eliminate strategy rather than > virtualize-and-materialize. > 3. Refrain from scalar replacement on-the-fly. I don't understand how you plan to implement it solely during parsing. You could do some bookkeeping during parsing and capture JVM state, but I don't see how to do EA that early. Also, please, elaborate on #3. 
It's not clear to me what you mean there. > The test exercises them all. I pasted 3 graphs here[2]. When we > materialize an object, we just clone it with the right JVMState. It > shows that C2 IterEA can automatically pick up the obsolete object and > get rid of it, as we expected. > > It turns out cloning an object isn't as complex as I thought. I mainly > spent time on adjusting JVMState for the cloned AllocateNode. Not only > to call sync_jvm(), I also need to 1) kill dead locals 2) clean stack > and even avoid reexecution that bci. > > JVMState* jvms = parser->sync_jvms(); > SafePointNode* map = jvms->map(); > parser->kill_dead_locals(); > parser->clean_stack(jvms->sp()); > jvms->set_should_reexecute(false); > > Clearly, the algorithm hasn't completed yet. I am still working on > MergeProcessor, general classes fields and loop construct. There was a previous discussion on PEA for C2 back in 2021 [2] [3]. One interesting observation related to your current experiments was: "4. Escape sites separate the graph into 2 parts: before and after the instance escapes. In order to preserve identity invariants (and avoid identity paradoxes), PEA can't just put an allocation at every escape site. It should respect the order of escape events and ensure that the very same object is observed when multiple escape events happen. Dynamic invariant can be formulated as: there should never be more than 1 allocation at runtime per 1 eliminated allocation. Considering non-escaping operations can force materialization on their own, it poses additional constraints." So, when you clone an allocation, you should ensure that only a single instance can be observed. And safepoints can be escape points as well (rematerialization in case of deoptimization event). > I haven't figured out how to test PEA in a reliable way. It is not easy > for the IR framework to capture node movement. If we measure allocation > rate, it will be subject to CPU capability and also the sampling rate.
I > came up with an idea so-called 'Epsilon-Test'. We create a JVM with > EpsilonGC and a fixed Java heap. Because EpsilonGC never replenish the > java heap, we can count how many iterations a test can run before OOME. > The less allocation made in a method, the more iterations HotSpot can > execute the method. This isn't perfect either. I found that hotspot > can't guarantee to execute the final-block in this case[3]. So far, I > just measure execution time instead. It sounds more like a job for benchmarks, but focused on measuring allocation rate (per iteration). ("-prof gc" mode in JMH terms.) Personally, I very much liked the IR framework-based approach Cesar used in the unit test for allocation merges [4]. Do you see any problems with that? Best regards, Vladimir Ivanov [1] https://bugs.openjdk.org/browse/JDK-8267532 [2] https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2021-May/047486.html [3] https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2021-May/047536.html [4] https://github.com/openjdk/jdk/pull/9073 > > Appreciate your feedbacks or you spot any redflag. > > [1] https://gist.github.com/navyxliu/9c325d5c445899c02a0d115c6ca90a79 > > [2] > https://gist.github.com/navyxliu/9c325d5c445899c02a0d115c6ca90a79?permalink_comment_id=4341838#gistcomment-4341838 > > [3]?https://gist.github.com/navyxliu/9c325d5c445899c02a0d115c6ca90a79#file-example1-java-L43 > > thanks, > --lx > > > > > On 10/12/22 11:17 AM, Vladimir Kozlov wrote: >> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >> >> >> >> On 10/12/22 7:58 AM, Liu, Xin wrote: >>> hi, Vladimir, >>>> You should show that your implementation can rematirealize an object >>> at any escape site. >>> >>> My understanding is I suppose to 'materialize' an object at any escape site. >> >> Words ;^) >> >> Yes, I mistyped and misspelled. 
>> >> Vladimir K >> >>> >>> 'rematerialize' refers to 'create an scalar-replaced object on heap' in >>> deoptimization. It's for interpreter as if the object was created in the >>> first place. It doesn't apply to an escaped object because it's marked >>> 'GlobalEscaped' in C2 EA. >>> >>> >>> Okay. I will try this idea! >>> >>> thanks, >>> --lx >>> >>> >>> >>> >>> On 10/11/22 3:12 PM, Vladimir Kozlov wrote: >>>> Also in your test there should be no merge at safepoint2 because `obj` is "not alive" (not referenced) anymore. From kvn at openjdk.org Fri Oct 21 02:04:54 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 21 Oct 2022 02:04:54 GMT Subject: RFR: 8294460: CodeSection::alignment checks for CodeBuffer::SECT_STUBS incorrectly In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 13:50:55 GMT, Boris Ulasevich wrote: > This is a fix for an apparent code bug: > > > inline int CodeSection::alignment(int section) { > if (section == CodeBuffer::SECT_CONSTS) { > return (int) sizeof(jdouble); > } > if (section == CodeBuffer::SECT_INSTS) { > return (int) CodeEntryAlignment; > } > if (CodeBuffer::SECT_STUBS) { <--- here must be (section == CodeBuffer::SECT_STUBS) condition! > // CodeBuffer installer expects sections to be HeapWordSize aligned > return HeapWordSize; > } > ShouldNotReachHere(); > return 0; > } > > > Also, the section size initializer code is moved to initialize_misc() to fix the code path that works with uninitialized data. My testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10699 From jiefu at openjdk.org Fri Oct 21 04:02:47 2022 From: jiefu at openjdk.org (Jie Fu) Date: Fri, 21 Oct 2022 04:02:47 GMT Subject: RFR: 8295762: [Vector API] Update generate_iota_indices for x86_32 after JDK-8293409 Message-ID: Hi all, Many vector api tests fail on x86_32 after JDK-8293409 due to computing incorrect results. The reason is that `generate_iota_indices` was updated only for x86_64 in JDK-8293409. 
So let's fix it for x86_32. Testing: - vector api tests on x86_32, all passed Thanks. Best regards, Jie ------------- Commit messages: - 8295762: [Vector API] Update generate_iota_indices for x86_32 after JDK-8293409 Changes: https://git.openjdk.org/jdk/pull/10807/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10807&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295762 Stats: 91 lines in 1 file changed: 91 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10807.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10807/head:pull/10807 PR: https://git.openjdk.org/jdk/pull/10807 From xgong at openjdk.org Fri Oct 21 04:10:48 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 21 Oct 2022 04:10:48 GMT Subject: RFR: 8295762: [Vector API] Update generate_iota_indices for x86_32 after JDK-8293409 In-Reply-To: References: Message-ID: On Fri, 21 Oct 2022 03:55:50 GMT, Jie Fu wrote: > Hi all, > > Many vector api tests fail on x86_32 after JDK-8293409 due to computing incorrect results. > The reason is that `generate_iota_indices` was updated only for x86_64 in JDK-8293409. > So let's fix it for x86_32. > > Testing: > - vector api tests on x86_32, all passed > > Thanks. > Best regards, > Jie Thanks for fixing it @DamonFool ! The change looks good to me! ------------- PR: https://git.openjdk.org/jdk/pull/10807 From xgong at openjdk.org Fri Oct 21 04:18:53 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 21 Oct 2022 04:18:53 GMT Subject: RFR: 8295762: [Vector API] Update generate_iota_indices for x86_32 after JDK-8293409 In-Reply-To: References: Message-ID: On Fri, 21 Oct 2022 03:55:50 GMT, Jie Fu wrote: > Hi all, > > Many vector api tests fail on x86_32 after JDK-8293409 due to computing incorrect results. > The reason is that `generate_iota_indices` was updated only for x86_64 in JDK-8293409. > So let's fix it for x86_32. > > Testing: > - vector api tests on x86_32, all passed > > Thanks. 
> Best regards, > Jie Marked as reviewed by xgong (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/10807 From kvn at openjdk.org Fri Oct 21 04:34:14 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 21 Oct 2022 04:34:14 GMT Subject: RFR: 8295762: [Vector API] Update generate_iota_indices for x86_32 after JDK-8293409 In-Reply-To: References: Message-ID: On Fri, 21 Oct 2022 03:55:50 GMT, Jie Fu wrote: > Hi all, > > Many vector api tests fail on x86_32 after JDK-8293409 due to computing incorrect results. > The reason is that `generate_iota_indices` was updated only for x86_64 in JDK-8293409. > So let's fix it for x86_32. > > Testing: > - vector api tests on x86_32, all passed > > Thanks. > Best regards, > Jie I can't comment on the change. I assume it is a copy of the 64-bit code. But I am starting to be concerned about Vector API changes causing issues which were not caught during pre-integration testing. Unfortunately these tests run in [jdk_tier3](https://github.com/openjdk/jdk/blob/master/test/jdk/TEST.groups#L73) only. As a result they are not part of GitHub Actions testing. And in Oracle we don't test 32-bit. May I suggest, in addition to the currently run `tier1_part*` in GHA, adding `jdk_vector` to it. I looked at our internal testing times and all 3 `tier1_part*` and `jdk_vector` took about 5 min to run. ------------- PR: https://git.openjdk.org/jdk/pull/10807 From jiefu at openjdk.org Fri Oct 21 04:41:49 2022 From: jiefu at openjdk.org (Jie Fu) Date: Fri, 21 Oct 2022 04:41:49 GMT Subject: RFR: 8295762: [Vector API] Update generate_iota_indices for x86_32 after JDK-8293409 In-Reply-To: References: Message-ID: On Fri, 21 Oct 2022 04:08:46 GMT, Xiaohong Gong wrote: > Thanks for fixing it @DamonFool ! The change looks good to me! Thanks @XiaohongGong for your review.
------------- PR: https://git.openjdk.org/jdk/pull/10807 From jiefu at openjdk.org Fri Oct 21 04:46:50 2022 From: jiefu at openjdk.org (Jie Fu) Date: Fri, 21 Oct 2022 04:46:50 GMT Subject: RFR: 8295762: [Vector API] Update generate_iota_indices for x86_32 after JDK-8293409 In-Reply-To: References: Message-ID: On Fri, 21 Oct 2022 04:31:09 GMT, Vladimir Kozlov wrote: > I can't comment on change. I assume it is copy from 64-bit code. Yes, it was copied from 64-bit code. > But I am starting to concern about Vector API changes causing issues which were not caught during pre-integration testing. Unfortunately these tests run in [jdk_tier3](https://github.com/openjdk/jdk/blob/master/test/jdk/TEST.groups#L73) only. And as result are not part of GitHub Action testing. And in Oracle we don't test 32-bit. > > May I suggest in addition to currently run `tier1_part*` in GHA add `jdk_vector` to it. I looked on our internal testing times and all 3 `tier1_part*` and `jdk_vector` took about 5 min to run. Sounds good to me. How about adding the vector api tests in GHA in a separate PR? Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10807 From kvn at openjdk.org Fri Oct 21 05:23:11 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 21 Oct 2022 05:23:11 GMT Subject: RFR: 8295762: [Vector API] Update generate_iota_indices for x86_32 after JDK-8293409 In-Reply-To: References: Message-ID: On Fri, 21 Oct 2022 03:55:50 GMT, Jie Fu wrote: > Hi all, > > Many vector api tests fail on x86_32 after JDK-8293409 due to computing incorrect results. > The reason is that `generate_iota_indices` was updated only for x86_64 in JDK-8293409. > So let's fix it for x86_32. > > Testing: > - vector api tests on x86_32, all passed > > Thanks. > Best regards, > Jie Marked as reviewed by kvn (Reviewer). 
------------- PR: https://git.openjdk.org/jdk/pull/10807 From kvn at openjdk.org Fri Oct 21 05:23:12 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 21 Oct 2022 05:23:12 GMT Subject: RFR: 8295762: [Vector API] Update generate_iota_indices for x86_32 after JDK-8293409 In-Reply-To: References: Message-ID: On Fri, 21 Oct 2022 04:43:10 GMT, Jie Fu wrote: > How about adding the vector api tests in GHA in a separate PR? Yes, it is definitely separate changes. ------------- PR: https://git.openjdk.org/jdk/pull/10807 From sspitsyn at openjdk.org Fri Oct 21 06:36:50 2022 From: sspitsyn at openjdk.org (Serguei Spitsyn) Date: Fri, 21 Oct 2022 06:36:50 GMT Subject: RFR: 8295413: com/sun/jdi/EATests.java fails with compiler flag -XX:+StressReflectiveCode [v2] In-Reply-To: References: Message-ID: <_i0klHTAL37AN_8B1oG1HX2D41PR4Josg_PMV9w-cQg=.a375e73a-5c71-4785-94b8-36bbfa405e62@github.com> On Thu, 20 Oct 2022 20:55:12 GMT, Richard Reingruber wrote: >> With `StressReflectiveCode` C2 has inexact type information which can prevent ea >> based optimizations (see `ConnectionGraph::add_call_node()`) >> >> This pr changes the test jdk/com/sun/jdi/EATests.java to read the flag >> `StressReflectiveCode`. If enabled it shall neither expect ea based optimizations >> of allocations nor deoptimization of corresponding frames upon debugger access. >> >> Tested on the standard platforms with fastdebug and release builds. >> >> >> make test TEST=test/jdk/com/sun/jdi/EATests.java TEST_VM_OPTS="-XX:+IgnoreUnrecognizedVMOptions -XX:+StressReflectiveCode" > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Skip all test cases if StressReflectiveCode is enabled Looks good. Thanks, Serguei ------------- Marked as reviewed by sspitsyn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10769 From bulasevich at openjdk.org Fri Oct 21 08:54:47 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 21 Oct 2022 08:54:47 GMT Subject: RFR: 8294460: CodeSection::alignment checks for CodeBuffer::SECT_STUBS incorrectly In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 13:50:55 GMT, Boris Ulasevich wrote: > This is a fix for an apparent code bug: > > > inline int CodeSection::alignment(int section) { > if (section == CodeBuffer::SECT_CONSTS) { > return (int) sizeof(jdouble); > } > if (section == CodeBuffer::SECT_INSTS) { > return (int) CodeEntryAlignment; > } > if (CodeBuffer::SECT_STUBS) { <--- here must be (section == CodeBuffer::SECT_STUBS) condition! > // CodeBuffer installer expects sections to be HeapWordSize aligned > return HeapWordSize; > } > ShouldNotReachHere(); > return 0; > } > > > Also, the section size initializer code is moved to initialize_misc() to fix the code path that works with uninitialized data. thanks! ------------- PR: https://git.openjdk.org/jdk/pull/10699 From bulasevich at openjdk.org Fri Oct 21 09:00:03 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 21 Oct 2022 09:00:03 GMT Subject: Integrated: 8294460: CodeSection::alignment checks for CodeBuffer::SECT_STUBS incorrectly In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 13:50:55 GMT, Boris Ulasevich wrote: > This is a fix for an apparent code bug: > > > inline int CodeSection::alignment(int section) { > if (section == CodeBuffer::SECT_CONSTS) { > return (int) sizeof(jdouble); > } > if (section == CodeBuffer::SECT_INSTS) { > return (int) CodeEntryAlignment; > } > if (CodeBuffer::SECT_STUBS) { <--- here must be (section == CodeBuffer::SECT_STUBS) condition! 
> // CodeBuffer installer expects sections to be HeapWordSize aligned > return HeapWordSize; > } > ShouldNotReachHere(); > return 0; > } > > > Also, the section size initializer code is moved to initialize_misc() to fix the code path that works with uninitialized data. This pull request has now been integrated. Changeset: 50647187 Author: Boris Ulasevich URL: https://git.openjdk.org/jdk/commit/50647187e8b0314ad67b0767f71c56fd50e8feaf Stats: 8 lines in 1 file changed: 4 ins; 3 del; 1 mod 8294460: CodeSection::alignment checks for CodeBuffer::SECT_STUBS incorrectly Reviewed-by: phh, kvn ------------- PR: https://git.openjdk.org/jdk/pull/10699 From thartmann at openjdk.org Fri Oct 21 10:00:48 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 21 Oct 2022 10:00:48 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 21:28:26 GMT, vpaprotsk wrote: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 
390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s I executed some quick testing and this fails with: [2022-10-21T09:54:28,696Z] # A fatal error has been detected by the Java Runtime Environment: [2022-10-21T09:54:28,696Z] # [2022-10-21T09:54:28,696Z] # Internal Error (/opt/mach5/mesos/work_dir/slaves/0c72054a-24ab-4dbb-944f-97f9341a1b96-S8380/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/5903b026-cdbd-4aa4-8433-6a45fb7ee593/runs/f75b29aa-40ef-46a5-b323-3a80aaa9aa6b/workspace/open/src/hotspot/cpu/x86/assembler_x86.cpp:5358), pid=2385300, tid=2385302 [2022-10-21T09:54:28,696Z] # Error: assert(vector_len == AVX_128bit ? VM_Version::supports_avx() : vector_len == AVX_256bit ? VM_Version::supports_avx2() : vector_len == AVX_512bit ? 
VM_Version::supports_avx512bw() : 0) failed [2022-10-21T09:54:28,696Z] # [2022-10-21T09:54:28,696Z] # JRE version: (20.0) (fastdebug build ) [2022-10-21T09:54:28,696Z] # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 20-internal-2022-10-21-0733397.tobias.hartmann.jdk2, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) [2022-10-21T09:54:28,696Z] # Problematic frame: [2022-10-21T09:54:28,696Z] # V [libjvm.so+0x6e3bf0] Assembler::vpslldq(XMMRegister, XMMRegister, int, int)+0x190 ------------- PR: https://git.openjdk.org/jdk/pull/10582 From tholenstein at openjdk.org Fri Oct 21 10:49:21 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 21 Oct 2022 10:49:21 GMT Subject: RFR: JDK-8294565: IGV: ClassCastException when clicking on an edge in the graph Message-ID: IGV crashed when the user clicked on any edge in the graph because the `select` method of the `SelectProvider` in `LineWidget.java` was faulty. # Implementation - `ActionFactory.createSelectAction` was changed to `CustomSelectAction` since it also supports inverting the selection with `Ctrl/CMD` - The `select` method gets called when the user clicks on an edge. `LineWidget` represents a single connection going out of a node and connecting to one or more nodes (one-to-many). `LineWidget` has a list of `Connection`s that each represent a single link between two nodes (one-to-one). For each `connection` we collect the from/to `Vertex` (superclass for a node) and put them into a set that we then use to select the nodes that are connected to the user-clicked `LineWidget`.
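The one-to-many collection step described above can be sketched as follows. This is a simplified, self-contained Java model for illustration only — `Vertex` and `Connection` here are made-up records standing in for the IGV classes, not the actual IGV source:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EdgeSelectionSketch {

    // Stand-ins for IGV's Vertex (a node) and a one-to-one Connection.
    record Vertex(String name) {}
    record Connection(Vertex from, Vertex to) {}

    // For each connection routed through the clicked LineWidget, collect
    // its from/to vertices into one set, as the fixed select() is
    // described to do.
    static Set<Vertex> verticesToSelect(List<Connection> connections) {
        Set<Vertex> selected = new HashSet<>();
        for (Connection c : connections) {
            selected.add(c.from());
            selected.add(c.to());
        }
        return selected;
    }

    public static void main(String[] args) {
        Vertex src = new Vertex("src");
        Vertex a = new Vertex("a");
        Vertex b = new Vertex("b");
        // One LineWidget carrying the one-to-many edge src -> {a, b}.
        Set<Vertex> sel = verticesToSelect(
                List.of(new Connection(src, a), new Connection(src, b)));
        System.out.println(sel.size()); // prints 3: src, a and b
    }
}
```

Using a set means the shared source vertex is selected only once even though it appears in every one-to-one connection of the widget.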
------------- Commit messages: - refactor related code - IGV: ClassCastException when clicking on an edge in the graph Changes: https://git.openjdk.org/jdk/pull/10760/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10760&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8294565 Stats: 88 lines in 2 files changed: 18 ins; 15 del; 55 mod Patch: https://git.openjdk.org/jdk/pull/10760.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10760/head:pull/10760 PR: https://git.openjdk.org/jdk/pull/10760 From tholenstein at openjdk.org Fri Oct 21 12:55:16 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 21 Oct 2022 12:55:16 GMT Subject: RFR: JDK-8290010: IGV: Fix UndoRedo Action Message-ID: # Problem - The history of the Undo Action did not start at zero when opening a new graph. - In the "Show sea of nodes" view you could press Undo once before doing any changes. - In the "Show control view graph" you could even go back many times when opening a new graph, ending up in weird states of the graph. - Selecting a node should not be added to the history. You could select X different nodes in a row and then go back X times in history without anything having changed. # Overview UndoRedo Different `ChangedEvent` events are fired in `DiagramViewModel` that trigger the `DiagramScene` to update its state. After those updates `DiagramScene` calls `addUndo()` to save the new state of the `DiagramViewModel` as well as the scrolling position of the `DiagramScene` in a `DiagramUndoRedo` object. `undo()` / `redo()` can then be used to restore the saved state. One problem was that selecting nodes triggered a `ChangedEvent` event that added a recording to the history, but the selection itself was not part of the saved state. Another problem was that switching between different views (CFG view, sea-of-nodes view) as well as applying filters also triggered an `addUndo()`, even though the recording of filters and view switches is not implemented.
# Solution We now only record when the graph itself changes (not the view) - like when opening a new graph or creating a difference graph. And we also record changes in the selection of the visible nodes. We refactored the `ChangedEvent` events in `DiagramViewModel`: - `graphChangedEvent` now is fired when the graph changes and triggers an `addUndo()` - `hiddenNodesChangedEvent` is now fired when the selection of visible nodes changes and also triggers an `addUndo()` `DiagramUndoRedo` previously stored a deep copy of the `DiagramViewModel` for every datapoint in the history. Besides using a lot of memory, most of the stored objects were redundant or not used. Now, we introduced a new `ModelState` object that stores the id of the visible nodes in `Set hiddenNodes` as well as the opened graph. For the graph we only need `int firstPos` and `int secondPos` which indicate the position of the difference graph within the group. If we don't use a difference graph `firstPos` == `secondPos`. ------------- Commit messages: - JDK-8290010: IGV: Fix UndoRedo Action Changes: https://git.openjdk.org/jdk/pull/10813/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10813&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290010 Stats: 293 lines in 6 files changed: 41 ins; 172 del; 80 mod Patch: https://git.openjdk.org/jdk/pull/10813.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10813/head:pull/10813 PR: https://git.openjdk.org/jdk/pull/10813 From tholenstein at openjdk.org Fri Oct 21 15:07:16 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 21 Oct 2022 15:07:16 GMT Subject: RFR: JDK-8295461: IGV: Wrong src/dest nodes highlighted for edge Message-ID: <1ZYB8cu42bWNWIv0w8B69kHk-Vva8FiakNV06HUxM7k=.855fa4ed-c31e-4899-8f4a-4eea3b3c010e@github.com> Outgoing edges in IGV are organized like trees. 
When hovering with the mouse over a segment of an edge, the edge is highlighted all the way up to the source node, as well as all the way down to the leaf nodes. This works as expected. The nodes at the source and leaves of the highlighted segments are highlighted as well. There is a bug: instead of only highlighting the leaf nodes in the subtree, IGV highlights all leaf nodes: ![before](https://user-images.githubusercontent.com/71546117/197223077-962a5d97-c1c8-4720-9983-295e07468be9.png) # Solution The segments in the edge tree are `LineWidget` objects. Each `LineWidget` has a single `LineWidget` `predecessor` and a list of `LineWidget` `successors`. When hovering over a line segment, the `notifyStateChanged` function is called in the corresponding `LineWidget`: `predecessor` and `successors` `LineWidget`s are recursively visited and highlighted here. The nodes (of super type `Vertex`) are now highlighted with a new `highlightVertices` function instead of using recursion. `highlightVertices(boolean enable)` uses the list of `connections`, which already contains all the one-to-one connections between src/dest nodes that go through a single `LineWidget` segment. This gives us the `Vertex` nodes of the root as well as the leaves in the subtree of the hovered `LineWidget` segment. 
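The recursive segment traversal described above can be illustrated with a small sketch. The `Segment` type is a hypothetical stand-in for IGV's `LineWidget`, showing why walking only the predecessor chain and the successor subtree highlights the hovered path without touching sibling subtrees:

```java
import java.util.ArrayList;
import java.util.List;

public class HighlightSketch {
    // Hypothetical stand-in for IGV's LineWidget tree: one predecessor,
    // many successors.
    static class Segment {
        Segment predecessor;
        final List<Segment> successors = new ArrayList<>();
        boolean highlighted;
    }

    // Walk from the hovered segment up to the source...
    static void highlightUp(Segment s) {
        s.highlighted = true;
        if (s.predecessor != null) highlightUp(s.predecessor);
    }

    // ...and down through the hovered segment's own subtree only.
    static void highlightDown(Segment s) {
        s.highlighted = true;
        for (Segment next : s.successors) highlightDown(next);
    }

    public static void main(String[] args) {
        Segment root = new Segment(), left = new Segment(), right = new Segment();
        left.predecessor = root;
        right.predecessor = root;
        root.successors.add(left);
        root.successors.add(right);
        highlightUp(left);   // hover over "left"
        highlightDown(left);
        // root and left are highlighted; the sibling subtree "right" is not.
        System.out.println(root.highlighted + " " + left.highlighted + " " + right.highlighted);
    }
}
```

The buggy behavior corresponds to highlighting every leaf reachable from the root instead of only those below the hovered segment.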
Now the highlighting of the leaf nodes works as expected: ![ex1](https://user-images.githubusercontent.com/71546117/197225776-eb7cba50-6f6a-4f47-a91f-3b793021fdae.png) ![ex2](https://user-images.githubusercontent.com/71546117/197225789-d9510f36-d89a-44cd-ab69-1341edcafdfb.png) ------------- Commit messages: - JDK-8295461: IGV: Wrong src/dest nodes highlighted for edge Changes: https://git.openjdk.org/jdk/pull/10815/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10815&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295461 Stats: 58 lines in 1 file changed: 25 ins; 21 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/10815.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10815/head:pull/10815 PR: https://git.openjdk.org/jdk/pull/10815 From shade at openjdk.org Fri Oct 21 15:33:35 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 21 Oct 2022 15:33:35 GMT Subject: RFR: 8295795: hsdis does not build with binutils 2.39+ Message-ID: Fails like this: $ sh ./configure --with-boot-jdk=jdk19u-ea --with-hsdis=binutils --with-binutils-src=binutils-2.39 $ make clean build-hsdis === Output from failing command(s) repeated here === * For target support_hsdis_hsdis-binutils.o: ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/10817/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10817&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295795 Stats: 7 lines in 1 file changed: 7 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10817.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10817/head:pull/10817 PR: https://git.openjdk.org/jdk/pull/10817 From shade at openjdk.org Fri Oct 21 15:36:19 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 21 Oct 2022 15:36:19 GMT Subject: RFR: 8295795: hsdis does not build with binutils 2.39+ In-Reply-To: References: Message-ID: On Fri, 21 Oct 2022 15:26:59 GMT, Aleksey Shipilev wrote: > Fails like this: > > > $ sh ./configure --with-boot-jdk=jdk19u-ea 
--with-hsdis=binutils --with-binutils-src=binutils-2.39 > $ make clean build-hsdis > > === Output from failing command(s) repeated here === > * For target support_hsdis_hsdis-binutils.o: Oh no, this does not work, hsdis SEGVs. Let me try and fix it. ------------- PR: https://git.openjdk.org/jdk/pull/10817 From smonteith at openjdk.org Fri Oct 21 15:52:47 2022 From: smonteith at openjdk.org (Stuart Monteith) Date: Fri, 21 Oct 2022 15:52:47 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand In-Reply-To: References: Message-ID: <1lypmyiTkFm3yUPgS0XCikXwB9-jSnKvEL8M6CF8leY=.66218777-48b8-4e8d-b9f6-2d60cee9602b@github.com> On Mon, 3 Oct 2022 14:00:51 GMT, Stuart Monteith wrote: > The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. There are no scalar instructions that perform the equivalent functions on aarch64, instead the intrinsics can be implemented with vector instructions included in SVE2; expand with BDEP, compress with BEXT. > > Only the first lane of each vector will be used, two MOV instructions will move the inputs from GPRs into temporary vector registers, and another to do the reverse for the result. Autovectorization for this functionality is/will be implemented separately. 
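The scalar semantics being intrinsified are easy to check by hand in plain Java. Assuming the methods named above are the `Integer`/`Long` `compress`/`expand` added in JDK 19, a small demo:

```java
public class CompressExpandDemo {
    public static void main(String[] args) {
        int mask = 0b1111_0000;
        // compress gathers the bits of i selected by mask into the low bits:
        // bits 7..4 of 0b1010_1010 are 1010.
        int c = Integer.compress(0b1010_1010, mask);
        // expand scatters the low bits of i out to the positions set in mask.
        int e = Integer.expand(0b1010, mask);
        System.out.println(Integer.toBinaryString(c)); // 1010
        System.out.println(Integer.toBinaryString(e)); // 10100000
    }
}
```

These are exactly the bit-gather/bit-scatter operations that SVE2's BEXT and BDEP instructions perform per vector lane, which is why a single-lane vector sequence can replace the long scalar fallback.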
> > Running on an SVE2 enabled system, I ran the following benchmarks: > > org.openjdk.bench.java.lang.Integers > org.openjdk.bench.java.lang.Longs > > The time for each operation was reduced to between 56% and 72% of the original run time: > > > Benchmark Result error Unit % against non-SVE2 > Integers.expand 2.106 0.011 us/op > Integers.expand-SVE 1.431 0.009 us/op 67.95% > Longs.expand 2.606 0.006 us/op > Longs.expand-SVE 1.46 0.003 us/op 56.02% > Integers.compress 1.982 0.004 us/op > Integers.compress-SVE 1.427 0.003 us/op 72.00% > Longs.compress 2.501 0.002 us/op > Longs.compress-SVE 1.441 0.003 us/op 57.62% Would it be possible for there to be a review of this please? ------------- PR: https://git.openjdk.org/jdk/pull/10537 From smonteith at openjdk.org Fri Oct 21 16:49:04 2022 From: smonteith at openjdk.org (Stuart Monteith) Date: Fri, 21 Oct 2022 16:49:04 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand In-Reply-To: <1lypmyiTkFm3yUPgS0XCikXwB9-jSnKvEL8M6CF8leY=.66218777-48b8-4e8d-b9f6-2d60cee9602b@github.com> References: <1lypmyiTkFm3yUPgS0XCikXwB9-jSnKvEL8M6CF8leY=.66218777-48b8-4e8d-b9f6-2d60cee9602b@github.com> Message-ID: On Fri, 21 Oct 2022 15:50:40 GMT, Stuart Monteith wrote: > Would it be possible for there to be a review of this please? Apologies, I've updated the description with the testcases that are relevant to these methods. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10537 From duke at openjdk.org Fri Oct 21 18:09:51 2022 From: duke at openjdk.org (vpaprotsk) Date: Fri, 21 Oct 2022 18:09:51 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions In-Reply-To: References: Message-ID: On Fri, 21 Oct 2022 09:57:14 GMT, Tobias Hartmann wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. 
>> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > I executed some quick testing and this fails with: > > > [2022-10-21T09:54:28,696Z] # A fatal error has been detected by the Java Runtime Environment: > [2022-10-21T09:54:28,696Z] # > [2022-10-21T09:54:28,696Z] # Internal Error (/opt/mach5/mesos/work_dir/slaves/0c72054a-24ab-4dbb-944f-97f9341a1b96-S8380/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/5903b026-cdbd-4aa4-8433-6a45fb7ee593/runs/f75b29aa-40ef-46a5-b323-3a80aaa9aa6b/workspace/open/src/hotspot/cpu/x86/assembler_x86.cpp:5358), pid=2385300, tid=2385302 > [2022-10-21T09:54:28,696Z] # Error: assert(vector_len == AVX_128bit ? 
VM_Version::supports_avx() : vector_len == AVX_256bit ? VM_Version::supports_avx2() : vector_len == AVX_512bit ? VM_Version::supports_avx512bw() : 0) failed > [2022-10-21T09:54:28,696Z] # > [2022-10-21T09:54:28,696Z] # JRE version: (20.0) (fastdebug build ) > [2022-10-21T09:54:28,696Z] # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 20-internal-2022-10-21-0733397.tobias.hartmann.jdk2, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) > [2022-10-21T09:54:28,696Z] # Problematic frame: > [2022-10-21T09:54:28,696Z] # V [libjvm.so+0x6e3bf0] Assembler::vpslldq(XMMRegister, XMMRegister, int, int)+0x190 Hi @TobiHartmann , thanks for looking. Could you share CPU Model and flags from `hs_err` please? ------------- PR: https://git.openjdk.org/jdk/pull/10582 From kvn at openjdk.org Fri Oct 21 18:23:08 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 21 Oct 2022 18:23:08 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 21:28:26 GMT, vpaprotsk wrote: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 
86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s Test: jdk/incubator/vector/VectorMaxConversionTests.java#id1 Flags: `-ea -esa -XX:UseAVX=3 -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting -XX:+UseZGC` CPU: Intel 8358 (all AVX512 features). I think the problem is this subtest runs with ` -XX:+UseKNLSetting`[VectorMaxConversionTests.java#L50](https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/VectorMaxConversionTests.java#L50) which limits AVX512 features. 
Call stack: V [libjvm.so+0x6e3bf0] Assembler::vpslldq(XMMRegister, XMMRegister, int, int)+0x190 (assembler_x86.cpp:5358) V [libjvm.so+0x152a23b] MacroAssembler::poly1305_process_blocks_avx512(Register, Register, Register, Register, Register, Register, Register, Register)+0xc7b (macroAssembler_x86_poly.cpp:590) V [libjvm.so+0x152c23d] MacroAssembler::poly1305_process_blocks(Register, Register, Register, Register)+0x3ad (macroAssembler_x86_poly.cpp:849) V [libjvm.so+0x192dc00] StubGenerator::generate_poly1305_processBlocks()+0x170 (stubGenerator_x86_64.cpp:2069) V [libjvm.so+0x1936a89] StubGenerator::generate_initial()+0x419 (stubGenerator_x86_64.cpp:3798) V [libjvm.so+0x1937b78] StubGenerator_generate(CodeBuffer*, int)+0xf8 (stubGenerator_x86_64.hpp:526) V [libjvm.so+0x198e695] StubRoutines::initialize1() [clone .part.0]+0x155 (stubRoutines.cpp:229) V [libjvm.so+0xfc4342] init_globals()+0x32 (init.cpp:123) V [libjvm.so+0x1a7268f] Threads::create_vm(JavaVMInitArgs*, bool*)+0x37f ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Oct 21 20:12:05 2022 From: duke at openjdk.org (vpaprotsk) Date: Fri, 21 Oct 2022 20:12:05 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v2] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. 
> > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s vpaprotsk has updated the pull request incrementally with one additional commit since the last revision: Stash: fetch limbs directly ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10582/files - new: https://git.openjdk.org/jdk/pull/10582/files/7e070d9e..6a60c128 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=00-01 Stats: 61 lines in 3 files changed: 28 ins; 1 del; 32 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Oct 21 20:13:29 2022 From: duke at openjdk.org (vpaprotsk) Date: Fri, 21 Oct 2022 20:13:29 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 21:28:26 GMT, vpaprotsk wrote: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. 
> > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s (Apologies, ignore the `Stash: fetch limbs directly` commit.. got git commit command mixed up.. will force-push a fix to the crash in a sec) ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Oct 21 20:20:58 2022 From: duke at openjdk.org (vpaprotsk) Date: Fri, 21 Oct 2022 20:20:58 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v3] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. 
> > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s vpaprotsk has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. 
The pull request contains one new commit since the last revision: further restrict UsePolyIntrinsics with supports_avx512vlbw ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10582/files - new: https://git.openjdk.org/jdk/pull/10582/files/6a60c128..f048f938 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=01-02 Stats: 62 lines in 4 files changed: 1 ins; 28 del; 33 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Oct 21 20:28:56 2022 From: duke at openjdk.org (vpaprotsk) Date: Fri, 21 Oct 2022 20:28:56 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions In-Reply-To: References: Message-ID: On Fri, 21 Oct 2022 18:20:10 GMT, Vladimir Kozlov wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 
390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Test: jdk/incubator/vector/VectorMaxConversionTests.java#id1 > Flags: `-ea -esa -XX:UseAVX=3 -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting -XX:+UseZGC` > CPU: Intel 8358 (all AVX512 features). > > I think the problem is this subtest runs with ` -XX:+UseKNLSetting`[VectorMaxConversionTests.java#L50](https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/VectorMaxConversionTests.java#L50) which limits AVX512 features. > > Call stack: > > V [libjvm.so+0x6e3bf0] Assembler::vpslldq(XMMRegister, XMMRegister, int, int)+0x190 (assembler_x86.cpp:5358) > V [libjvm.so+0x152a23b] MacroAssembler::poly1305_process_blocks_avx512(Register, Register, Register, Register, Register, Register, Register, Register)+0xc7b (macroAssembler_x86_poly.cpp:590) > V [libjvm.so+0x152c23d] MacroAssembler::poly1305_process_blocks(Register, Register, Register, Register)+0x3ad (macroAssembler_x86_poly.cpp:849) > V [libjvm.so+0x192dc00] StubGenerator::generate_poly1305_processBlocks()+0x170 (stubGenerator_x86_64.cpp:2069) > V [libjvm.so+0x1936a89] StubGenerator::generate_initial()+0x419 (stubGenerator_x86_64.cpp:3798) > V [libjvm.so+0x1937b78] StubGenerator_generate(CodeBuffer*, int)+0xf8 (stubGenerator_x86_64.hpp:526) > V [libjvm.so+0x198e695] StubRoutines::initialize1() [clone .part.0]+0x155 (stubRoutines.cpp:229) > V [libjvm.so+0xfc4342] init_globals()+0x32 (init.cpp:123) > V [libjvm.so+0x1a7268f] Threads::create_vm(JavaVMInitArgs*, 
bool*)+0x37f Thanks @vnkozlov, was able to reproduce. @TobiHartmann, I added `supports_avx512vlbw` check to `UsePolyIntrinsics`. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From cslucas at openjdk.org Sat Oct 22 00:34:50 2022 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Sat, 22 Oct 2022 00:34:50 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v13] In-Reply-To: References: Message-ID: <2oRk_zhIijFndoBhzxjsMFizcXEdjIF2Iryi5DSstCA=.1e9a06f9-1717-4bd3-918a-4fde78a50d02@github.com> On Thu, 6 Oct 2022 16:50:28 GMT, Cesar Soares Lucas wrote: >> Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? >> >> The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: >> 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). >> 2) Scalar Replace the incoming allocations to the RAM node. >> 3) Scalar Replace the RAM node itself. >> >> There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: >> >> - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. >> >> These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: >> >> - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. >> - The way I check if there is an incoming Allocate node to the original Phi node. >> - The way I check if there is no store to the merged objects after they are merged. 
>> >> Testing: >> - Windows/Linux/MAC fastdebug/release >> - hotspot_all >> - tier1 >> - Renaissance >> - dacapo >> - new IR-based tests > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Fix x86 tests. Hi Vladimir, first of all, thank you for reviewing the proposed patch. Sorry for the delay answering your questions, I'm right now on paternity leave from work and my time in front of a computer is being quite limited. > As of now, I don't fully grasp what's the purpose and motivation to introduce > ReducedAllocationMerge. I would be grateful for additional information about how > you ended up with the current design. (I went through the email thread, but it > didn't help me.) My first implementation to deal with the problem was just replacing Phi's merging object allocations by sets of Phi's merging field's loads from each different base. That works fine in some cases. However, one challenge that I faced in this approach, IIRC, was that the new Phi nodes (merging loads of fields) were being factored away (eliminated) as part of IGVN optimizations and consequently the graph ended up again with Phi's merging object allocations. I considered inserting Opaque nodes in some branches of the graph to prevent unadverted optimizations but at the end I found the approach below more "robust". The proposed idea of using a new type of node to represent the merges have these Pros, IMO. - By replacing Phi's merging allocations by a new type of node (i.e., RAM) the existing code (split_unique_types) will be able to assign `instance_id`s to the scalar replaceable inputs of the merging Phi. That means that those inputs will be scalar replaced using existing code in C2. Additional code will be necessary only to scalar replace the new type of node. 
- As a side effect of the point described above, we'll also be able to use existing code in C2 to find the last stored values to fields (i.e., find_inst_mem, value_from_mem, etc). - Existing optimizations will not interfere with the scalar replacement (as they did in the previous approach outlined) because they don't know how to handle the new type of node. In essence, the new type of node will be Opaque to existing optimizations. - We can safely, AFAIU, run igvn.optimize() after replacing the allocation merge Phis with the new type of node. - Lastly, an additional benefit of using a new type of node to store information, _instead_ of storing it in the ConnectionGraph, for instance, is that by using graph edges to capture the required "information" that the node needs, the value is never in an "outdated" state. For instance, the current patch uses input slots of the RAM node to store references to the required memory edges. Any time a transformation happens in the graph and affects that memory edge, the RAM node will be "automatically" updated using existing code in C2. If, instead, the memory edge were just stored as part of the internal data of some class, then we'd need to handle those updates manually, AFAIU. Some arguable cons of the current approach: - Seems complex at first glance. - A new (non-functional) node in the IR. The node is quite similar to a PhiNode in the sense that it's there just to represent a state (or some set of information). - The new node is a Macro node. Failure to remove it from the IR graph can cause compilation failure. > In particular, I still don't understand how it interacts with existing scalar > replacement logic when it comes to unique (per-allocation) memory slices. The RAM node itself doesn't interfere with alias indexes / memory slices creation at all. RAM nodes are created before "adjust_scalar_replaceable_state" is executed and because of that the allocations participating in merges _may_ be marked as ScalarReplaceable. 
Later, "split_unique_types" will run and be able to assign instance_id's to allocations participating in the merge because there is "virtually" no merge anymore at this point. During the Macro node elimination phase the allocation nodes will be visited and potentially scalar replaced _before_ any RAM node is visited. When an allocation node being scalar replaced is consumed by a RAM node, some information is "registered" in the RAM node. That information will later be used when the time comes to scalar replace the RAM node itself. I hope I have answered your question. If not, please let me know and I'll be happy to give more details. > How hard would it be to extend the test with cases which demonstrate existing > limitations? I'll try and create some test cases that trigger those limitations. > Also, I believe you face some ideal graph inconsistencies because you capture > information too early (before split_unique_types and following IGVN pass; and > previous allocation eliminations during eliminate_macro_nodes() may contribute > to that). Can you please elaborate on that? > Following up on my earlier question about interactions with > split_unique_types(), I'm worried that you remove corresponding LocalVars from > the ConnectionGraph and introduce with unique memory slices. I'd feel much more > confident in the correctness if you split slices for unions of interacting > allocations instead. Can you please explain a bit more about this idea? Are you proposing to split the phi into slices instead of removing it? 
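For context, a minimal hypothetical Java shape that produces the kind of allocation merge being discussed, a `Phi` merging two `Allocate` nodes that is later loaded through (this is an illustration, not one of the PR's test cases):

```java
public class MergeExample {
    static class Point {
        final int x;
        Point(int x) { this.x = x; }
    }

    // In C2's ideal graph this becomes roughly:
    //   p = Phi(Allocate(Point), Allocate(Point)); return Load(p.x)
    // Without support for reducing the merge, neither allocation can be
    // scalar replaced, because each escapes into the Phi.
    static int merged(boolean cond) {
        Point p = cond ? new Point(1) : new Point(2);
        return p.x;
    }

    public static void main(String[] args) {
        System.out.println(merged(true) + merged(false)); // 3
    }
}
```

The approach described in the thread replaces that `Phi` with a reduced-merge node so the two `Allocate`s can receive their own instance ids and be eliminated individually.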
------------- PR: https://git.openjdk.org/jdk/pull/9073 From duke at openjdk.org Sat Oct 22 03:50:52 2022 From: duke at openjdk.org (Mkkebe) Date: Sat, 22 Oct 2022 03:50:52 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: <9zGpQVcVnb9jCou25c_iXpCgitg-sXdaeeLjHUjQOmU=.a3323a68-b27a-486c-8168-c52f34434ced@github.com> On Thu, 20 Oct 2022 12:04:32 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > minor renaming. adding encoding examples table Marked as reviewed by Mkkebe at github.com (no known OpenJDK username). ------------- PR: https://git.openjdk.org/jdk/pull/10025 From jiefu at openjdk.org Sat Oct 22 03:51:53 2022 From: jiefu at openjdk.org (Jie Fu) Date: Sat, 22 Oct 2022 03:51:53 GMT Subject: RFR: 8295762: [Vector API] Update generate_iota_indices for x86_32 after JDK-8293409 In-Reply-To: References: Message-ID: <1mlJBKT0bt03Ovuj6qLPNdrbgngxzUagyERbYXNBmXM=.4f599499-c428-4c05-baba-3c017ea4bfdf@github.com> On Fri, 21 Oct 2022 05:20:34 GMT, Vladimir Kozlov wrote: > > How about adding the vector api tests in GHA in a separate PR? > > Yes, it is definitely separate changes. Thanks @vnkozlov . Will do it next week. 
------------- PR: https://git.openjdk.org/jdk/pull/10807 From jiefu at openjdk.org Sat Oct 22 03:53:20 2022 From: jiefu at openjdk.org (Jie Fu) Date: Sat, 22 Oct 2022 03:53:20 GMT Subject: Integrated: 8295762: [Vector API] Update generate_iota_indices for x86_32 after JDK-8293409 In-Reply-To: References: Message-ID: On Fri, 21 Oct 2022 03:55:50 GMT, Jie Fu wrote: > Hi all, > > Many vector api tests fail on x86_32 after JDK-8293409 due to computing incorrect results. > The reason is that `generate_iota_indices` was updated only for x86_64 in JDK-8293409. > So let's fix it for x86_32. > > Testing: > - vector api tests on x86_32, all passed > > Thanks. > Best regards, > Jie This pull request has now been integrated. Changeset: adad59ee Author: Jie Fu URL: https://git.openjdk.org/jdk/commit/adad59ee11b84958f127d04835762b4f0fd5fb21 Stats: 91 lines in 1 file changed: 91 ins; 0 del; 0 mod 8295762: [Vector API] Update generate_iota_indices for x86_32 after JDK-8293409 Reviewed-by: xgong, kvn ------------- PR: https://git.openjdk.org/jdk/pull/10807 From bulasevich at openjdk.org Sun Oct 23 09:11:45 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Sun, 23 Oct 2022 09:11:45 GMT Subject: RFR: 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes Message-ID: 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes ------------- Commit messages: - 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes Changes: https://git.openjdk.org/jdk/pull/10392/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10392&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8293999 Stats: 17 lines in 2 files changed: 17 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10392.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10392/head:pull/10392 PR: https://git.openjdk.org/jdk/pull/10392 From aturbanov at openjdk.org Sun Oct 23 11:28:41 2022 From: aturbanov at 
openjdk.org (Andrey Turbanov) Date: Sun, 23 Oct 2022 11:28:41 GMT Subject: RFR: JDK-8295461: IGV: Wrong src/dest nodes highlighted for edge In-Reply-To: <1ZYB8cu42bWNWIv0w8B69kHk-Vva8FiakNV06HUxM7k=.855fa4ed-c31e-4899-8f4a-4eea3b3c010e@github.com> References: <1ZYB8cu42bWNWIv0w8B69kHk-Vva8FiakNV06HUxM7k=.855fa4ed-c31e-4899-8f4a-4eea3b3c010e@github.com> Message-ID: On Fri, 21 Oct 2022 13:53:51 GMT, Tobias Holenstein wrote: > Outgoing edges in IGV are organized like trees. When hovering with the mouse over a segment of an edge, the edge is highlighted all the way up to the source node, as well as all the way down to the leaf nodes. This works as expected. The nodes at the source and leaves of the highlighted segments are highlighted as well. There is a bug that instead of only highlighting the leaf nodes in the subtree, IGV highlights all leaf nodes: > ![before](https://user-images.githubusercontent.com/71546117/197223077-962a5d97-c1c8-4720-9983-295e07468be9.png) > > # Solution > The segments in the edge tree are `LineWidget` objects. Each `LineWidget` has a single `LineWidget` `predecessor` and a list of `LineWidget` `successors`. When hovering over a line segment the `notifyStateChanged` function is called in the corresponding `LineWidget`: `predecessor` and `successors` `LineWidget` are recursively visited and highlighted here. The nodes (of super type `Vertex`) are now highlighted with a new `highlightVertices` function instead of using recursion. > `highlightVertices(boolean enable)` uses the list of `connections` which already contains all the one-to-one connections between src/dest nodes that go through a single `LineWidget` segment. This gives us the `Vertex` nodes of the root as well as the leaves in the subtree of the hovered `LineWidget` segment. 
> > Now the highlighting of the leaf nodes works as expected: > ![ex1](https://user-images.githubusercontent.com/71546117/197225776-eb7cba50-6f6a-4f47-a91f-3b793021fdae.png) > ![ex2](https://user-images.githubusercontent.com/71546117/197225789-d9510f36-d89a-44cd-ab69-1341edcafdfb.png) src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/widgets/LineWidget.java line 289: > 287: } > 288: } > 289: if(enable) { Suggestion: if (enable) { ------------- PR: https://git.openjdk.org/jdk/pull/10815 From aturbanov at openjdk.org Sun Oct 23 12:55:05 2022 From: aturbanov at openjdk.org (Andrey Turbanov) Date: Sun, 23 Oct 2022 12:55:05 GMT Subject: RFR: JDK-8294565: IGV: ClassCastException when clicking on an edge in the graph In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 11:13:35 GMT, Tobias Holenstein wrote: > IGV crashed when the user clicked on any edge in the graph because the `select` method of the `SelectProvider` in `LineWidget.java` was faulty. > > # Implementation > - `ActionFactory.createSelectAction` was changed to `CustomSelectAction` since it also supports inverting the selection with `Ctrl/CMD` > - The `select` method gets called when the user clicks on an edge. `LineWidget` represents a single connection going out of a node and connecting to one or more nodes (one-to-many). `LineWidget` has a list of `Connection`'s that each represent a single link between two nodes (one-to-one). For each `connection` we collect the from/to `Vertex` (superclass for a node) and put them into a set that we then use to select the nodes that are connected to the user-clicked `LineWidget`. 
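The connection-based selection described above can be sketched in plain Java. Note that `Vertex` and `Connection` below are minimal stand-ins written for this example, not the real IGV types:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal stand-ins for IGV's node/edge types (assumptions for
// illustration only; the real IGV classes have richer APIs).
public class SelectSketch {
    record Vertex(String name) {}
    record Connection(Vertex from, Vertex to) {}

    // Collect the src/dest nodes of every one-to-one connection that goes
    // through the clicked LineWidget, instead of recursing over the widget
    // tree (which is what highlighted too many nodes before the fix).
    static Set<Vertex> endpoints(List<Connection> connections) {
        Set<Vertex> selected = new HashSet<>();
        for (Connection c : connections) {
            selected.add(c.from());
            selected.add(c.to());
        }
        return selected;
    }

    public static void main(String[] args) {
        Vertex a = new Vertex("a"), b = new Vertex("b"), c = new Vertex("c");
        // One-to-many: a -> b and a -> c share the same source vertex
        Set<Vertex> s = endpoints(List.of(new Connection(a, b), new Connection(a, c)));
        System.out.println(s.size()); // 3
    }
}
```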
src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/widgets/LineWidget.java line 130: > 128: > 129: @Override > 130: public boolean isAimingAllowed(Widget widget, Point localLocation, boolean invertSelection) { Suggestion: public boolean isAimingAllowed(Widget widget, Point localLocation, boolean invertSelection) { src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/widgets/LineWidget.java line 135: > 133: > 134: @Override > 135: public boolean isSelectionAllowed(Widget widget, Point localLocation, boolean invertSelection) { Suggestion: public boolean isSelectionAllowed(Widget widget, Point localLocation, boolean invertSelection) { src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/widgets/LineWidget.java line 140: > 138: > 139: @Override > 140: public void select(Widget widget, Point localLocation, boolean invertSelection) { Suggestion: public void select(Widget widget, Point localLocation, boolean invertSelection) { ------------- PR: https://git.openjdk.org/jdk/pull/10760 From xgong at openjdk.org Mon Oct 24 01:41:48 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 24 Oct 2022 01:41:48 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand In-Reply-To: References: Message-ID: On Mon, 3 Oct 2022 14:00:51 GMT, Stuart Monteith wrote: > The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. There are no scalar instructions that perform the equivalent functions on aarch64, instead the intrinsics can be implemented with vector instructions included in SVE2; expand with BDEP, compress with BEXT. > > Only the first lane of each vector will be used, two MOV instructions will move the inputs from GPRs into temporary vector registers, and another to do the reverse for the result. Autovectorization for this functionality is/will be implemented separately. 
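The scalar semantics being intrinsified above can be stated in plain Java. The reference loop below is a sketch of the documented bit-gather behavior of `Integer.compress` (available since JDK 19); `compressRef` is a helper written for this example:

```java
public class CompressExpandDemo {
    // Reference implementation of Integer.compress: gather the bits of i
    // at positions where mask has a 1 bit, packed into the low-order bits
    // of the result (the scalar equivalent of x86 PEXT / SVE2 BEXT).
    static int compressRef(int i, int mask) {
        int result = 0, bit = 0;
        for (int pos = 0; pos < 32; pos++) {
            if ((mask & (1 << pos)) != 0) {
                if ((i & (1 << pos)) != 0) {
                    result |= 1 << bit;
                }
                bit++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int i = 0b1010_1010, mask = 0b1111_0000;
        // Bits 4..7 of i are 1010, so compress packs them into 0b1010.
        System.out.println(compressRef(i, mask) == Integer.compress(i, mask)); // true
        // expand is the inverse: scatter low-order bits back under the mask.
        System.out.println(Integer.expand(Integer.compress(i, mask), mask));   // 160
    }
}
```

`Long.compress`/`Long.expand` behave the same way over 64 bits, which is what the BEXT/BDEP-based intrinsic accelerates.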
> > Running on an SVE2-enabled system, I ran the following benchmarks: > > org.openjdk.bench.java.lang.Integers > org.openjdk.bench.java.lang.Longs > > The time for each operation was reduced to 56% to 72% of the original run time: > >
> Benchmark              Result  Error  Unit   % against non-SVE2
> Integers.expand        2.106   0.011  us/op
> Integers.expand-SVE    1.431   0.009  us/op  67.95%
> Longs.expand           2.606   0.006  us/op
> Longs.expand-SVE       1.46    0.003  us/op  56.02%
> Integers.compress      1.982   0.004  us/op
> Integers.compress-SVE  1.427   0.003  us/op  72.00%
> Longs.compress         2.501   0.002  us/op
> Longs.compress-SVE     1.441   0.003  us/op  57.62%
> > These methods can be specifically tested with: `make test TEST="jtreg:compiler/intrinsics/TestBitShuffleOpers.java"` Looks good to me! ------------- Marked as reviewed by xgong (Committer). PR: https://git.openjdk.org/jdk/pull/10537 From eliu at openjdk.org Mon Oct 24 03:10:52 2022 From: eliu at openjdk.org (Eric Liu) Date: Mon, 24 Oct 2022 03:10:52 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v3] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 14:27:34 GMT, Bhavana Kilambi wrote: >> Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example -
>> eor a, a, b
>> eor a, a, c
>> can be optimized to a single instruction - `eor3 a, b, c` >> >> This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. 
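The rewrite quoted above is legal because XOR is associative; a small Java sketch of the equivalence (the method names are made up for illustration):

```java
public class Eor3Demo {
    // Two chained XORs, as emitted today: eor a, a, b ; eor a, a, c
    static long twoEors(long a, long b, long c) {
        a = a ^ b;
        a = a ^ c;
        return a;
    }

    // The fused form the backend rule selects: eor3 a, b, c == a ^ b ^ c
    static long eor3(long a, long b, long c) {
        return a ^ b ^ c;
    }

    public static void main(String[] args) {
        long a = 0x0123456789ABCDEFL, b = 0xF0F0F0F0F0F0F0F0L, c = 0x0F0F0F0F0F0F0F0FL;
        System.out.println(twoEors(a, b, c) == eor3(a, b, c)); // true
    }
}
```

The patch applies the same idea lane-wise to vectors, matching two chained vector XOR nodes and emitting a single eor3 instruction.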
Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports the Neon, SVE2 and SHA3 features:
>> Benchmark           Gain
>> TestEor3.test1Int   10.87%
>> TestEor3.test1Long  8.84%
>> TestEor3.test2Int   21.68%
>> TestEor3.test2Long  21.04%
>> The numbers shown are performance gains from using the Neon eor3 instruction over the master branch, which uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well, since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits, which makes the SVE2 code generation very similar to the Neon one. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Changed the modifier order preference in JTREG test LGTM. ------------- Marked as reviewed by eliu (Committer). PR: https://git.openjdk.org/jdk/pull/10407 From tholenstein at openjdk.org Mon Oct 24 07:22:14 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 24 Oct 2022 07:22:14 GMT Subject: RFR: JDK-8295461: IGV: Wrong src/dest nodes highlighted for edge [v2] In-Reply-To: <1ZYB8cu42bWNWIv0w8B69kHk-Vva8FiakNV06HUxM7k=.855fa4ed-c31e-4899-8f4a-4eea3b3c010e@github.com> References: <1ZYB8cu42bWNWIv0w8B69kHk-Vva8FiakNV06HUxM7k=.855fa4ed-c31e-4899-8f4a-4eea3b3c010e@github.com> Message-ID: > Outgoing edges in IGV are organized like trees. When hovering with the mouse over a segment of an edge, the edge is highlighted all the way up to the source node, as well as all the way down to the leaf nodes. This works as expected. The nodes at the source and leaves of the highlighted segments are highlighted as well. 
There is a bug that instead of only highlighting the leaf nodes in the subtree, IGV highlights all leaf nodes: > ![before](https://user-images.githubusercontent.com/71546117/197223077-962a5d97-c1c8-4720-9983-295e07468be9.png) > > # Solution > The segments in the edge tree are `LineWidget` objects. Each `LineWidget` has a single `LineWidget` `predecessor` and a list of `LineWidget` `successors`. When hovering over a line segment the `notifyStateChanged` function is called in the corresponding `LineWidget`: `predecessor` and `successors` `LineWidget` are recursively visited and highlighted here. The nodes (of super type `Vertex`) are now highlighted with a new `highlightVertices` function instead of using recursion. > `highlightVertices(boolean enable)` uses the list of `connections` which already contains all the one-to-one connections between src/dest nodes that go through a single `LineWidget` segment. This gives us the `Vertex` nodes of the root as well as the leaves in the subtree of the hovered `LineWidget` segment. 
> > Now the highlighting of the leaf nodes works as expected: > ![ex1](https://user-images.githubusercontent.com/71546117/197225776-eb7cba50-6f6a-4f47-a91f-3b793021fdae.png) > ![ex2](https://user-images.githubusercontent.com/71546117/197225789-d9510f36-d89a-44cd-ab69-1341edcafdfb.png) Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: whitespace Co-authored-by: Andrey Turbanov ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10815/files - new: https://git.openjdk.org/jdk/pull/10815/files/cb318879..69093c3d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10815&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10815&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10815.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10815/head:pull/10815 PR: https://git.openjdk.org/jdk/pull/10815 From tholenstein at openjdk.org Mon Oct 24 07:30:10 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 24 Oct 2022 07:30:10 GMT Subject: RFR: JDK-8294565: IGV: ClassCastException when clicking on an edge in the graph [v2] In-Reply-To: References: Message-ID: > IGV crashed when the user clicked on any edge in the graph because the `select` method of the `SelectProvider` in `LineWidget.java` was faulty. > > # Implementation > - `ActionFactory.createSelectAction` was changed to `CustomSelectAction` since it also supports inverting the selection with `Ctrl/CMD` > - The `select` method gets called when the user clicks on an edge. `LineWidget` represents a single connection going out of a node and connecting to one or more nodes (one-to-many). `LineWidget` has a list of `Connection`'s that each represent a single link between two nodes (one-to-one). 
For each `connection` we collect the from/to `Vertex` (superclass for a node) and put them into a set that we then use to select the nodes that are connected to the user-clicked `LineWidget`. Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: whitespace Co-authored-by: Andrey Turbanov ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10760/files - new: https://git.openjdk.org/jdk/pull/10760/files/9dac84c3..e326b033 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10760&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10760&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10760.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10760/head:pull/10760 PR: https://git.openjdk.org/jdk/pull/10760 From tholenstein at openjdk.org Mon Oct 24 07:35:43 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 24 Oct 2022 07:35:43 GMT Subject: RFR: JDK-8294565: IGV: ClassCastException when clicking on an edge in the graph [v3] In-Reply-To: References: Message-ID: > IGV crashed when the user clicked on any edge in the graph because the `select` method of the `SelectProvider` in `LineWidget.java` was faulty. > > # Implementation > - `ActionFactory.createSelectAction` was changed to `CustomSelectAction` since it also supports inverting the selection with `Ctrl/CMD` > - The `select` method gets called when the user clicks on an edge. `LineWidget` represents a single connection going out of a node and connecting to one or more nodes (one-to-many). `LineWidget` has a list of `Connection`'s that each represent a single link between two nodes (one-to-one). For each `connection` we collect the from/to `Vertex` (superclass for a node) and put them into a set that we then use to select the nodes that are connected to the user-clicked `LineWidget`. 
Tobias Holenstein has updated the pull request incrementally with two additional commits since the last revision: - whitespace Co-authored-by: Andrey Turbanov - whitespace Co-authored-by: Andrey Turbanov ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10760/files - new: https://git.openjdk.org/jdk/pull/10760/files/e326b033..cd327309 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10760&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10760&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10760.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10760/head:pull/10760 PR: https://git.openjdk.org/jdk/pull/10760 From rrich at openjdk.org Mon Oct 24 08:01:52 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 24 Oct 2022 08:01:52 GMT Subject: RFR: 8295413: com/sun/jdi/EATests.java fails with compiler flag -XX:+StressReflectiveCode [v2] In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 20:55:12 GMT, Richard Reingruber wrote: >> With `StressReflectiveCode` C2 has inexact type information which can prevent ea >> based optimizations (see `ConnectionGraph::add_call_node()`) >> >> This pr changes the test jdk/com/sun/jdi/EATests.java to read the flag >> `StressReflectiveCode`. If enabled it shall neither expect ea based optimizations >> of allocations nor deoptimization of corresponding frames upon debugger access. >> >> Tested on the standard platforms with fastdebug and release builds. >> >> >> make test TEST=test/jdk/com/sun/jdi/EATests.java TEST_VM_OPTS="-XX:+IgnoreUnrecognizedVMOptions -XX:+StressReflectiveCode" > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Skip all test cases if StressReflectiveCode is enabled Thanks for reviewing. Richard. 
------------- PR: https://git.openjdk.org/jdk/pull/10769 From rrich at openjdk.org Mon Oct 24 08:03:20 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 24 Oct 2022 08:03:20 GMT Subject: Integrated: 8295413: com/sun/jdi/EATests.java fails with compiler flag -XX:+StressReflectiveCode In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 15:43:17 GMT, Richard Reingruber wrote: > With `StressReflectiveCode` C2 has inexact type information which can prevent ea > based optimizations (see `ConnectionGraph::add_call_node()`) > > This pr changes the test jdk/com/sun/jdi/EATests.java to read the flag > `StressReflectiveCode`. If enabled it shall neither expect ea based optimizations > of allocations nor deoptimization of corresponding frames upon debugger access. > > Tested on the standard platforms with fastdebug and release builds. > > > make test TEST=test/jdk/com/sun/jdi/EATests.java TEST_VM_OPTS="-XX:+IgnoreUnrecognizedVMOptions -XX:+StressReflectiveCode" This pull request has now been integrated. Changeset: 08d3ef4f Author: Richard Reingruber URL: https://git.openjdk.org/jdk/commit/08d3ef4fe60460d94b0a2db0b6671adc56a6653c Stats: 16 lines in 1 file changed: 16 ins; 0 del; 0 mod 8295413: com/sun/jdi/EATests.java fails with compiler flag -XX:+StressReflectiveCode Reviewed-by: lmesnik, kvn, sspitsyn ------------- PR: https://git.openjdk.org/jdk/pull/10769 From yyang at openjdk.org Mon Oct 24 08:13:06 2022 From: yyang at openjdk.org (Yi Yang) Date: Mon, 24 Oct 2022 08:13:06 GMT Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v3] In-Reply-To: References: Message-ID: > Hi can I have a review for this fix? LoadBNode::Ideal crashes after performing GVN right after EA. 
The bad IR is as follows: > > ![image](https://user-images.githubusercontent.com/5010047/183106710-3a518e5e-0b59-4c3c-aba4-8b6fcade3519.png) > > The memory input of Load#971 is Phi#1109 and the address input of Load#971 is AddP whose object base is CheckCastPP#335: > > The type of Phi#1109 is `byte[int:>=0]:exact+any *` while `byte[int:8]:NotNull:exact+any *,iid=177` is the type of CheckCastPP#335 due to EA, they have different alias index, that's why we hit the assertion at L226: > > https://github.com/openjdk/jdk/blob/b17a745d7f55941f02b0bdde83866aa5d32cce07/src/hotspot/share/opto/memnode.cpp#L207-L226 > (t is `byte[int:>=0]:exact+any *`, t_adr is `byte[int:8]:NotNull:exact+any *,iid=177`). > > There is a long story. In the beginning, LoadB#971 is generated at array_copy_forward, and GVN transformed it iteratively: > > 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 
(line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1109 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > ... > > In this case, we get alias index 5 from address input AddP#969, and step it through MergeMem#1046, we found Phi#1109 then, that's why LoadB->in(Mem) is changed from MergeMem#1046 to Phi#1109 (Which finally leads to crash). > > 1046 MergeMem === _ 1 160 389 389 1109 1 1 389 1 1 1 1 1 1 1 1 1 1 1 1 1 709 709 709 709 882 888 894 190 190 912 191 [[ 1025 1021 1017 1013 1009 1005 1002 1001 998 996 991 986 981 976 971 966 962 961 960 121 122 123 124 1027 ]] > > > After applying this patch, some related nodes are pushed into the GVN worklist, before stepping through MergeMem#1046, the address input is already changed to AddP#473. i.e., we get alias index 32 from address input AddP#473, and step it through MergeMem#1046, we found StoreB#191 then,LoadB->in(Mem) is changed from MergeMem#1046 to StoreB#191. 
> > 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 1046 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) 
StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 
Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > 971 LoadB === 390 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) > ... > > The well-formed IR looks like this: > ![image](https://user-images.githubusercontent.com/5010047/183239456-7096ea66-6fca-4c84-8f46-8c42d10b686a.png) > > Thanks for your patience. 
Yi Yang has updated the pull request incrementally with two additional commits since the last revision: - fix - always clone the Phi with address type ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9777/files - new: https://git.openjdk.org/jdk/pull/9777/files/063d2468..2d9c3c56 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9777&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9777&range=01-02 Stats: 27 lines in 3 files changed: 1 ins; 24 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9777.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9777/head:pull/9777 PR: https://git.openjdk.org/jdk/pull/9777 From dnsimon at openjdk.org Mon Oct 24 08:40:54 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 24 Oct 2022 08:40:54 GMT Subject: RFR: 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes In-Reply-To: References: Message-ID: On Thu, 22 Sep 2022 14:30:10 GMT, Boris Ulasevich wrote: > 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes src/hotspot/share/asm/codeBuffer.hpp line 755: > 753: if (EnableJVMCI) { > 754: // Graal vectorization requires larger aligned constants > 755: return 64; This means all Graal installed code will pay a penalty even though most installed code does not include constants that need such large alignment. It would be preferable to allow a compiler to specify the alignment requirement per nmethod. ------------- PR: https://git.openjdk.org/jdk/pull/10392 From thartmann at openjdk.org Mon Oct 24 09:06:55 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 24 Oct 2022 09:06:55 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v3] In-Reply-To: References: Message-ID: <9XWZNcNcmELCLXDwpuNgpztPrw8xXajJQcj_daf4jhU=.4af44336-021f-4688-9a56-6a90c8e12f53@github.com> On Fri, 21 Oct 2022 20:20:58 GMT, vpaprotsk wrote: >> Handcrafted x86_64 asm for Poly1305. 
Main optimization is to process 16 message blocks at a time. For more details, I left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added a new KAT test for Poly1305 and a fuzz test to compare the intrinsic and Java implementations. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented-out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S==0), so I would like advice please. >> - Added a JMH perf test. >> - The JMH test had to use reflection (instead of the existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before:
>> Benchmark                  (dataSize) (provider) Mode Cnt       Score        Error Units
>> Poly1305DigestBench.digest         64            thrpt   8 2961300.661 ± 110554.162 ops/s
>> Poly1305DigestBench.digest        256            thrpt   8 1791912.962 ±  86696.037 ops/s
>> Poly1305DigestBench.digest       1024            thrpt   8  637413.054 ±  14074.655 ops/s
>> Poly1305DigestBench.digest      16384            thrpt   8   48762.991 ±    390.921 ops/s
>> Poly1305DigestBench.digest    1048576            thrpt   8     769.872 ±      1.402 ops/s
>> and after:
>> Benchmark                  (dataSize) (provider) Mode Cnt       Score        Error Units
>> Poly1305DigestBench.digest         64            thrpt   8 2841243.668 ± 154528.057 ops/s
>> Poly1305DigestBench.digest        256            thrpt   8 1662003.873 ±  95253.445 ops/s
>> Poly1305DigestBench.digest       1024            thrpt   8 1770028.718 ± 100847.766 ops/s
>> Poly1305DigestBench.digest      16384            thrpt   8  765547.287 ±  25883.825 ops/s
>> Poly1305DigestBench.digest    1048576            thrpt   8   14508.458 ±     56.147 ops/s
> > vpaprotsk has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > further restrict UsePolyIntrinsics with supports_avx512vlbw Thanks, I'll re-run testing. 
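As the thread notes, Poly1305 is not registered as a standalone JCE `Mac`; the usual way user code exercises it is through the ChaCha20-Poly1305 AEAD cipher (standard in the JDK since 11). A minimal round-trip sketch:

```java
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public class ChaChaPolyDemo {
    // Encrypt-then-decrypt round trip; the AEAD mode appends a 16-byte
    // Poly1305 authentication tag to the ciphertext.
    static boolean roundTrip(byte[] msg) throws Exception {
        SecretKey key = KeyGenerator.getInstance("ChaCha20").generateKey();
        byte[] nonce = new byte[12]; // 96-bit nonce, must be unique per key
        new SecureRandom().nextBytes(nonce);

        Cipher enc = Cipher.getInstance("ChaCha20-Poly1305");
        enc.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(nonce));
        byte[] ct = enc.doFinal(msg); // ciphertext || 16-byte Poly1305 tag

        Cipher dec = Cipher.getInstance("ChaCha20-Poly1305");
        dec.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(nonce));
        return ct.length == msg.length + 16 && Arrays.equals(dec.doFinal(ct), msg);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("hello".getBytes()));
    }
}
```

The intrinsic in this PR accelerates the Poly1305 tag computation underneath this API without changing its behavior.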
------------- PR: https://git.openjdk.org/jdk/pull/10582 From rcastanedalo at openjdk.org Mon Oct 24 09:39:58 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 24 Oct 2022 09:39:58 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases [v5] In-Reply-To: References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> Message-ID: <3i5VcGTVGwBrQMhaISUswB5FNWGioSlH2dNNzF4jxJo=.d35b6465-cef7-4e01-80d8-62f5f948359e@github.com> On Wed, 19 Oct 2022 08:19:16 GMT, Christian Hagedorn wrote: >> This patch extends the IR framework with the capability to perform IR matching not only on the `PrintIdeal` and `PrintOptoAssembly` flag outputs but also on the `PrintIdeal` output of the different compile phases defined by the `COMPILER_PHASES` macro in `phasetype.hpp`: >> >> https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/opto/phasetype.hpp#L28-L29 >> >> The IR framework uses the same compile phases with the same names (it only drops those which are part of a normal execution like `PHASE_DEBUG` or `PHASE_FAILURE`). A matching on the `PrintIdeal` and `PrintOptoAssembly` flags is still possible. The IR framework creates an own `CompilePhase` enum entry for them to simplify the implementation. >> >> ## How does it work? >> >> ### Basic idea >> There is a new `phase` attribute for `@IR` annotations that allows the user to specify a list of compile phases on which the IR rule should be applied on: >> >> >> int iFld; >> >> @Test >> @IR(counts = {IRNode.STORE_I, "1"}, >> phase = {CompilePhase.AFTER_PARSING, // Fails >> CompilePhase.ITER_GVN1}) // Works >> public void optimizeStores() { >> iFld = 42; >> iFld = 42 + iFld; // Removed in first IGVN iteration and replaced by iFld = 84 >> } >> >> In this example, we apply the IR rule on the compile phases `AFTER_PARSING` and `ITER_GVN1`. 
Since the first store to `iFld` is only removed in the first IGVN iteration, the IR rule fails for `AFTER_PARSING` while it passes for `ITER_GVN1`: >> >> 1) Method "public void ir_framework.examples.IRExample.optimizeStores()" - [Failed IR rules: 1]: >> * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING, ITER_GVN1}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#STORE_I#_", "1"}, failOn={}, applyIfAnd={}, applyIfOr={}, applyIfNot={})" >> > Phase "After Parsing": >> - counts: Graph contains wrong number of nodes: >> * Constraint 1: "(\d+(\s){2}(StoreI.*)+(\s){2}===.*)" >> - Failed comparison: [found] 2 = 1 [given] >> - Matched nodes (2): >> * 25 StoreI === 5 7 24 21 [[ 31 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:3 (line 123) >> * 31 StoreI === 5 25 24 29 [[ 16 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:14 (line 124) >> >> >> More examples are shown in `IRExample.java` and also in the new test `TestPhaseIRMatching.java`. >> >> ### CompilePhase.DEFAULT - default compile phase >> The existing IR tests either match on the `PrintIdeal` and/or `PrintOptoAssembly` flag. Looking closer at the individual `@IR` rules, we can see that a single regex only matches **either** on `PrintIdeal` **or** on `PrintOptoAssembly` but never on both. Most of these regexes are taken from the class `IRNode` which currently provides default regexes for various C2 IR nodes (a default regex either matches on `PrintIdeal` or `PrintOptoAssembly`). >> >> Not only the existing IR tests but also the majority of future test do not need this new flexibility - they simply want to default match on the `PrintIdeal` flag. 
To avoid having to specify `phase = CompilePhase.PRINT_IDEAL` for each new rule (and also to avoid updating the large number of existing IR tests), we introduce a new compile phase `CompilePhase.DEFAULT` which is used by default if the user does not specify the `phase` attribute. >> >> Each entry in class `IRNode` now needs to define a default phase such that the IR framework knows which compile phase an `IRNode` should be matched on if `CompilePhase.DEFAULT` is selected. >> >> ### Different regexes for the same IRNode entry >> A regex for an IR node might look different for certain compile phases. For example, we can directly match the node name `Allocate` before macro expansion for an `Allocate` node. But we are only able to match allocations again on the `PrintOptoAssembly` output (the current option) which requires a different regex. Therefore, the `IRNode` class is redesigned in the following way: >> >> - `IRNode` entries which currently represent direct regexes are replaced by "IR node placeholder strings". These strings are just encodings recognized by the IR framework as IR nodes: >> >> public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; // Normal IR node >> public static final String ALLOC_OF = COMPOSITE_PREFIX + "ALLOC_OF" + POSTFIX; // Composite IR node >> >> - For each _IR node placeholder string_, we need to define which regex should be used for which compile phase. Additionally, we need to specify the default compile phase for this IR node. 
This is done in a static block immediately following the IR node placeholder string where we add a mapping to `IRNode.IR_NODE_MAPPINGS` (by using helper methods such as `allocNodes()`): >> >> public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; >> static { >> String idealIndependentRegex = START + "Allocate" + MID + END; >> String optoRegex = "(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java" + END; >> allocNodes(ALLOC, idealIndependentRegex, optoRegex); >> } >> >> **Thus, adding a new IR node now requires two steps: Defining an IR node placeholder string and an associated regex-compile-phase-mapping in the static block immediately following the IR node placeholder string. This mapping should reflect in which compile phases the new IR node can be found.** >> >> ### Using the IRNode entries correctly >> The IR framework enforces the correct usage of the new IR node placeholder strings. It reports format violations if: >> - An `IRNode` entry is used for a compile phase which is not supported (e.g. trying to match a `LoopNode` on the output of `CompilePhase.AFTER_PARSING`). >> - An IR node entry is used without specifying a mapping in a static block (e.g. when creating a new entry and forgetting to set up the mapping). >> - Using a user-defined regex in `@IR` without specifying a non-default compile phase in `phase`. `CompilePhase.DEFAULT` only works with IR node placeholder strings. In all other cases, the user must explicitly specify one or more non-default compile phases. >> >> ## General Changes >> The patch became larger than I intended it to be. I've tried to split it into smaller patches but that was quite difficult and I eventually gave up. I therefore try to summarize the changes to make reviewing this simpler: >> >> - Added more packages to better group related classes together. 
>> - Removed unused IGV phases in `phasetype.hpp` and sorted them in the order how they are used throughout a compilation. >> - Updated existing, now failing, IR tests that use user-defined regexes. These now require a non-default phase. I've fixed that by replacing the user-defined regexes with `IRNode` entries (required some new `IRNode` entries for missing IR nodes). >> - Introduced a new interface `Matchable.java` (for all classes representing a part on which IR matching can be done such as `IRMethod` or `IRRule`) to use it throughout the code instead of concrete class references. This allows better code sharing together with the already added `MatchResult.java` interface. Using interface types also simplifies testing as everything is substitutable. Each `Matchable`/`MatchResult` object now just contains a list of `Matchable`/`MatchResult` objects (e.g. an `IRMethod`/`IRMethodResult` contains a list of `IRRule`/`IRRuleMatchResult` objects to represent all IR rules/IR rule match results of this method etc.) >> - Cleaned up and refactored a lot of code to use this new design. >> - Using visitor pattern to visit the `MatchResult` objects. This simplifies the failure message and compilation output printing of the different compile phases. It allows us to perform specific operations at each level (i.e. on a method level, an IR rule level etc.). Visitors are also used in testing. >> - New compile phase related classes in package `phase`. The idea is that each `IRRule`/`IRRuleMatchResult` class contains a list of `CompilePhaseIRRule`/`CompilePhaseIRRuleMatchResult` objects for each compile phase. When `CompilePhase.DEFAULT` is used, we create or re-use `CompilePhaseIRRule` objects which represent the specified default phases of each `IRNode` entry. >> - The `FlagVM` now collects all the needed compile phases and creates a compiler directives file for the `TestVM` accordingly. We now only print the output which is also used for IR matching. 
Before this change, we always emitted `PrintIdeal` and `PrintOptoAssembly`. Even when not using the new matching on compile phases, we still get a performance benefit when only using `IRNode` entries which match on either `PrintIdeal` or `PrintOptoAssembly`. >> - The failure message and compilation output is sorted alphabetically by method names and by the enum definition order of `CompilerPhase`. >> - Replaced implementation inheritance by interfaces. >> - Improved encapsulation of object data. >> - Updated README and many comments/class descriptions to reflect this new feature. >> - Added new IR framework tests >> >> ## Testing >> - Normal tier testing. >> - Applying the patch to Valhalla to perform tier testing. >> - Testing with a ZGC branch by @robcasloz which does IR matching on mach graph compile phases. This requirement was the original motivation for this RFE. Thanks Roberto for helping to test this! >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 83 commits: > > - Fix TestVectorConditionalMove > - Merge branch 'master' into JDK-8280378 > - Hao's patch to address review comments > - Roberto's review comments > - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/NonIRTestClass.java > > Co-authored-by: Roberto Casta?eda Lozano > - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/constraint/raw/RawConstraint.java > > Co-authored-by: Roberto Casta?eda Lozano > - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/phase/CompilePhaseIRRuleBuilder.java > > Co-authored-by: Roberto Casta?eda Lozano > - Merge branch 'master' into JDK-8280378 > - Fix missing counts indentation in failure messages > - Update comments > - ... and 73 more: https://git.openjdk.org/jdk/compare/f502ab85...ae7190c4 Marked as reviewed by rcastanedalo (Reviewer). 
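The regex-to-phase mapping described in the PR text above can be pictured with a small, self-contained model. The real `IRNode` class and its helper methods are more elaborate; the names below mirror the PR description but are otherwise illustrative, and the regex shown is only a shortened stand-in:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the IRNode placeholder-string design: a placeholder is an
// encoded string, and a static-block-style mapping records which regex to use
// for which compile phase. Not the actual test-framework code.
final class IrNodeMappingSketch {
    enum CompilePhase { DEFAULT, AFTER_PARSING, ITER_GVN1, PRINT_IDEAL, PRINT_OPTO_ASSEMBLY }

    static final String PREFIX = "_#";
    static final String POSTFIX = "#_";
    static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; // IR node placeholder string

    static final Map<String, Map<CompilePhase, String>> IR_NODE_MAPPINGS = new HashMap<>();
    static {
        // In this sketch, allocations can only be matched on the PrintOptoAssembly output.
        IR_NODE_MAPPINGS.put(ALLOC, Map.of(
                CompilePhase.PRINT_OPTO_ASSEMBLY,
                ".*precise .*call,static.*wrapper for: _new_instance_Java.*"));
    }

    // Resolving a placeholder for an unsupported phase is a format violation.
    static String regexFor(String placeholder, CompilePhase phase) {
        Map<CompilePhase, String> byPhase = IR_NODE_MAPPINGS.get(placeholder);
        if (byPhase == null || !byPhase.containsKey(phase)) {
            throw new IllegalArgumentException(placeholder + " is not supported in phase " + phase);
        }
        return byPhase.get(phase);
    }
}
```

The lookup failure mirrors the format violations the framework reports when an `IRNode` entry is used for a compile phase it does not support.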
------------- PR: https://git.openjdk.org/jdk/pull/10695 From rcastanedalo at openjdk.org Mon Oct 24 10:04:56 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 24 Oct 2022 10:04:56 GMT Subject: RFR: JDK-8294565: IGV: ClassCastException when clicking on an edge in the graph [v3] In-Reply-To: References: Message-ID: <-G_EjwQP87JOmty__hTUAsfx0txQSlP1GJ7Pbcb5o0o=.88c980a1-83b8-4acb-8fa1-09ab30639be3@github.com> On Mon, 24 Oct 2022 07:35:43 GMT, Tobias Holenstein wrote: >> IGV crashed when the user clicked on any edge in the graph because the `select` method of the `SelectProvider` in `LineWidget.java` was faulty. >> >> # Implementation >> - `ActionFactory.createSelectAction` was changed to `CustomSelectAction` since it also supports to invert the selection with `Ctrl/CMD` >> - The `select` method gets called when the user clicks on an edge. `LineWidget` represents a single connection going out of a node and connecting to one or more nodes (one-to-many). `LineWidget` has a list of `Connection`'s that each represent a single link between two nodes (one-to-one). For each `connection` we collect the from/to `Vertex` (superclass for a node) and put them into a set that we then use to select the nodes that are connected to the user-clicked `LineWidget`. > > Tobias Holenstein has updated the pull request incrementally with two additional commits since the last revision: > > - whitespace > > Co-authored-by: Andrey Turbanov > - whitespace > > Co-authored-by: Andrey Turbanov Looks good, thanks for fixing this! ------------- Marked as reviewed by rcastanedalo (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10760 From thartmann at openjdk.org Mon Oct 24 10:09:11 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 24 Oct 2022 10:09:11 GMT Subject: RFR: JDK-8294565: IGV: ClassCastException when clicking on an edge in the graph [v3] In-Reply-To: References: Message-ID: On Mon, 24 Oct 2022 07:35:43 GMT, Tobias Holenstein wrote: >> IGV crashed when the user clicked on any edge in the graph because the `select` method of the `SelectProvider` in `LineWidget.java` was faulty. >> >> # Implementation >> - `ActionFactory.createSelectAction` was changed to `CustomSelectAction` since it also supports to invert the selection with `Ctrl/CMD` >> - The `select` method gets called when the user clicks on an edge. `LineWidget` represents a single connection going out of a node and connecting to one or more nodes (one-to-many). `LineWidget` has a list of `Connection`'s that each represent a single link between two nodes (one-to-one). For each `connection` we collect the from/to `Vertex` (superclass for a node) and put them into a set that we then use to select the nodes that are connected to the user-clicked `LineWidget`. > > Tobias Holenstein has updated the pull request incrementally with two additional commits since the last revision: > > - whitespace > > Co-authored-by: Andrey Turbanov > - whitespace > > Co-authored-by: Andrey Turbanov Looks reasonable to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10760 From thartmann at openjdk.org Mon Oct 24 10:15:45 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 24 Oct 2022 10:15:45 GMT Subject: RFR: JDK-8295461: IGV: Wrong src/dest nodes highlighted for edge [v2] In-Reply-To: References: <1ZYB8cu42bWNWIv0w8B69kHk-Vva8FiakNV06HUxM7k=.855fa4ed-c31e-4899-8f4a-4eea3b3c010e@github.com> Message-ID: On Mon, 24 Oct 2022 07:22:14 GMT, Tobias Holenstein wrote: >> Outgoing edges in IGV are organized like trees. 
When hovering with the mouse over a segment of an edge, the edge is highlighted all the way up to the source node, as well as all the way down to the leaf nodes. This works as expected. The nodes at the source and leaves of the highlighted segments are highlighted as well. There is a bug that instead of only highlighting the leaf nodes in the subtree, IGV highlights all leaf nodes: >> ![before](https://user-images.githubusercontent.com/71546117/197223077-962a5d97-c1c8-4720-9983-295e07468be9.png) >> >> # Solution >> The segments in the edge tree are `LineWidget` objects. Each `LineWidget` has a single `LineWidget` `predecessor` and a list of `LineWidget` `successors`. When hovering over a line segment, the `notifyStateChanged` function is called in the corresponding `LineWidget`: `predecessor` and `successors` `LineWidget` are recursively visited and highlighted here. The nodes (of super type `Vertex`) are now highlighted with a new `highlightVertices` function instead of using recursion. >> `highlightVertices(boolean enable)` uses the list of `connections` which already contains all the one-to-one connections between src/dest nodes that go through a single `LineWidget` segment. This gives us the `Vertex` nodes of the root as well as the leaves in the subtree of the hovered `LineWidget` segment. >> >> Now the highlighting of the leaf nodes works as expected: >> ![ex1](https://user-images.githubusercontent.com/71546117/197225776-eb7cba50-6f6a-4f47-a91f-3b793021fdae.png) >> ![ex2](https://user-images.githubusercontent.com/71546117/197225789-d9510f36-d89a-44cd-ab69-1341edcafdfb.png) > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > whitespace > > Co-authored-by: Andrey Turbanov Works well and looks good. ------------- Marked as reviewed by thartmann (Reviewer). 
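The `highlightVertices` idea described above can be sketched with a stripped-down model; the `Connection` record and the string vertex ids below stand in for the real IGV `Connection`/`Vertex` types and are not the actual code:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of highlightVertices(boolean): since a LineWidget already
// keeps the one-to-one connections routed through it, collecting the from/to
// endpoints of each connection yields exactly the root vertex plus the leaf
// vertices of the hovered subtree, with no recursion needed.
final class HighlightSketch {
    record Connection(String from, String to) { }

    static Set<String> verticesToHighlight(List<Connection> connections) {
        Set<String> vertices = new HashSet<>();
        for (Connection c : connections) {
            vertices.add(c.from());
            vertices.add(c.to());
        }
        return vertices;
    }
}
```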
PR: https://git.openjdk.org/jdk/pull/10815 From adinn at openjdk.org Mon Oct 24 10:15:48 2022 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Oct 2022 10:15:48 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand In-Reply-To: References: Message-ID: On Mon, 3 Oct 2022 14:00:51 GMT, Stuart Monteith wrote: > The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. There are no scalar instructions that perform the equivalent functions on aarch64; instead, the intrinsics can be implemented with vector instructions included in SVE2: expand with BDEP, compress with BEXT. > > Only the first lane of each vector will be used, two MOV instructions will move the inputs from GPRs into temporary vector registers, and another to do the reverse for the result. Autovectorization for this functionality is/will be implemented separately. > > Running on an SVE2 enabled system, I ran the following benchmarks: > > org.openjdk.bench.java.lang.Integers > org.openjdk.bench.java.lang.Longs > > The time for each operation was reduced to 56% to 72% of the original run time: > > > Benchmark Result error Unit % against non-SVE2 > Integers.expand 2.106 0.011 us/op > Integers.expand-SVE 1.431 0.009 us/op 67.95% > Longs.expand 2.606 0.006 us/op > Longs.expand-SVE 1.46 0.003 us/op 56.02% > Integers.compress 1.982 0.004 us/op > Integers.compress-SVE 1.427 0.003 us/op 72.00% > Longs.compress 2.501 0.002 us/op > Longs.compress-SVE 1.441 0.003 us/op 57.62% > > > These methods can be specifically tested with: > `make test TEST="jtreg:compiler/intrinsics/TestBitShuffleOpers.java"` Looks good. ------------- Marked as reviewed by adinn (Reviewer). 
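For reference, the semantics being intrinsified can be expressed as a scalar sketch. This mirrors what `Integer.compress`/`Integer.expand` compute (and what BEXT/BDEP perform per-lane in hardware); it is a reference model, not the JDK implementation:

```java
// Scalar reference of the bit compress/expand semantics: compress gathers the
// bits of i selected by mask into the low end of the result; expand scatters
// the low bits of i back into the positions where mask has set bits.
final class BitShuffleSketch {
    static int compress(int i, int mask) {
        int result = 0;
        int out = 0;
        for (int bit = 0; bit < 32; bit++) {
            if ((mask & (1 << bit)) != 0) {
                if ((i & (1 << bit)) != 0) {
                    result |= 1 << out;
                }
                out++;
            }
        }
        return result;
    }

    static int expand(int i, int mask) {
        int result = 0;
        int in = 0;
        for (int bit = 0; bit < 32; bit++) {
            if ((mask & (1 << bit)) != 0) {
                if ((i & (1 << in)) != 0) {
                    result |= 1 << bit;
                }
                in++;
            }
        }
        return result;
    }
}
```

compress and expand are inverses on the masked bits, which is a convenient property for cross-checking the intrinsic against the pure-Java path.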
PR: https://git.openjdk.org/jdk/pull/10537 From thartmann at openjdk.org Mon Oct 24 10:40:47 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 24 Oct 2022 10:40:47 GMT Subject: RFR: JDK-8290010: IGV: Fix UndoRedo Action In-Reply-To: References: Message-ID: On Fri, 21 Oct 2022 11:53:31 GMT, Tobias Holenstein wrote: > # Problem > - The history of the Undo Action did not start at zero when opening a new graph. > - In the "Show sea of nodes" view you could press Undo once before doing any changes. > - In the "Show control view graph" you could even go back many times when opening a new graph, ending up in weird states of the graph. > - Selecting a node should not be added to the history. You could select X different nodes in a row and then go back X times in history without anything changing. > > # Overview UndoRedo > Different `ChangedEvent` events are fired in `DiagramViewModel` that trigger the `DiagramScene` to update its state. After those updates `DiagramScene` calls `addUndo()` to save the new state of the `DiagramViewModel` as well as the scrolling position of the `DiagramScene` in a `DiagramUndoRedo` object. `undo()` / `redo()` can then be used to restore the saved state. > > One problem was that selecting nodes triggered a `ChangedEvent` event that caused a recording to the history but the selection itself was not part of the saved state. Another problem was that switching between different views (CFG view, sea-of-nodes view) as well as applying filters also triggered an `addUndo()` - The recording of filters and view switches is however not implemented. > > # Solution > We now only record when the graph itself changes (not the view) - like when opening a new graph or creating a difference graph. And we also record changes in the selection of the visible nodes. 
> > We refactored the `ChangedEvent` events in `DiagramViewModel`: > - `graphChangedEvent` now is fired when the graph changes and triggers an `addUndo()` > - `hiddenNodesChangedEvent` is now fired when the selection of visible nodes changes and also triggers an `addUndo()` > > `DiagramUndoRedo` previously stored a deep copy of the `DiagramViewModel` for every datapoint in the history. Besides using a lot of memory, most of the stored objects were redundant or not used. Now, we introduced a new `ModelState` object that stores the id of the visible nodes in `Set hiddenNodes` as well as the opened graph. For the graph we only need `int firstPos` and `int secondPos` which indicate the position of the difference graph within the group. If we don't use a difference graph `firstPos` == `secondPos`. I gave this a quick test and clicking on a different phase and then reverting back and forth to the previous phase via the undo/redo buttons does not work anymore after doing it twice. ------------- Changes requested by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10813 From thartmann at openjdk.org Mon Oct 24 10:44:54 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 24 Oct 2022 10:44:54 GMT Subject: RFR: JDK-8290010: IGV: Fix UndoRedo Action In-Reply-To: References: Message-ID: On Fri, 21 Oct 2022 11:53:31 GMT, Tobias Holenstein wrote: > # Problem > - The history of the Undo Action did not start at zero when opening a new graph. > - In the "Show sea of nodes" view you could press Undo once before doing any changes. > - In the "Show control view graph" you could even go back many times when opening a new graph, ending up in weird states of the graph. > - Selecting a node should not be added to the history. You could select X different nodes in a row and then go back X times in history without anything changing. 
> > # Overview UndoRedo > Different `ChangedEvent` events are fired in `DiagramViewModel` that trigger the `DiagramScene` to update its state. After those updates `DiagramScene` calls `addUndo()` to save the new state of the `DiagramViewModel` as well as the scrolling position of the `DiagramScene` in a `DiagramUndoRedo` object. `undo()` / `redo()` can then be used to restore the saved state. > > One problem was that selecting nodes triggered a `ChangedEvent` event that caused a recording to the history but the selection itself was not part of the saved state. Another problem was that switching between different views (CFG view, sea-of-nodes view) as well as applying filters also triggered an `addUndo()` - The recording of filters and view switches is however not implemented. > > # Solution > We now only record when the graph itself changes (not the view) - like when opening a new graph or creating a difference graph. And we also record changes in the selection of the visible nodes. > > We refactored the `ChangedEvent` events in `DiagramViewModel`: > - `graphChangedEvent` now is fired when the graph changes and triggers an `addUndo()` > - `hiddenNodesChangedEvent` is now fired when the selection of visible nodes changes and also triggers an `addUndo()` > > `DiagramUndoRedo` previously stored a deep copy of the `DiagramViewModel` for every datapoint in the history. Besides using a lot of memory, most of the stored objects were redundant or not used. Now, we introduced a new `ModelState` object that stores the id of the visible nodes in `Set hiddenNodes` as well as the opened graph. For the graph we only need `int firstPos` and `int secondPos` which indicate the position of the difference graph within the group. If we don't use a difference graph `firstPos` == `secondPos`. In general, it seems that when going backwards (undo) multiple steps, going forwards (redo) does not work anymore after one step forward. ------------- Changes requested by thartmann (Reviewer). 
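The `ModelState` described in the PR text above can be sketched as follows. The field names follow the PR description, but the actual IGV class differs in detail, so treat this as an illustration of the design choice, not the real code:

```java
import java.util.Set;

// Lightweight undo/redo snapshot: instead of deep-copying the whole
// DiagramViewModel, only the hidden-node ids and the graph's position within
// the group are recorded. firstPos == secondPos means no difference graph.
final class ModelState {
    final Set<Integer> hiddenNodes;
    final int firstPos;
    final int secondPos;

    ModelState(Set<Integer> hiddenNodes, int firstPos, int secondPos) {
        this.hiddenNodes = Set.copyOf(hiddenNodes); // immutable snapshot
        this.firstPos = firstPos;
        this.secondPos = secondPos;
    }

    boolean isDifferenceGraph() {
        return firstPos != secondPos;
    }
}
```

Storing only ids and positions keeps each history entry small, which is what makes recording a state per graph/selection change affordable compared to the old deep-copy approach.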
PR: https://git.openjdk.org/jdk/pull/10813 From tholenstein at openjdk.org Mon Oct 24 11:58:51 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 24 Oct 2022 11:58:51 GMT Subject: RFR: JDK-8290010: IGV: Fix UndoRedo Action [v2] In-Reply-To: References: Message-ID: > # Problem > - The history of the Undo Action did not start at zero when opening a new graph. > - In the "Show sea of nodes" view you could press Undo once before doing any changes. > - In the "Show control view graph" you could even go back many times when opening a new graph, ending up in weird states of the graph. > - Selecting a node should not be added to the history. You could select X different nodes in a row and then go back X times in history without anything changing. > > # Overview UndoRedo > Different `ChangedEvent` events are fired in `DiagramViewModel` that trigger the `DiagramScene` to update its state. After those updates `DiagramScene` calls `addUndo()` to save the new state of the `DiagramViewModel` as well as the scrolling position of the `DiagramScene` in a `DiagramUndoRedo` object. `undo()` / `redo()` can then be used to restore the saved state. > > One problem was that selecting nodes triggered a `ChangedEvent` event that caused a recording to the history but the selection itself was not part of the saved state. Another problem was that switching between different views (CFG view, sea-of-nodes view) as well as applying filters also triggered an `addUndo()` - The recording of filters and view switches is however not implemented. > > # Solution > We now only record when the graph itself changes (not the view) - like when opening a new graph or creating a difference graph. And we also record changes in the selection of the visible nodes. 
> > We refactored the `ChangedEvent` events in `DiagramViewModel`: > - `graphChangedEvent` now is fired when the graph changes and triggers an `addUndo()` > - `hiddenNodesChangedEvent` is now fired when the selection of visible nodes changes and also triggers an `addUndo()` > > `DiagramUndoRedo` previously stored a deep copy of the `DiagramViewModel` for every datapoint in the history. Besides using a lot of memory, most of the stored objects were redundant or not used. Now, we introduced a new `ModelState` object that stores the id of the visible nodes in `Set hiddenNodes` as well as the opened graph. For the graph we only need `int firstPos` and `int secondPos` which indicate the position of the difference graph within the group. If we don't use a difference graph `firstPos` == `secondPos`. Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: disable undoRedo during execution of an undo/redo ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10813/files - new: https://git.openjdk.org/jdk/pull/10813/files/a4667b06..4a39b17a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10813&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10813&range=00-01 Stats: 10 lines in 1 file changed: 8 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10813.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10813/head:pull/10813 PR: https://git.openjdk.org/jdk/pull/10813 From thartmann at openjdk.org Mon Oct 24 12:11:04 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 24 Oct 2022 12:11:04 GMT Subject: RFR: 8288107: Auto-vectorization for integer min/max [v2] In-Reply-To: References: Message-ID: On Fri, 15 Jul 2022 11:12:45 GMT, Bhavana Kilambi wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> 8288107: Auto-vectorization for integer min/max >> >> When Math.min/max is invoked on integer arrays, it 
generates the CMP-CMOVE instructions instead of vectorizing the loop(if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated. >> A test to assess the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max operations is available only in SSE4 (where pmaxsd/pminsd are generated) and AVX version >= 1 (where vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized. In cases where the loop is not vectorizable or when the Math.max/min operations are called outside of the loop, cmp-cmove instructions are generated (tested on aarch64, x86-64 machines which have cmp-cmove instructions defined for the scalar MaxI/MinI nodes). Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below : >> >> Before this patch: >> aarch64: >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 1593.510 ? 1.488 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 1593.123 ? 1.365 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1593.112 ? 0.985 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1593.290 ? 1.219 ns/op >> >> x86-64: >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 2084.717 ? 4.780 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 2087.322 ? 4.158 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2084.568 ? 4.838 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2086.595 ? 
4.025 ns/op >> >> After this patch: >> aarch64: >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 323.911 ? 0.206 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 324.084 ? 0.231 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 323.892 ? 0.234 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 323.990 ? 0.295 ns/op >> >> x86-64: >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 387.639 ? 0.512 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 387.999 ? 0.740 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 387.605 ? 0.376 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 387.765 ? 0.498 ns/op >> >> With auto-vectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without this patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. The performance numbers are shown below : >> aarch64: >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 1449.792 ? 1.072 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 1450.636 ? 1.057 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1450.214 ? 1.093 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1450.615 ? 1.098 ns/op >> >> x86-64: >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 2059.673 ? 4.726 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 2059.853 ? 4.754 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2059.920 ? 4.658 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2059.622 ? 4.768 ns/op >> There is no degradation when vectorization is disabled. 
> Added a new commit with the code related to MaxINode::Ideal tests stripped off and only retained the code related to generating MinI/MaxI nodes for Math.min/max intrinsics. This introduced a regression: [JDK-8294816](https://bugs.openjdk.org/browse/JDK-8294816) @Bhavana-Kilambi, could you please have a look? ------------- PR: https://git.openjdk.org/jdk/pull/9466 From tholenstein at openjdk.org Mon Oct 24 12:31:58 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 24 Oct 2022 12:31:58 GMT Subject: RFR: JDK-8290010: IGV: Fix UndoRedo Action [v2] In-Reply-To: References: Message-ID: On Mon, 24 Oct 2022 10:40:55 GMT, Tobias Hartmann wrote: > In general, it seems that when going backwards (undo) multiple steps, going forwards (redo) does not work anymore after one step forward. Hi @TobiHartmann Thanks for trying it out and reporting. It should be fixed now! The problem was that during `undo()`/ `redo()` we recorded new states to the history with `addUndo()`. The fix is to disable `addUndo()` while performing `undo()`/ `redo()`. When we now call `redo()` we should have the following behavior: 1. (1)->(2)->**(3)** → _undo (2), redo (-)_ - we select _undo (2)_ 2. (1)->**(2)** → _undo (1), redo (3)_ - if we now redo (3) we end up in 1. - if we change the selection or the graph we override state (3) in the history and end up in 3. 3. (1)->(2)->**(4)** → _undo (2), redo (-)_ - we select _undo (2)_ 4. 
(1)->**(2)** → _undo (1), redo (4)_ - state (3) cannot be reached anymore because we rewrote the history (Nr) represents a saved state, -> is a recording with `addUndo()`, right of → are the possible actions at the **bold** state ------------- PR: https://git.openjdk.org/jdk/pull/10813 From bkilambi at openjdk.org Mon Oct 24 13:18:01 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 24 Oct 2022 13:18:01 GMT Subject: RFR: 8288107: Auto-vectorization for integer min/max [v2] In-Reply-To: References: Message-ID: On Fri, 15 Jul 2022 10:44:52 GMT, Bhavana Kilambi wrote: >> When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop(if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated. >> A test to assess the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max operations is available only in SSE4 (where pmaxsd/pminsd are generated) and AVX version >= 1 (where vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized. In cases where the loop is not vectorizable or when the Math.max/min operations are called outside of the loop, cmp-cmove instructions are generated (tested on aarch64, x86-64 machines which have cmp-cmove instructions defined for the scalar MaxI/MinI nodes). Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below : >> >>

Before this patch >> >> **aarch64:** >> ``` >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 1593.510 ± 1.488 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 1593.123 ± 1.365 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1593.112 ± 0.985 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1593.290 ± 1.219 ns/op >> >> >> **x86-64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 2084.717 ± 4.780 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 2087.322 ± 4.158 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2084.568 ± 4.838 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2086.595 ± 4.025 ns/op >> >>
>> >>
After this patch >> >> **aarch64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 323.911 ± 0.206 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 324.084 ± 0.231 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 323.892 ± 0.234 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 323.990 ± 0.295 ns/op >> >> >> **x86-64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 387.639 ± 0.512 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 387.999 ± 0.740 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 387.605 ± 0.376 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 387.765 ± 0.498 ns/op >> >> >>
>> >> With auto-vectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without this patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. >> >>
Performance numbers >> >> **aarch64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 1449.792 ± 1.072 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 1450.636 ± 1.057 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1450.214 ± 1.093 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1450.615 ± 1.098 ns/op >> >> >> **x86-64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 2059.673 ± 4.726 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 2059.853 ± 4.754 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2059.920 ± 4.658 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2059.622 ± 4.768 ns/op >> >>
>> >> There is no degradation when vectorization is disabled. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > 8288107: Auto-vectorization for integer min/max > > When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop(if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated. > A test to assess the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max operations is available only in SSE4 (where pmaxsd/pminsd are generated) and AVX version >= 1 (where vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized. In cases where the loop is not vectorizable or when the Math.max/min operations are called outside of the loop, cmp-cmove instructions are generated (tested on aarch64, x86-64 machines which have cmp-cmove instructions defined for the scalar MaxI/MinI nodes). Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below : > > Before this patch: > aarch64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 1593.510 ? 1.488 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 1593.123 ? 1.365 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1593.112 ? 0.985 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1593.290 ? 1.219 ns/op > > x86-64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 2084.717 ? 4.780 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 2087.322 ? 
4.158 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2084.568 ? 4.838 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2086.595 ? 4.025 ns/op > > After this patch: > aarch64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 323.911 ? 0.206 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 324.084 ? 0.231 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 323.892 ? 0.234 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 323.990 ? 0.295 ns/op > > x86-64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 387.639 ? 0.512 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 387.999 ? 0.740 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 387.605 ? 0.376 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 387.765 ? 0.498 ns/op > > With auto-vectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without this patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. The performance numbers are shown below : > aarch64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 1449.792 ? 1.072 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 1450.636 ? 1.057 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1450.214 ? 1.093 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1450.615 ? 1.098 ns/op > > x86-64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 2059.673 ? 4.726 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 2059.853 ? 4.754 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2059.920 ? 4.658 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2059.622 ? 4.768 ns/op > There is no degradation when vectorization is disabled. Hi, sure will look into this .. 
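For readers following the thread, the loop shape this patch targets can be sketched as below. This is an illustrative example only (not code from the patch or from the VectorIntMinMax.java benchmark): with `Math.max` parsed into a MaxI node instead of a CmpI/CMoveI pair, C2's SuperWord pass can turn the loop body into vector smax (aarch64) or pmaxsd/vpmaxsd (x86-64) instructions.

```java
// Illustrative sketch -- not code from the patch or the JMH benchmark.
// The element-wise min/max loop shape that becomes auto-vectorizable
// once Math.max compiles to a MaxI node instead of CmpI/CMoveI.
class MinMaxLoop {
    static void elementwiseMax(int[] a, int[] b, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = Math.max(a[i], b[i]);   // parsed as a MaxI node with the patch
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 5, 3, 9};
        int[] b = {4, 2, 8, 7};
        int[] out = new int[4];
        elementwiseMax(a, b, out);
        long sum = 0;
        for (int v : out) {
            sum += v;                        // 4 + 5 + 8 + 9
        }
        System.out.println(sum);
    }
}
```

In a real run the arrays would be large enough (e.g. the benchmark's length of 2048) for the superword transformation to pay off; the tiny arrays here only demonstrate the semantics.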
------------- PR: https://git.openjdk.org/jdk/pull/9466 From thartmann at openjdk.org Mon Oct 24 13:20:50 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 24 Oct 2022 13:20:50 GMT Subject: RFR: JDK-8290010: IGV: Fix UndoRedo Action [v2] In-Reply-To: References: Message-ID: On Mon, 24 Oct 2022 11:58:51 GMT, Tobias Holenstein wrote: >> # Problem >> - The history of the Undo Action did not start at zero when opening a new graph. >> - In the "Show sea of nodes" view you could press Undo once before doing any changes. >> - In the "Show control view graph" you could even go back many times when opening a new graph, ending up in a weird state of the graph. >> - Selecting a node should not be added to the history. You could select X different nodes in a row and then go back X times in history without anything changing. >> >> # Overview UndoRedo >> Different `ChangedEvent` events are fired in `DiagramViewModel` that trigger the `DiagramScene` to update its state. After those updates `DiagramScene` calls `addUndo()` to save the new state of the `DiagramViewModel` as well as the scrolling position of the `DiagramScene` in a `DiagramUndoRedo` object. `undo()` / `redo()` can then be used to restore the saved state. >> >> One problem was that selecting nodes triggered a `ChangedEvent` event that caused a recording to the history but the selection itself was not part of the saved state. Another problem was that switching between different views (CFG view, sea-of-nodes view) as well as applying filters also triggered an `addUndo()` - The recording of filters and view switches is however not implemented. >> >> # Solution >> We now only record when the graph itself changes (not the view) - like when opening a new graph or creating a difference graph. And we also record changes in the selection of the visible nodes. 
>> >> We refactored the `ChangedEvent` events in `DiagramViewModel`: >> - `graphChangedEvent` now is fired when the graph changes and triggers an `addUndo()` >> - `hiddenNodesChangedEvent` is now fired when the selection of visible nodes changes and also triggers an `addUndo()` >> >> `DiagramUndoRedo` previously stored a deep copy of the `DiagramViewModel` for every datapoint in the history. Besides using a lot of memory, most of the stored objects were redundant or not used. Now, we introduced a new `ModelState` object that stores the id of the visible nodes in `Set hiddenNodes` as well as the opened graph. For the graph we only need `int firstPos` and `int secondPos` which indicate the position of the difference graph within the group. If we don't use a difference graph `firstPos` == `secondPos`. > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > disable undoRedo during execution of an undo/redo Thanks, the new version works like a charm. Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10813 From thartmann at openjdk.org Mon Oct 24 13:28:01 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 24 Oct 2022 13:28:01 GMT Subject: RFR: 8288107: Auto-vectorization for integer min/max [v2] In-Reply-To: References: Message-ID: <00xVL_xq6POBib-8x5oB0GPkE57-7hNtNL2rOCXCdPE=.d891ffa2-9409-4089-ac16-15fdca2964bb@github.com> On Fri, 15 Jul 2022 10:44:52 GMT, Bhavana Kilambi wrote: >> When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop(if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated. 
>> A test to assess the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max operations is available only in SSE4 (where pmaxsd/pminsd are generated) and AVX version >= 1 (where vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized. In cases where the loop is not vectorizable or when the Math.max/min operations are called outside of the loop, cmp-cmove instructions are generated (tested on aarch64, x86-64 machines which have cmp-cmove instructions defined for the scalar MaxI/MinI nodes). Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below : >> >>
Before this patch >> >> **aarch64:** >> ``` >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 1593.510 ± 1.488 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 1593.123 ± 1.365 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1593.112 ± 0.985 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1593.290 ± 1.219 ns/op >> >> >> **x86-64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 2084.717 ± 4.780 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 2087.322 ± 4.158 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2084.568 ± 4.838 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2086.595 ± 4.025 ns/op >> >>
>> >>
After this patch >> >> **aarch64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 323.911 ± 0.206 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 324.084 ± 0.231 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 323.892 ± 0.234 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 323.990 ± 0.295 ns/op >> >> >> **x86-64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 387.639 ± 0.512 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 387.999 ± 0.740 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 387.605 ± 0.376 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 387.765 ± 0.498 ns/op >> >> >>
>> >> With auto-vectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without this patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. >> >>
Performance numbers >> >> **aarch64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 1449.792 ± 1.072 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 1450.636 ± 1.057 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1450.214 ± 1.093 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1450.615 ± 1.098 ns/op >> >> >> **x86-64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 2059.673 ± 4.726 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 2059.853 ± 4.754 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2059.920 ± 4.658 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2059.622 ± 4.768 ns/op >> >>
>> >> There is no degradation when vectorization is disabled. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > 8288107: Auto-vectorization for integer min/max > > When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop(if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated. > A test to assess the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max operations is available only in SSE4 (where pmaxsd/pminsd are generated) and AVX version >= 1 (where vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized. In cases where the loop is not vectorizable or when the Math.max/min operations are called outside of the loop, cmp-cmove instructions are generated (tested on aarch64, x86-64 machines which have cmp-cmove instructions defined for the scalar MaxI/MinI nodes). Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below : > > Before this patch: > aarch64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 1593.510 ? 1.488 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 1593.123 ? 1.365 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1593.112 ? 0.985 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1593.290 ? 1.219 ns/op > > x86-64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 2084.717 ? 4.780 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 2087.322 ? 
4.158 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2084.568 ? 4.838 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2086.595 ? 4.025 ns/op > > After this patch: > aarch64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 323.911 ? 0.206 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 324.084 ? 0.231 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 323.892 ? 0.234 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 323.990 ? 0.295 ns/op > > x86-64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 387.639 ? 0.512 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 387.999 ? 0.740 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 387.605 ? 0.376 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 387.765 ? 0.498 ns/op > > With auto-vectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without this patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. The performance numbers are shown below : > aarch64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 1449.792 ? 1.072 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 1450.636 ? 1.057 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1450.214 ? 1.093 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1450.615 ? 1.098 ns/op > > x86-64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 2059.673 ? 4.726 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 2059.853 ? 4.754 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2059.920 ? 4.658 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2059.622 ? 4.768 ns/op > There is no degradation when vectorization is disabled. Thank you! 
------------- PR: https://git.openjdk.org/jdk/pull/9466 From duke at openjdk.org Mon Oct 24 13:41:12 2022 From: duke at openjdk.org (SuperCoder79) Date: Mon, 24 Oct 2022 13:41:12 GMT Subject: RFR: 8291336: Add ideal rule to convert floating point multiply by 2 into addition [v5] In-Reply-To: References: Message-ID: > Hello, > I would like to propose an ideal transform that converts floating point multiply by 2 (`x * 2`) into an addition operation instead. This would allow for the elimination of the memory reference for the constant two, and keep the whole operation inside registers. My justifications for this optimization include: > * As per [Agner Fog's instruction tables](https://www.agner.org/optimize/instruction_tables.pdf) many older systems, such as the sandy bridge and ivy bridge architectures, have different latencies for addition and multiplication meaning this change could have beneficial effects when in hot code. > * The removal of the memory load would have a beneficial effect in cache bound situations. > * Multiplication by 2 is relatively common construct so this change can apply to a wide range of Java code. > > As this is my first time looking into the c2 codebase, I have a few lingering questions about my implementation and how certain parts of the compiler work. Mainly, is this patch getting the type of the operands correctly? I saw some cases where code used `bottom_type()` and other cases where it used `phase->type(value)`. Similarly, are nodes able to be reused as is being done in the AddNode constructors? I saw some places where the clone method was being used, but other places where it wasn't. > > I have attached an IR test and a jmh benchmark. Tier 1 testing passes on my machine. 
> > Thanks for your time, > Jasmine SuperCoder79 has updated the pull request incrementally with one additional commit since the last revision: Apply changes from code review - Added interpreter assert ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9642/files - new: https://git.openjdk.org/jdk/pull/9642/files/d4303fad..674124d3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9642&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9642&range=03-04 Stats: 9 lines in 2 files changed: 5 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/9642.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9642/head:pull/9642 PR: https://git.openjdk.org/jdk/pull/9642 From rcastanedalo at openjdk.org Mon Oct 24 13:58:47 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 24 Oct 2022 13:58:47 GMT Subject: RFR: JDK-8295461: IGV: Wrong src/dest nodes highlighted for edge [v2] In-Reply-To: References: <1ZYB8cu42bWNWIv0w8B69kHk-Vva8FiakNV06HUxM7k=.855fa4ed-c31e-4899-8f4a-4eea3b3c010e@github.com> Message-ID: On Mon, 24 Oct 2022 07:22:14 GMT, Tobias Holenstein wrote: >> Outgoing edges in IGV are organized like trees. When hovering with the mouse over a segment of an edge, the edge is highlighted all the way up the the source node, as well as all the way down to the leave nodes. This works as expected. The nodes at the source and leaves of the highlighted segments are highlighted as well. There is a bug that instead of only highlighting the leave nodes in the subtree, IGV highlights all leave nodes: >> ![before](https://user-images.githubusercontent.com/71546117/197223077-962a5d97-c1c8-4720-9983-295e07468be9.png) >> >> # Solution >> The segments in the edge tree are `LineWidget` objects. Each `LineWidget` has a single `LineWidget` `predecessor` and list of `LineWidget` `successors`. 
When hovering over a line segment the `notifyStateChanged` function is called in the corresponding `LineWidget` : `predecessor` and `successors` `LineWidget` are recursively visited and highlighted here. The nodes (of super type `Vertex`) are now highlighted with a new `highlightVertices` function instead of using recursion. >> `highlightVertices(boolean enable)` uses the list of `connections` which already contains all the one-to-one connections between src/dest nodes that go through a single `LineWidget` segment. This gives us the `Vertex` nodes of the root as well as the leaves in the subtree of the hovered `LineWidget` segment. >> >> Now the highlighting of the leaf nodes works as expected: >> ![ex1](https://user-images.githubusercontent.com/71546117/197225776-eb7cba50-6f6a-4f47-a91f-3b793021fdae.png) >> ![ex2](https://user-images.githubusercontent.com/71546117/197225789-d9510f36-d89a-44cd-ab69-1341edcafdfb.png) > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > whitespace > > Co-authored-by: Andrey Turbanov Looks good! ------------- Marked as reviewed by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/10815 From tholenstein at openjdk.org Mon Oct 24 14:13:50 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 24 Oct 2022 14:13:50 GMT Subject: RFR: JDK-8295461: IGV: Wrong src/dest nodes highlighted for edge [v2] In-Reply-To: References: <1ZYB8cu42bWNWIv0w8B69kHk-Vva8FiakNV06HUxM7k=.855fa4ed-c31e-4899-8f4a-4eea3b3c010e@github.com> Message-ID: On Mon, 24 Oct 2022 07:22:14 GMT, Tobias Holenstein wrote: >> Outgoing edges in IGV are organized like trees. When hovering with the mouse over a segment of an edge, the edge is highlighted all the way up to the source node, as well as all the way down to the leaf nodes. This works as expected. The nodes at the source and leaves of the highlighted segments are highlighted as well. 
There is a bug that instead of only highlighting the leaf nodes in the subtree, IGV highlights all leaf nodes: >> ![before](https://user-images.githubusercontent.com/71546117/197223077-962a5d97-c1c8-4720-9983-295e07468be9.png) >> >> # Solution >> The segments in the edge tree are `LineWidget` objects. Each `LineWidget` has a single `LineWidget` `predecessor` and list of `LineWidget` `successors`. When hovering over a line segment the `notifyStateChanged` function is called in the corresponding `LineWidget` : `predecessor` and `successors` `LineWidget` are recursively visited and highlighted here. The nodes (of super type `Vertex`) are now highlighted with a new `highlightVertices` function instead of using recursion. >> `highlightVertices(boolean enable)` uses the list of `connections` which already contains all the one-to-one connections between src/dest nodes that go through a single `LineWidget` segment. This gives us the `Vertex` nodes of the root as well as the leaves in the subtree of the hovered `LineWidget` segment. >> >> Now the highlighting of the leaf nodes works as expected: >> ![ex1](https://user-images.githubusercontent.com/71546117/197225776-eb7cba50-6f6a-4f47-a91f-3b793021fdae.png) >> ![ex2](https://user-images.githubusercontent.com/71546117/197225789-d9510f36-d89a-44cd-ab69-1341edcafdfb.png) > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > whitespace > > Co-authored-by: Andrey Turbanov Thanks @turbanoff, @TobiHartmann and @robcasloz for the reviews! 
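The connection-based lookup described in the message above can be sketched roughly as follows. The class and method names here are simplified placeholders, not IGV's actual `LineWidget`/`Vertex` code: the point is only that the connections routed through the hovered segment already pair each source with its reachable destinations, so no tree recursion is needed to find the endpoints to highlight.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified sketch of the highlightVertices idea (names are illustrative,
// not IGV's actual classes). Each Connection is a one-to-one src/dest link
// that passes through the hovered line segment.
class HighlightSketch {
    record Connection(String from, String to) {}

    static Set<String> verticesToHighlight(List<Connection> connections) {
        Set<String> vertices = new HashSet<>();
        for (Connection c : connections) {
            vertices.add(c.from());  // the common source node
            vertices.add(c.to());    // only leaves reachable through this segment
        }
        return vertices;
    }

    public static void main(String[] args) {
        // Source node A fans out to B, C and D, but the hovered segment only
        // carries the connections A->B and A->C, so D is not highlighted.
        List<Connection> throughSegment =
                List.of(new Connection("A", "B"), new Connection("A", "C"));
        List<String> sorted = new ArrayList<>(verticesToHighlight(throughSegment));
        Collections.sort(sorted);
        System.out.println(sorted);
    }
}
```

This directly yields the behavior the fix wants: the source plus the leaves in the hovered subtree, rather than every leaf of the whole edge tree.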
------------- PR: https://git.openjdk.org/jdk/pull/10815 From tholenstein at openjdk.org Mon Oct 24 14:16:57 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 24 Oct 2022 14:16:57 GMT Subject: Integrated: JDK-8295461: IGV: Wrong src/dest nodes highlighted for edge In-Reply-To: <1ZYB8cu42bWNWIv0w8B69kHk-Vva8FiakNV06HUxM7k=.855fa4ed-c31e-4899-8f4a-4eea3b3c010e@github.com> References: <1ZYB8cu42bWNWIv0w8B69kHk-Vva8FiakNV06HUxM7k=.855fa4ed-c31e-4899-8f4a-4eea3b3c010e@github.com> Message-ID: <48YNkjLo-B0w2CAmnDEqv7UTC0XwUHNofNBGDD61r-o=.279b6fa3-61ab-4bc3-929a-2deb28b0162a@github.com> On Fri, 21 Oct 2022 13:53:51 GMT, Tobias Holenstein wrote: > Outgoing edges in IGV are organized like trees. When hovering with the mouse over a segment of an edge, the edge is highlighted all the way up the the source node, as well as all the way down to the leave nodes. This works as expected. The nodes at the source and leaves of the highlighted segments are highlighted as well. There is a bug that instead of only highlighting the leave nodes in the subtree, IGV highlights all leave nodes: > ![before](https://user-images.githubusercontent.com/71546117/197223077-962a5d97-c1c8-4720-9983-295e07468be9.png) > > # Solution > The segments in the edge tree are `LineWidget` objects. Each `LineWidget` has a single `LineWidget` `predecessor` and list of `LineWidget` `successors`. When hovering over a line segment the `notifyStateChanged` function is called in the corresponding `LineWidget` : `predecessor` and `successors` `LineWidget` are recursively visited and highlighted here. The nodes (of super type `Vertex`) are new highlighted with a new `highlightVertices` function instead of using recursion. > `highlightVertices(boolean enable)` uses the list of `connections` which already contains all the one-to-one connections between src/dest nodes that go through a single `LineWidget` segment. 
This gives us the `Vertex` nodes of the root as well as the leaves in the subtree of the hovered `LineWidget` segment. > > Now the highlighting of the leaf nodes works as expected: > ![ex1](https://user-images.githubusercontent.com/71546117/197225776-eb7cba50-6f6a-4f47-a91f-3b793021fdae.png) > ![ex2](https://user-images.githubusercontent.com/71546117/197225789-d9510f36-d89a-44cd-ab69-1341edcafdfb.png) This pull request has now been integrated. Changeset: 38983857 Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/38983857883eb1b8948cb7645e77ecc97c4e4dd5 Stats: 58 lines in 1 file changed: 25 ins; 21 del; 12 mod 8295461: IGV: Wrong src/dest nodes highlighted for edge Reviewed-by: thartmann, rcastanedalo ------------- PR: https://git.openjdk.org/jdk/pull/10815 From tholenstein at openjdk.org Mon Oct 24 14:20:57 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 24 Oct 2022 14:20:57 GMT Subject: RFR: JDK-8294565: IGV: ClassCastException when clicking on an edge in the graph [v3] In-Reply-To: <-G_EjwQP87JOmty__hTUAsfx0txQSlP1GJ7Pbcb5o0o=.88c980a1-83b8-4acb-8fa1-09ab30639be3@github.com> References: <-G_EjwQP87JOmty__hTUAsfx0txQSlP1GJ7Pbcb5o0o=.88c980a1-83b8-4acb-8fa1-09ab30639be3@github.com> Message-ID: On Mon, 24 Oct 2022 10:00:49 GMT, Roberto Castañeda Lozano wrote: >> Tobias Holenstein has updated the pull request incrementally with two additional commits since the last revision: >> >> - whitespace >> >> Co-authored-by: Andrey Turbanov >> - whitespace >> >> Co-authored-by: Andrey Turbanov > > Looks good, thanks for fixing this! Thanks @robcasloz, @TobiHartmann and @turbanoff for the reviews! 
------------- PR: https://git.openjdk.org/jdk/pull/10760 From tholenstein at openjdk.org Mon Oct 24 14:20:58 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 24 Oct 2022 14:20:58 GMT Subject: Integrated: JDK-8294565: IGV: ClassCastException when clicking on an edge in the graph In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 11:13:35 GMT, Tobias Holenstein wrote: > IGV crashed when the user clicked on any edge in the graph because the `select` method of the `SelectProvider` in `LineWidget.java` was faulty. > > # Implementation > - `ActionFactory.createSelectAction` was changed to `CustomSelectAction` since it also supports to invert the selection with `Ctrl/CMD` > - The `select` method gets called when the user clicks on an edge. `LineWidget` represents a single connection going out of a node and connecting to one or more nodes (one-to-many). `LineWidget` has a list of `Connection`'s that each represent a single link between two nodes (one-to-one). For each `connection` we collect the from/to `Vertex` (superclass for a node) and put them into a set that we then use to select the nodes that are connected to the user-clicked `LineWidget`. This pull request has now been integrated. 
Changeset: c055dfc3 Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/c055dfc3ce5fe1cdc3e1a0d5a182df355a40c6b7 Stats: 88 lines in 2 files changed: 18 ins; 15 del; 55 mod 8294565: IGV: ClassCastException when clicking on an edge in the graph Reviewed-by: rcastanedalo, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/10760 From rcastanedalo at openjdk.org Mon Oct 24 14:35:56 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 24 Oct 2022 14:35:56 GMT Subject: RFR: JDK-8290010: IGV: Fix UndoRedo Action [v2] In-Reply-To: References: Message-ID: On Mon, 24 Oct 2022 11:58:51 GMT, Tobias Holenstein wrote: >> # Problem >> - The history of the Undo Action did not start at zero when opening a new graph. >> - In the "Show sea of nodes" view you could press Undo once before doing any changes. >> - In the "Show control view graph" you could even go back many times when opening a new graph, ending up in wired state of the graph. >> - Selecting a node should not be added to the history. You could select X different nodes in a row and then go back X times in history without that anything changed. >> >> # Overview UndoRedo >> Different `ChangedEvent` events are fired in `DiagramViewModel` that trigger the `DiagramScene` to update its state. After those updates `DiagramScene` calls `addUndo()` to save the new state of the `DiagramViewModel` as well as the scrolling position of the `DiagramScene` in a `DiagramUndoRedo` object. `undo()` / `redo()` can then be used to restore the saved state. >> >> One problem was that selecting nodes triggered a `ChangedEvent` event that caused a recording to the history but the selection itself was not part of the saved state. Another problem was that switching between different views (CFG view, sea-of-nodes view) as well as applying filters also triggered an `addUndo()` - The recording of filters and view switches is however not implemented. 
>> >> # Solution >> We now only record when the graph itself changes (not the view) - like when opening a new graph or creating a difference graph. And we also record changes in the selection of the visible nodes. >> >> We refactored the `ChangedEvent` events in `DiagramViewModel`: >> - `graphChangedEvent` now is fired when the graph changes and triggers an `addUndo()` >> - `hiddenNodesChangedEvent` is now fired when the selection of visible nodes changes and also triggers an `addUndo()` >> >> `DiagramUndoRedo` previously stored a deep copy of the `DiagramViewModel` for every datapoint in the history. Besides using a lot of memory, most of the stored objects were redundant or not used. Now, we introduced a new `ModelState` object that stores the id of the visible nodes in `Set hiddenNodes` as well as the opened graph. For the graph we only need `int firstPos` and `int secondPos` which indicate the position of the difference graph within the group. If we don't use a difference graph `firstPos` == `secondPos`. > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > disable undoRedo during execution of an undo/redo Thanks for making the undo/redo functionality usable! ------------- Marked as reviewed by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/10813 From tholenstein at openjdk.org Mon Oct 24 15:04:46 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 24 Oct 2022 15:04:46 GMT Subject: RFR: JDK-8290010: IGV: Fix UndoRedo Action [v2] In-Reply-To: References: Message-ID: On Mon, 24 Oct 2022 13:18:37 GMT, Tobias Hartmann wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> disable undoRedo during execution of an undo/redo > > Thanks, the new version works like a charm. Looks good to me. Thanks @TobiHartmann and @robcasloz for the reviews! 
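The "disable undoRedo during execution of an undo/redo" commit mentioned above boils down to a re-entrancy guard. The sketch below is an illustrative reduction, not IGV's actual `DiagramScene`/`DiagramUndoRedo` code: a flag suppresses `addUndo()` while `undo()`/`redo()` is restoring a state, so the change events fired during restoration do not pollute the history (which is what previously made redo stop working after one step).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the fix idea (class and field names are not IGV's).
class UndoRedoSketch {
    private final List<String> history = new ArrayList<>();
    private int pos = -1;        // index of the current state in the history
    private boolean performing;  // true while undo()/redo() is running

    void addUndo(String state) {
        if (performing) return;  // the fix: no recording during undo/redo
        history.subList(pos + 1, history.size()).clear();  // drop redo-able future
        history.add(state);
        pos = history.size() - 1;
    }

    void undo() {
        if (pos <= 0) return;
        performing = true;
        try { restore(history.get(--pos)); } finally { performing = false; }
    }

    void redo() {
        if (pos >= history.size() - 1) return;
        performing = true;
        try { restore(history.get(++pos)); } finally { performing = false; }
    }

    // Restoring a state fires change events that call addUndo() again;
    // the guard above keeps that re-entrant call from rewriting the history.
    private void restore(String state) {
        addUndo(state);  // simulated re-entrant call from a change event
    }

    public static void main(String[] args) {
        UndoRedoSketch u = new UndoRedoSketch();
        u.addUndo("1"); u.addUndo("2"); u.addUndo("3");
        u.undo();  // back to "2"
        u.redo();  // forward to "3" again -- works because undo() recorded nothing
        System.out.println(u.history.get(u.pos) + " " + u.history.size());
    }
}
```

Without the guard, the re-entrant `addUndo("2")` during `undo()` would truncate the history at position 1 and append a duplicate state, which is exactly the "redo works only once" symptom reported earlier in the thread.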
------------- PR: https://git.openjdk.org/jdk/pull/10813 From tholenstein at openjdk.org Mon Oct 24 15:08:52 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 24 Oct 2022 15:08:52 GMT Subject: Integrated: JDK-8290010: IGV: Fix UndoRedo Action In-Reply-To: References: Message-ID: On Fri, 21 Oct 2022 11:53:31 GMT, Tobias Holenstein wrote: > # Problem > - The history of the Undo Action did not start at zero when opening a new graph. > - In the "Show sea of nodes" view you could press Undo once before doing any changes. > - In the "Show control view graph" you could even go back many times when opening a new graph, ending up in a weird state of the graph. > - Selecting a node should not be added to the history. You could select X different nodes in a row and then go back X times in history without anything changing. > > # Overview UndoRedo > Different `ChangedEvent` events are fired in `DiagramViewModel` that trigger the `DiagramScene` to update its state. After those updates `DiagramScene` calls `addUndo()` to save the new state of the `DiagramViewModel` as well as the scrolling position of the `DiagramScene` in a `DiagramUndoRedo` object. `undo()` / `redo()` can then be used to restore the saved state. > > One problem was that selecting nodes triggered a `ChangedEvent` event that caused a recording in the history, but the selection itself was not part of the saved state. Another problem was that switching between different views (CFG view, sea-of-nodes view) as well as applying filters also triggered an `addUndo()` - the recording of filters and view switches is, however, not implemented. > > # Solution > We now only record when the graph itself changes (not the view) - like when opening a new graph or creating a difference graph. And we also record changes in the selection of the visible nodes. 
> > We refactored the `ChangedEvent` events in `DiagramViewModel`: > - `graphChangedEvent` now is fired when the graph changes and triggers an `addUndo()` > - `hiddenNodesChangedEvent` is now fired when the selection of visible nodes changes and also triggers an `addUndo()` > > `DiagramUndoRedo` previously stored a deep copy of the `DiagramViewModel` for every datapoint in the history. Besides using a lot of memory, most of the stored objects were redundant or not used. Now, we introduced a new `ModelState` object that stores the id of the visible nodes in `Set hiddenNodes` as well as the opened graph. For the graph we only need `int firstPos` and `int secondPos` which indicate the position of the difference graph within the group. If we don't use a difference graph `firstPos` == `secondPos`. This pull request has now been integrated. Changeset: 5ac6f185 Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/5ac6f185eec9efb063bf271516df6529b732a043 Stats: 292 lines in 6 files changed: 44 ins; 167 del; 81 mod 8290010: IGV: Fix UndoRedo Action Reviewed-by: thartmann, rcastanedalo ------------- PR: https://git.openjdk.org/jdk/pull/10813 From eastigeevich at openjdk.org Mon Oct 24 15:28:57 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 24 Oct 2022 15:28:57 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 12:04:32 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > minor renaming. 
adding encoding examples table src/hotspot/share/code/compressedStream.cpp line 142: > 140: // - the payload of the first byte is 6 bits, the payload of the following bytes is 7 bits > 141: // - the most significant bit in the first byte is occupied by a zero flag > 142: // - each byte has a bit indicating whether it is the last byte in the sequence in each byte bit #6 indicates whether it is the last byte in the sequence ------------- PR: https://git.openjdk.org/jdk/pull/10025 From sviswanathan at openjdk.org Mon Oct 24 18:26:51 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Mon, 24 Oct 2022 18:26:51 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v3] In-Reply-To: References: Message-ID: On Fri, 21 Oct 2022 20:20:58 GMT, vpaprotsk wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 
1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > vpaprotsk has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > further restrict UsePolyIntrinsics with supports_avx512vlbw @ascarpino Could you please also take a look at this PR? ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Mon Oct 24 18:28:08 2022 From: duke at openjdk.org (SuperCoder79) Date: Mon, 24 Oct 2022 18:28:08 GMT Subject: RFR: 8291336: Add ideal rule to convert floating point multiply by 2 into addition [v5] In-Reply-To: References: Message-ID: On Mon, 24 Oct 2022 13:41:12 GMT, SuperCoder79 wrote: >> Hello, >> I would like to propose an ideal transform that converts floating point multiply by 2 (`x * 2`) into an addition operation instead. This would allow for the elimination of the memory reference for the constant two, and keep the whole operation inside registers. My justifications for this optimization include: >> * As per [Agner Fog's instruction tables](https://www.agner.org/optimize/instruction_tables.pdf) many older systems, such as the sandy bridge and ivy bridge architectures, have different latencies for addition and multiplication meaning this change could have beneficial effects when in hot code. >> * The removal of the memory load would have a beneficial effect in cache bound situations. 
>> * Multiplication by 2 is a relatively common construct, so this change can apply to a wide range of Java code. >> >> As this is my first time looking into the c2 codebase, I have a few lingering questions about my implementation and how certain parts of the compiler work. Mainly, is this patch getting the type of the operands correctly? I saw some cases where code used `bottom_type()` and other cases where it used `phase->type(value)`. Similarly, are nodes able to be reused as is being done in the AddNode constructors? I saw some places where the clone method was being used, but other places where it wasn't. >> >> I have attached an IR test and a JMH benchmark. Tier 1 testing passes on my machine. >> >> Thanks for your time, >> Jasmine > > SuperCoder79 has updated the pull request incrementally with one additional commit since the last revision: > > Apply changes from code review > > - Added interpreter assert Hi, apologies for the delayed reply, but I have fixed the style and have added verification of the optimization against the interpreter. A re-review would be much appreciated. Thanks for your time once again! 
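For context on why the rewrite is semantically safe: in IEEE 754 arithmetic, doubling a float is exact in both forms, so `x * 2.0f` and `x + x` produce bit-identical results for every non-NaN input, including infinities, signed zeros and subnormals. A quick standalone check (separate from the patch's IR test and interpreter verification):

```java
// Standalone check (not part of the patch) that x * 2.0f and x + x are
// bit-identical for floats, which is what makes a MulF -> AddF rewrite safe.
public class MulByTwoCheck {
    public static boolean sameBits(float x) {
        return Float.floatToRawIntBits(x * 2.0f) == Float.floatToRawIntBits(x + x);
    }

    public static void main(String[] args) {
        float[] samples = {0.0f, -0.0f, 1.5f, -3.25f, Float.MIN_VALUE,
                           Float.MAX_VALUE, Float.POSITIVE_INFINITY};
        for (float x : samples) {
            if (!sameBits(x)) {
                throw new AssertionError("mismatch for " + x);
            }
        }
        // NaN is checked separately: the bit pattern of a NaN result is not
        // guaranteed, but both forms must still produce a NaN.
        if (!Float.isNaN(Float.NaN * 2.0f) || !Float.isNaN(Float.NaN + Float.NaN)) {
            throw new AssertionError("NaN not propagated");
        }
        System.out.println("x * 2 and x + x agree bitwise on all sampled values");
    }
}
```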
------------- PR: https://git.openjdk.org/jdk/pull/9642 From duke at openjdk.org Mon Oct 24 18:28:09 2022 From: duke at openjdk.org (SuperCoder79) Date: Mon, 24 Oct 2022 18:28:09 GMT Subject: RFR: 8291336: Add ideal rule to convert floating point multiply by 2 into addition [v4] In-Reply-To: <8tzo4InfV5VbQzBpEb-d1RR4r85qbHcISsQ75gQTA5k=.7fa6d029-0372-47c5-9d24-0839fa2678e3@github.com> References: <8tzo4InfV5VbQzBpEb-d1RR4r85qbHcISsQ75gQTA5k=.7fa6d029-0372-47c5-9d24-0839fa2678e3@github.com> Message-ID: On Tue, 27 Sep 2022 18:42:10 GMT, Quan Anh Mai wrote: >> SuperCoder79 has updated the pull request incrementally with one additional commit since the last revision: >> >> Apply style changes from code review > > src/hotspot/share/opto/mulnode.cpp line 438: > >> 436: //------------------------------Ideal--------------------------------------- >> 437: // Check to see if we are multiplying by a constant 2 and convert to add, then try the regular MulNode::Ideal >> 438: Node *MulFNode::Ideal(PhaseGVN *phase, bool can_reshape) { > > Please use the format `Type* identifier` for new code (in this case it is `PhaseGVN* phase`). The same applies to other places. Done > test/hotspot/jtreg/compiler/c2/irTests/TestMulNodeIdealization.java line 61: > >> 59: @Run(test = "testFloat") >> 60: public void runTestFloat() { >> 61: testFloat(RANDOM.nextFloat()); > > Verification against the execution in the interpreter would be better here. Completed, thanks! ------------- PR: https://git.openjdk.org/jdk/pull/9642 From eastigeevich at openjdk.org Mon Oct 24 19:35:50 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 24 Oct 2022 19:35:50 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 12:04:32 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. 
Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > minor renaming. adding encoding examples table src/hotspot/share/code/compressedStream.cpp line 191: > 189: write((_curr_byte << (8 - _bit_pos)) | (b >> _bit_pos)); > 190: _curr_byte = (0xff >> (8 - _bit_pos)) & b; > 191: } I am trying to understand this function. Please correct me if I am wrong. `write_byte_impl(b)`, where `b` is non-zero: it combines the low `_bit_pos` bits of `_curr_byte` shifted left with the high `8 - _bit_pos` bits of `b` shifted right. It stores the low `_bit_pos` bits of `b` into `_curr_byte`. `_bit_pos` is not changed. Let's start from `_curr_byte` 0 and `_bit_pos` 0. We have:

write_int(0) -> _curr_byte: 0, _bit_pos: 1
write_int(0) -> _curr_byte: 0, _bit_pos: 2
write_int(2) -> write(00100000), _curr_byte: 00000010, _bit_pos: 3
write_int(0) -> _curr_byte: 00000100, _bit_pos: 4
write_int(2) -> write(01001000), _curr_byte: 00000010, _bit_pos: 4

Written bytes: 00100000, 01001000

If there are no more `write_int` calls, `_curr_byte` will be lost. I don't see what causes it to be written. A similar issue of a lost `_curr_byte` arises if there are 7 or fewer `write_int(0)` calls and `_bit_pos` is 0. I think `_bit_pos` is actually the number of used low bits in `_curr_byte`. Am I correct? 
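To make the concern concrete, here is a toy Java model of such a bit-buffered writer. It is not the HotSpot `CompressedSparseDataWriteStream` (the real encoding with zero flags and last-byte bits is more involved); it is a minimal sketch showing why buffered trailing bits are lost unless something flushes the partial byte:

```java
import java.io.ByteArrayOutputStream;

// Toy bit-oriented writer illustrating the flush problem discussed above.
// NOT the HotSpot code: bits accumulate in currByte and are only emitted
// once 8 are buffered, so trailing bits are lost without an explicit flush.
public class BitWriterSketch {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int currByte = 0; // low bitPos bits are valid
    private int bitPos = 0;   // number of buffered bits, 0..7

    public void writeBit(int b) {
        currByte = (currByte << 1) | (b & 1);
        if (++bitPos == 8) {
            out.write(currByte);
            currByte = 0;
            bitPos = 0;
        }
    }

    // Without this, up to 7 trailing bits stay in currByte and never
    // reach the output -- the issue raised in the review.
    public void flush() {
        if (bitPos > 0) {
            out.write((currByte << (8 - bitPos)) & 0xff); // left-align remaining bits
            currByte = 0;
            bitPos = 0;
        }
    }

    public byte[] toByteArray() { return out.toByteArray(); }

    public static void main(String[] args) {
        BitWriterSketch w = new BitWriterSketch();
        for (int i = 0; i < 3; i++) w.writeBit(1); // only 3 bits written
        System.out.println(w.toByteArray().length); // 0: bits still buffered
        w.flush();
        System.out.println(w.toByteArray().length); // 1: byte 11100000 emitted
    }
}
```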
------------- PR: https://git.openjdk.org/jdk/pull/10025 From xxinliu at amazon.com Mon Oct 24 20:11:20 2022 From: xxinliu at amazon.com (Liu, Xin) Date: Mon, 24 Oct 2022 13:11:20 -0700 Subject: RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: <4127d57d-ca6f-0cda-13a8-efbdd2ef0501@oracle.com> References: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> <94A2737B-0FDE-4159-827F-15B6ACE1BC42@oracle.com> <114af950-f6b6-7e4a-8ac0-3da99bd40297@amazon.com> <2f29160c-7368-7c11-924e-a626e42c3aa2@amazon.com> <6d5c2aa5-c684-bc42-765d-ed116d3ef43c@oracle.com> <0bc75ee6-641f-1145-8fde-6d11e2ec887e@amazon.com> <1da07de9-90d2-d4ad-188e-d7d976009f52@oracle.com> <4768851c-2f3b-69be-ce28-070dae4792c7@amazon.com> <4127d57d-ca6f-0cda-13a8-efbdd2ef0501@oracle.com> Message-ID: <4d996adb-7d10-aa02-3a47-73d043b5013d@amazon.com> hi, Vladimir Ivanov, Your email is my starting point. It's thorough and very insightful. We spent a lot of time trying to crack your questions. The RFC is the summary of what we got. I owe you a big thank-you! Sorry, I still haven't had a clear answer for "2. Move vs Split decision" yet. Stadler's algorithm just materializes a virtual object on demand. After that, its state changes from virtual to materialized. IMHO, I don't think it is optimal placement. I feel the optimal placement should be along the domination frontier of the original AllocateNode and Exit. It is like the minimal phi construction and we might borrow the idea from it. I put it aside because it's indeed an optimization problem. Maybe it's not a big deal in common cases. I will try to answer your questions inline. On 10/20/22 5:26 PM, Vladimir Ivanov wrote: > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > Hi, > >> I would like to update on this. I managed to get PEA to work in Vladimir >> Ivanov's testcase. I put the testcase, assembly and graphs here[1]. 
>> >> Even though it is quite a simple case, I think it demonstrates that the >> RFC is practical in C2. I proposed 3 major differences from Graal. > > Nice! Also, a very similar (but a much more popular case) should be > escape sites in catch blocks (as reported by JDK-8267532 [1]). > >> 1. The algorithm runs in parser instead of optimizer. >> 2. Prefer clone-and-eliminate strategy rather than >> virtualize-and-materialize. >> 3. Refrain from scalar replacement on-the-fly. > > I don't understand how you plan to implement it solely during parsing. > You could do some bookkeeping during parsing and capture JVM state, but > I don't see how to do EA that early. > > Also, please, elaborate on #3. It's not clear to me what you mean there. > I added a PEAState for each basic block[2]. If we need to materialize a virtual object, we just create a new AllocateNode and its children in-place[3]. In Stadler's algorithm, it inherently performs "scalar replacement". Its first step is to delete the original allocation node. The PEA phase replaces LoadField nodes with scalars. Because I propose to focus on 'escaping objects' only, we don't perform 'scalar replacement'. I leave them to the C2 EA/SR. That's why I say "I refrain from on-the-fly scalar replacement". >> The test excises them all. I pasted 3 graphs here[2]. When we >> materialize an object, we just clone it with the right JVMState. It >> shows that C2 IterEA can automatically pick up the obsolete object and >> get rid of it, as we expected. >> >> It turns out cloning an object isn't as complex as I thought. I mainly >> spent time on adjusting JVMState for the cloned AllocateNode. Not only >> do I need to call sync_jvm(), I also need to 1) kill dead locals 2) clean the stack >> and even avoid re-execution of that bci. 
>> >> JVMState* jvms = parser->sync_jvms(); >> SafePointNode* map = jvms->map(); >> parser->kill_dead_locals(); >> parser->clean_stack(jvms->sp()); >> jvms->set_should_reexecute(false); >> >> Clearly, the algorithm hasn't been completed yet. I am still working on >> the MergeProcessor, general class fields and loop constructs. > > There was a previous discussion on PEA for C2 back in 2021 [2] [3]. One > interesting observation related to your current experiments was: > > "4. Escape sites separate the graph into 2 parts: before and after the > instance escapes. In order to preserve identity invariants (and avoid > identity paradoxes), PEA can't just put an allocation at every escape > site. It should respect the order of escape events and ensure that the > very same object is observed when multiple escape events happen. > > Dynamic invariant can be formulated as: there should never be more than > 1 allocation at runtime per 1 eliminated allocation. > > Considering non-escaping operations can force materialization on their > own, it poses additional constraints." > > So, when you clone an allocation, you should ensure that only a single > instance can be observed. And safepoints can be escape points as well > (rematerialization in case of a deoptimization event). > This is fairly complex. That's why I suggest focusing on 'escaping objects' only in C2 PEA. Please assume that all non-escaping objects remain intact after our PEA. I think we can guarantee the dynamic invariant here. First of all, we traverse basic blocks in reverse post-order (RPO). It's in the same direction as execution. An object allocation is virtual initially. Once a virtual object becomes materialized, it won't change back. We track allocation states, so we know that. The following appearances of a materialized object won't cause 'materialization' again. It will be treated as an ordinary object. Here is Example2, and I am still working on it. 
Besides the place where x escapes to _cache, we also need to materialize the virtual object before merging two basic blocks. This is described in section 5.3 "Merge nodes" of Stadler's CGO paper.

class Example2 {
    private Object _cache;

    public Object foo(boolean cond) {
        Object x = new Object();
        blackhole();
        if (cond) {
            _cache = x;
        }
        return x;
    }

    public static void blackhole() {}
    ...
}

We expect to see code as follows after PEA. "x2 = new Object()" is the result of materialization at the merging point.

public Object foo(boolean cond) {
    Object x0 = new Object();
    blackhole();
    if (cond) {
        x1 = new Object();
        _cache = x1;
    }
    x3 = phi(x2 = new Object(), x1);
    return x3;
}

We've proved that the obsolete object is either dead or scalar replaceable after PEA[4]. We expect C2 EA/SR to get rid of x0 down the road. Please note that x0 (the obsolete obj) dominates all clones (x1 and x2); because we materialize them before merging, we can guarantee the dynamic invariant. I plan to mark the original AllocateNode 'obsolete' if PEA does materialize it. I am going to assert in MacroExpansion that it won't expand an obsolete object. If it did, it would introduce redundancy and violate the 'dynamic invariant'. [1] https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2021-May/047536.html [2] https://github.com/navyxliu/jdk/blob/PEA_parser/src/hotspot/share/opto/parse.hpp#L171 [3] https://github.com/navyxliu/jdk/blob/PEA_parser/src/hotspot/share/opto/parseHelper.cpp#L367 [4] https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2022-October/059432.html >> I haven't figured out how to test PEA in a reliable way. It is not easy >> for the IR framework to capture node movement. If we measure allocation >> rate, it will be subject to CPU capability and also the sampling rate. I >> came up with an idea, a so-called 'Epsilon-Test'. We create a JVM with >> EpsilonGC and a fixed Java heap. Because EpsilonGC never replenishes the >> Java heap, we can count how many iterations a test can run before OOME. 
>> The less allocation made in a method, the more iterations HotSpot can >> execute the method. This isn't perfect either. I found that hotspot >> can't guarantee to execute the final-block in this case[3]. So far, I >> just measure execution time instead. > > It sounds more like a job for benchmarks, but focused on measuring > allocation rate (per iteration). ("-prof gc" mode in JMH terms.) > > Personally, I very much liked the IR framework-based approach Cesar used > in the unit test for allocation merges [4]. Do you see any problems with > that? > > Best regards, > Vladimir Ivanov > Okay. I will follow this direction. thanks, --lx > [1] https://bugs.openjdk.org/browse/JDK-8267532 > [2] > https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2021-May/047486.html > [3] > https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2021-May/047536.html > [4] https://github.com/openjdk/jdk/pull/9073 > > >> >> I'd appreciate your feedback, and please let me know if you spot any red flags. >> >> [1] https://gist.github.com/navyxliu/9c325d5c445899c02a0d115c6ca90a79 >> >> [2] >> https://gist.github.com/navyxliu/9c325d5c445899c02a0d115c6ca90a79?permalink_comment_id=4341838#gistcomment-4341838 >> >> [3] https://gist.github.com/navyxliu/9c325d5c445899c02a0d115c6ca90a79#file-example1-java-L43 >> >> thanks, >> --lx >> >> >> >> >> On 10/12/22 11:17 AM, Vladimir Kozlov wrote: >>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >>> >>> >>> >>> On 10/12/22 7:58 AM, Liu, Xin wrote: >>>> hi, Vladimir, >>>>> You should show that your implementation can rematirealize an object >>>> at any escape site. >>>> >>>> My understanding is I'm supposed to 'materialize' an object at any escape site. >>> >>> Words ;^) >>> >>> Yes, I mistyped and misspelled. >>> >>> Vladimir K >>> >>>> >>>> 'rematerialize' refers to 'create a scalar-replaced object on heap' in >>>> deoptimization. 
It's for the interpreter, as if the object was created in the >>>> first place. It doesn't apply to an escaped object because it's marked >>>> 'GlobalEscaped' in C2 EA. >>>> >>>> >>>> Okay. I will try this idea! >>>> >>>> thanks, >>>> --lx >>>> >>>> >>>> >>>> >>>> On 10/11/22 3:12 PM, Vladimir Kozlov wrote: >>>>> Also in your test there should be no merge at safepoint2 because `obj` is "not alive" (not referenced) anymore. From sviswanathan at openjdk.org Mon Oct 24 20:35:53 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Mon, 24 Oct 2022 20:35:53 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v3] In-Reply-To: References: Message-ID: On Fri, 21 Oct 2022 20:20:58 GMT, vpaprotsk wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 
86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > vpaprotsk has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > further restrict UsePolyIntrinsics with supports_avx512vlbw test/micro/org/openjdk/bench/javax/crypto/full/Poly1305DigestBench.java line 37: > 35: import java.security.spec.AlgorithmParameterSpec; > 36: import javax.crypto.spec.SecretKeySpec; > 37: Please add the following: import org.openjdk.jmh.annotations.Fork; @Fork(value = 1, jvmArgsAppend = {"--add-opens", "java.base/com.sun.crypto.provider=ALL-UNNAMED"}) ------------- PR: https://git.openjdk.org/jdk/pull/10582 From eastigeevich at openjdk.org Mon Oct 24 21:02:50 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 24 Oct 2022 21:02:50 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 12:04:32 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). 
Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > minor renaming. adding encoding examples table src/hotspot/share/code/compressedStream.cpp line 175: > 173: write(_curr_byte << (8 - _bit_pos)); > 174: _curr_byte = 0; > 175: _bit_pos = 0; Let's extract this and call it `flush()`. src/hotspot/share/code/compressedStream.hpp line 185: > 183: int position(); // method have a side effect: the current byte becomes aligned > 184: void set_position(int pos) { > 185: position(); `flush()` ------------- PR: https://git.openjdk.org/jdk/pull/10025 From eastigeevich at openjdk.org Mon Oct 24 21:13:55 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 24 Oct 2022 21:13:55 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 12:04:32 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > minor renaming. adding encoding examples table src/hotspot/share/code/compressedStream.hpp line 186: > 184: void set_position(int pos) { > 185: position(); > 186: _position = pos; I think `pos` must be `<= position()`. Should we check this? 
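The two review suggestions above — extracting the alignment side effect into an explicit `flush()`, and checking `pos <= position()` in `set_position()` — can be sketched together. This is a loose Java model of the C++ snippet, not the actual HotSpot code; the field names are assumptions:

```java
// Sketch of the refactoring suggested in the review: pull the byte-alignment
// side effect out of position() into an explicit flush(), and guard
// setPosition() with the pos <= position() check. Not the HotSpot code.
public class StreamPositionSketch {
    int committed = 0; // bytes written out so far
    int currByte = 0;  // partially filled byte
    int bitPos = 0;    // number of buffered bits in currByte (0..7)

    void flush() { // previously hidden inside position()
        if (bitPos > 0) {
            emit((currByte << (8 - bitPos)) & 0xff); // left-align remaining bits
            currByte = 0;
            bitPos = 0;
        }
    }

    int position() { // now a side-effect-free accessor
        return committed;
    }

    void setPosition(int pos) {
        flush(); // the alignment side effect is explicit at the call site
        if (pos > committed) { // the check the reviewer asks about
            throw new IllegalArgumentException("cannot seek past written data");
        }
        committed = pos;
    }

    void emit(int b) {
        committed++; // a real stream would also store b
    }

    public static void main(String[] args) {
        StreamPositionSketch s = new StreamPositionSketch();
        s.bitPos = 3;
        s.currByte = 0b101;
        s.flush();
        System.out.println(s.position()); // 1: the partial byte was committed
    }
}
```

With this split, an assert that calls `position()` (as in `DebugInformationRecorder`) no longer mutates the stream.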
------------- PR: https://git.openjdk.org/jdk/pull/10025 From duke at openjdk.org Mon Oct 24 22:06:56 2022 From: duke at openjdk.org (vpaprotsk) Date: Mon, 24 Oct 2022 22:06:56 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v4] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s vpaprotsk has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains eight commits: - assembler checks and test case fixes - Merge remote-tracking branch 'origin/master' into avx512-poly - Merge remote-tracking branch 'origin' into avx512-poly - further restrict UsePolyIntrinsics with supports_avx512vlbw - missed white-space fix - - Fix whitespace and copyright statements - Add benchmark - Merge remote-tracking branch 'vpaprotsk/master' into avx512-poly - Poly1305 AVX512 intrinsic for x86_64 ------------- Changes: https://git.openjdk.org/jdk/pull/10582/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=03 Stats: 1719 lines in 30 files changed: 1685 ins; 3 del; 31 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Mon Oct 24 22:06:57 2022 From: duke at openjdk.org (vpaprotsk) Date: Mon, 24 Oct 2022 22:06:57 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v4] In-Reply-To: References: Message-ID: On Tue, 18 Oct 2022 06:26:38 GMT, Jatin Bhateja wrote: >> vpaprotsk has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: >> >> - assembler checks and test case fixes >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - Merge remote-tracking branch 'origin' into avx512-poly >> - further restrict UsePolyIntrinsics with supports_avx512vlbw >> - missed white-space fix >> - - Fix whitespace and copyright statements >> - Add benchmark >> - Merge remote-tracking branch 'vpaprotsk/master' into avx512-poly >> - Poly1305 AVX512 intrinsic for x86_64 > > src/hotspot/cpu/x86/assembler_x86.cpp line 5484: > >> 5482: >> 5483: void Assembler::evpunpckhqdq(XMMRegister dst, KRegister mask, XMMRegister src1, XMMRegister src2, bool merge, int vector_len) { >> 5484: assert(UseAVX > 2, "requires AVX512F"); > > Please replace flag with feature EVEX check. 
done > src/hotspot/cpu/x86/assembler_x86.cpp line 7831: > >> 7829: >> 7830: void Assembler::vpandq(XMMRegister dst, XMMRegister nds, Address src, int vector_len) { >> 7831: assert(VM_Version::supports_evex(), ""); > > Assertion should check existence of AVX512VL for non-512-bit vectors. done > src/hotspot/cpu/x86/assembler_x86.cpp line 7958: > >> 7956: >> 7957: void Assembler::vporq(XMMRegister dst, XMMRegister nds, Address src, int vector_len) { >> 7958: assert(VM_Version::supports_evex(), ""); > > Same as above done > src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 1960: > >> 1958: address StubGenerator::generate_poly1305_masksCP() { >> 1959: StubCodeMark mark(this, "StubRoutines", "generate_poly1305_masksCP"); >> 1960: address start = __ pc(); > > You may use [align64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/stubGenerator_x86_64.cpp#L777) here, like done ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Mon Oct 24 22:06:58 2022 From: duke at openjdk.org (vpaprotsk) Date: Mon, 24 Oct 2022 22:06:58 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v4] In-Reply-To: <523ASDMlZe7mAZaBQe3ipxBLaLum7_XZqLLUUgsCJi0=.db28f521-c957-4fb2-8dcc-7c09d46189e3@github.com> References: <523ASDMlZe7mAZaBQe3ipxBLaLum7_XZqLLUUgsCJi0=.db28f521-c957-4fb2-8dcc-7c09d46189e3@github.com> Message-ID: On Tue, 18 Oct 2022 23:03:55 GMT, Sandhya Viswanathan wrote: >> vpaprotsk has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains eight commits: >> >> - assembler checks and test case fixes >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - Merge remote-tracking branch 'origin' into avx512-poly >> - further restrict UsePolyIntrinsics with supports_avx512vlbw >> - missed white-space fix >> - - Fix whitespace and copyright statements >> - Add benchmark >> - Merge remote-tracking branch 'vpaprotsk/master' into avx512-poly >> - Poly1305 AVX512 intrinsic for x86_64 > > src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 262: > >> 260: private static void processMultipleBlocks(byte[] input, int offset, int length, byte[] aBytes, byte[] rBytes) { >> 261: MutableIntegerModuloP A = ipl1305.getElement(aBytes).mutable(); >> 262: MutableIntegerModuloP R = ipl1305.getElement(rBytes).mutable(); > > R doesn't need to be mutable. done > test/jdk/com/sun/crypto/provider/Cipher/ChaCha20/unittest/java.base/com/sun/crypto/provider/Poly1305IntrinsicFuzzTest.java line 39: > >> 37: public static void main(String[] args) throws Exception { >> 38: //Note: it might be useful to increase this number during development of new Poly1305 intrinsics >> 39: final int repeat = 100; > > Should we increase this repeat count for the c2 compiler to kick in for compiling engineUpdate() and have the call to stub in place from there? did it with `@run main/othervm -Xcomp -XX:-TieredCompilation com.sun.crypto.provider.Cipher.ChaCha20.Poly1305UnitTestDriver` > test/jdk/com/sun/crypto/provider/Cipher/ChaCha20/unittest/java.base/com/sun/crypto/provider/Poly1305KAT.java line 133: > >> 131: System.out.println("*** Test " + ++testNumber + ": " + >> 132: test.testName); >> 133: if (runSingleTest(test)) { > > runSingleTest may need to be called enough times for the engineUpdate to be compiled by c2.
added a second copy with `@run main/othervm -Xcomp -XX:-TieredCompilation com.sun.crypto.provider.Cipher.ChaCha20.Poly1305UnitTestDriver` ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Mon Oct 24 22:07:00 2022 From: duke at openjdk.org (vpaprotsk) Date: Mon, 24 Oct 2022 22:07:00 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v3] In-Reply-To: References: Message-ID: On Mon, 24 Oct 2022 20:31:31 GMT, Sandhya Viswanathan wrote: >> vpaprotsk has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: >> >> further restrict UsePolyIntrinsics with supports_avx512vlbw > > test/micro/org/openjdk/bench/javax/crypto/full/Poly1305DigestBench.java line 37: > >> 35: import java.security.spec.AlgorithmParameterSpec; >> 36: import javax.crypto.spec.SecretKeySpec; >> 37: > > Please add the following: > import org.openjdk.jmh.annotations.Fork; > @Fork(value = 1, jvmArgsAppend = {"--add-opens", "java.base/com.sun.crypto.provider=ALL-UNNAMED"}) done. Also added longer warmup ------------- PR: https://git.openjdk.org/jdk/pull/10582 From eastigeevich at openjdk.org Mon Oct 24 22:08:48 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 24 Oct 2022 22:08:48 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 07:44:49 GMT, Boris Ulasevich wrote: >> src/hotspot/share/code/compressedStream.cpp line 152: >> >>> 150: } >>> 151: >>> 152: int CompressedSparseDataWriteStream::position() { >> >> The function with a side effect looks strange to me. I see an assert in `DebugInformationRecorder::DebugInformationRecorder(OopRecorder* oop_recorder)` which uses it for checking. So the assert can cause side effects. I am not sure it is expected.
> > Storing the debug info is an iterative process. Chunks of data are compared to avoid duplication, and at some points the generated data is discarded and the position is rolled back. Besides read/write stream implementation internals, DebugInformationRecorder uses raw stream data access to track the similar chunks (see [DIR_Chunk](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/debugInfoRec.cpp#L57)) and [memcpy](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/nmethod.cpp#L1969) the raw data. We have to either (1) align the data on positions where the DebugInformationRecorder splits data into chunks or (2) take the bit position into account. > > I experimented with `int position` to contain both the stream bit position in the least significant bits and the stream byte position in the most significant bits. For me the code becomes less readable and the performance is questionable even without the DebugInformationRecorder update: > > - uint8_t b1 = _buffer[_position] << _bit_pos; > - uint8_t b2 = _buffer[++_position] >> (8 - _bit_pos); > + uint8_t b1 = _buffer[_position >> 3] << (_position & 0x7); > + _position += 8; > + uint8_t b2 = _buffer[_position >> 3] >> (8 - (_position & 0x7)); > > I would avoid this change and stay with the current implementation. In fact, there are not many aligned positions within the data. And `assert(_stream->position() > serialized_null, "sanity");` (thanks for noticing that!) in the constructor causes no problem because the data is aligned at the beginning of the stream. If we introduce `flush()` we can explicitly call it before any accesses to the internal buffer: `buffer()` and `position()`. We will add asserts to `buffer()` and `position()` checking whether there is unflushed data. I always prefer being explicit.
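(For readers following the stream discussion: the byte-plus-bit bookkeeping being debated can be sketched as a toy model. The Python below is illustrative only, with invented class and method names; it is not the HotSpot `CompressedSparseDataWriteStream`, but it shows why a partially filled byte forces `position()` to either flush or round up.)

```python
class BitWriteStream:
    """Toy model of a bit-granular write stream (hypothetical, not HotSpot code).

    Bytes are filled most-significant-bit first, so after writing a number of
    bits that is not a multiple of 8, the byte at _position is only partially
    used: the situation the flush()/position() discussion is about.
    """

    def __init__(self):
        self._buffer = bytearray(1)
        self._position = 0   # index of the partially filled byte
        self._used_bits = 0  # bits already consumed in _buffer[_position]

    def write_bit(self, bit):
        self._buffer[self._position] |= (bit & 1) << (7 - self._used_bits)
        self._used_bits += 1
        if self._used_bits == 8:
            self._position += 1
            self._used_bits = 0
            self._buffer.append(0)

    def write_byte(self, b):
        # A byte straddles two buffer bytes whenever _used_bits != 0.
        self._buffer[self._position] |= b >> self._used_bits
        self._position += 1
        self._buffer.append(0)
        if self._used_bits != 0:
            self._buffer[self._position] = (b << (8 - self._used_bits)) & 0xFF

    def position(self):
        # Read-only: rounds up to cover a partially used byte, no flushing.
        return self._position + (1 if self._used_bits != 0 else 0)

    def buffer(self):
        return bytes(self._buffer[:self.position()])
```

With this shape, `position()` stays side-effect free at the cost of counting a partially filled byte as already occupied; an explicit `flush()` would instead zero-pad and advance `_position`.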
------------- PR: https://git.openjdk.org/jdk/pull/10025 From duke at openjdk.org Mon Oct 24 22:09:29 2022 From: duke at openjdk.org (vpaprotsk) Date: Mon, 24 Oct 2022 22:09:29 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: References: Message-ID: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 
56.147 ops/s vpaprotsk has updated the pull request incrementally with one additional commit since the last revision: extra whitespace character ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10582/files - new: https://git.openjdk.org/jdk/pull/10582/files/de7e138b..883be106 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From sviswanathan at openjdk.org Tue Oct 25 00:34:53 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 25 Oct 2022 00:34:53 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: On Mon, 24 Oct 2022 22:09:29 GMT, vpaprotsk wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 
110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > vpaprotsk has updated the pull request incrementally with one additional commit since the last revision: > > extra whitespace character src/hotspot/cpu/x86/assembler_x86.cpp line 8306: > 8304: assert(dst != xnoreg, "sanity"); > 8305: InstructionMark im(this); > 8306: InstructionAttr attributes(vector_len, /* vex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true); no_mask_reg should be set to true here as we are not setting the mask register here. src/hotspot/cpu/x86/stubRoutines_x86.cpp line 83: > 81: address StubRoutines::x86::_join_2_3_base64 = NULL; > 82: address StubRoutines::x86::_decoding_table_base64 = NULL; > 83: address StubRoutines::x86::_poly1305_mask_addr = NULL; Please also update the copyright year to 2022 for stubRoutines_x86.cpp and hpp files. src/hotspot/cpu/x86/vm_version_x86.cpp line 925: > 923: _features &= ~CPU_AVX512_VBMI2; > 924: _features &= ~CPU_AVX512_BITALG; > 925: _features &= ~CPU_AVX512_IFMA; This should also be done under is_knights_family(). 
src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 175: > 173: // Choice of 1024 is arbitrary, need enough data blocks to amortize conversion overhead > 174: // and not affect platforms without intrinsic support > 175: int blockMultipleLength = (len/BLOCK_LENGTH) * BLOCK_LENGTH; The ByteBuffer version can also benefit from this optimization if it has an array as backing storage. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From eliu at openjdk.org Tue Oct 25 03:05:48 2022 From: eliu at openjdk.org (Eric Liu) Date: Tue, 25 Oct 2022 03:05:48 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand In-Reply-To: References: Message-ID: On Mon, 3 Oct 2022 14:00:51 GMT, Stuart Monteith wrote: > The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. There are no scalar instructions that perform the equivalent functions on aarch64; instead, the intrinsics can be implemented with vector instructions included in SVE2; expand with BDEP, compress with BEXT. > > Only the first lane of each vector will be used, two MOV instructions will move the inputs from GPRs into temporary vector registers, and another to do the reverse for the result. Autovectorization for this functionality is/will be implemented separately.
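(Background for this review: the scalar semantics that the BDEP/BEXT-backed intrinsics must match are those of `Integer.compress`/`Integer.expand`. A minimal Python model of those semantics, illustrative only and unrelated to the patch itself, is:)

```python
def compress_bits(i: int, mask: int) -> int:
    """Gather the bits of i selected by mask into the low bits (BEXT-like)."""
    result, out_pos = 0, 0
    for bit in range(32):
        if (mask >> bit) & 1:
            result |= ((i >> bit) & 1) << out_pos
            out_pos += 1
    return result

def expand_bits(i: int, mask: int) -> int:
    """Scatter the low bits of i to the positions selected by mask (BDEP-like)."""
    result, in_pos = 0, 0
    for bit in range(32):
        if (mask >> bit) & 1:
            result |= ((i >> in_pos) & 1) << bit
            in_pos += 1
    return result
```

A handy identity for testing: `expand_bits(compress_bits(x, m), m) == x & m`, and `compress_bits(expand_bits(x, m), m)` recovers the low `popcount(m)` bits of `x`.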
> > Running on an SVE2 enabled system, I ran the following benchmarks: > > org.openjdk.bench.java.lang.Integers > org.openjdk.bench.java.lang.Longs > > The time for each operation was reduced to 56% to 72% of the original run time: > > > Benchmark Result error Unit % against non-SVE2 > Integers.expand 2.106 0.011 us/op > Integers.expand-SVE 1.431 0.009 us/op 67.95% > Longs.expand 2.606 0.006 us/op > Longs.expand-SVE 1.46 0.003 us/op 56.02% > Integers.compress 1.982 0.004 us/op > Integers.compress-SVE 1.427 0.003 us/op 72.00% > Longs.compress 2.501 0.002 us/op > Longs.compress-SVE 1.441 0.003 us/op 57.62% > > > These methods can be specifically tested with: > `make test TEST="jtreg:compiler/intrinsics/TestBitShuffleOpers.java"` Sorry for the delay. Only a few trivial style issues. Otherwise it's okay to me. src/hotspot/cpu/aarch64/aarch64.ad line 16948: > 16946: instruct compressBitsI_reg(iRegINoSp dst, iRegIorL2I src, iRegIorL2I mask, > 16947: vRegF tdst, vRegF tsrc, vRegF tmask) %{ > 16948: match(Set dst (CompressBits src mask)); I would suggest aligning the predicate with the conditions in Matcher::match_rule_supported(int opcode). Suggestion: predicate(UseSVE > 1 && VM_Version::supports_svebitperm()); match(Set dst (CompressBits src mask)); src/hotspot/cpu/aarch64/aarch64.ad line 16977: > 16975: __ mov($tmask$$FloatRegister, __ D, 0, $mask$$Register); > 16976: __ sve_bext($tdst$$FloatRegister, __ D, $tsrc$$FloatRegister, $tmask$$FloatRegister); > 16977: __ mov($dst$$Register, $tdst$$FloatRegister, __ D, 0); %} Obviously this is hand-made, not generated by m4.
Suggestion: __ mov($dst$$Register, $tdst$$FloatRegister, __ D, 0); %} ------------- PR: https://git.openjdk.org/jdk/pull/10537 From haosun at openjdk.org Tue Oct 25 03:12:09 2022 From: haosun at openjdk.org (Hao Sun) Date: Tue, 25 Oct 2022 03:12:09 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand In-Reply-To: References: Message-ID: On Tue, 25 Oct 2022 02:50:30 GMT, Eric Liu wrote: >> The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. There are no scalar instructions that perform the equivalent functions on aarch64, instead the intrinsics can be implemented with vector instructions included in SVE2; expand with BDEP, compress with BEXT. >> >> Only the first lane of each vector will be used, two MOV instructions will move the inputs from GPRs into temporary vector registers, and another to do the reverse for the result. Autovectorization for this functionality is/will be implemented separately. 
>> >> Running on an SVE2 enabled system, I ran the following benchmarks: >> >> org.openjdk.bench.java.lang.Integers >> org.openjdk.bench.java.lang.Longs >> >> The time for each operation was reduced to 56% to 72% of the original run time: >> >> >> Benchmark Result error Unit % against non-SVE2 >> Integers.expand 2.106 0.011 us/op >> Integers.expand-SVE 1.431 0.009 us/op 67.95% >> Longs.expand 2.606 0.006 us/op >> Longs.expand-SVE 1.46 0.003 us/op 56.02% >> Integers.compress 1.982 0.004 us/op >> Integers.compress-SVE 1.427 0.003 us/op 72.00% >> Longs.compress 2.501 0.002 us/op >> Longs.compress-SVE 1.441 0.003 us/op 57.62% >> >> >> These methods can be specifically tested with: >> `make test TEST="jtreg:compiler/intrinsics/TestBitShuffleOpers.java"` > > src/hotspot/cpu/aarch64/aarch64.ad line 16948: > >> 16946: instruct compressBitsI_reg(iRegINoSp dst, iRegIorL2I src, iRegIorL2I mask, >> 16947: vRegF tdst, vRegF tsrc, vRegF tmask) %{ >> 16948: match(Set dst (CompressBits src mask)); > > I would suggest aligning the predicate with the conditions in Matcher::match_rule_supported(int opcode). > Suggestion: > > predicate(UseSVE > 1 && VM_Version::supports_svebitperm()); > match(Set dst (CompressBits src mask)); I suppose the predicate-stmt is not needed here, since the check has already been done in `match_rule_supported()` helper. ------------- PR: https://git.openjdk.org/jdk/pull/10537 From eastigeevich at openjdk.org Tue Oct 25 09:31:34 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 25 Oct 2022 09:31:34 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: <0r4l045eH8Xv-ymVbD0g0ls0igyo921BE380wWVfduM=.33f48e1d-0e4f-4ea0-b8b4-243da69296a5@github.com> On Mon, 24 Oct 2022 22:05:13 GMT, Evgeny Astigeevich wrote: >> Storing the debug info is an iterative process. Chunks of data are compared to avoid duplication, and at some points the generated data is discarded and the position is rolled back.
Besides read/write stream implementation internals, DebugInformationRecorder uses raw stream data access to track the similar chunks (see [DIR_Chunk](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/debugInfoRec.cpp#L57)) and [memcpy](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/nmethod.cpp#L1969) the raw data. We have either to (1) align the data on positions where the DebugInformationRecorder splits data into chunks or to (2) take the bit position into account. >> >> I experimented with `int position` to contain both the stream bit position in the least significant bits and stream byte position in most significant bits. For me the code becomes less readable and the performance is questionable even without the DebugInformationRecorder update: >> >> - uint8_t b1 = _buffer[_position] << _bit_pos; >> - uint8_t b2 = _buffer[++_position] >> (8 - _bit_pos); >> + uint8_t b1 = _buffer[_position >> 3] << (_position & 0x7); >> + _position += 8; >> + uint8_t b2 = _buffer[_position >> 3] >> (8 - _position & 0x7); >> >> I would avoid this change and stay with current implementation. In fact, there is not much aligned positions within the data. And `assert(_stream->position() > serialized_null, "sanity");` (thanks for noticing that!) in the constructor makes no problem because data is aligned at the beginning of the stream. > > If we introduce `flush()` we can explicitly call it before any accesses to the internal buffer: `buffer()` and `position()`. We will add asserts to `buffer()` and `position()` checking whether there is unflushed data. I always prefer being explicit. I think I have an idea of a solution which does not need `flush()` and will have the read-only `position()`. 
------------- PR: https://git.openjdk.org/jdk/pull/10025 From adinn at openjdk.org Tue Oct 25 09:42:56 2022 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 25 Oct 2022 09:42:56 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand In-Reply-To: References: Message-ID: On Tue, 25 Oct 2022 03:09:15 GMT, Hao Sun wrote: >> src/hotspot/cpu/aarch64/aarch64.ad line 16948: >> >>> 16946: instruct compressBitsI_reg(iRegINoSp dst, iRegIorL2I src, iRegIorL2I mask, >>> 16947: vRegF tdst, vRegF tsrc, vRegF tmask) %{ >>> 16948: match(Set dst (CompressBits src mask)); >> >> I would suggest aligning the predicate with the conditions in Matcher::match_rule_supported(int opcode). >> Suggestion: >> >> predicate(UseSVE > 1 && VM_Version::supports_svebitperm()); >> match(Set dst (CompressBits src mask)); > > I suppose the predicate-stmt is not needed here, since the check has already been done in `match_rule_supported()` helper. That's a good point. x86 rules all appear to omit any checks that appear in `match_rule_supported` (in most cases they have no predicate, in others they have a predicate that includes a further sub-constraint). For AArch64 the predicate test in `match_rule_supported` is omitted for `OP_OnSpinWait` but retained for `Op_CacheWB`, `CacheWBPreSync` and `CacheWBPostSync`. We should probably make this consistent by removing the repeat predicates for those last three cases as well. ------------- PR: https://git.openjdk.org/jdk/pull/10537 From eastigeevich at openjdk.org Tue Oct 25 09:42:59 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 25 Oct 2022 09:42:59 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 12:04:32 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). 
Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > minor renaming. adding encoding examples table src/hotspot/share/code/compressedStream.cpp line 179: > 177: } > 178: > 179: void CompressedSparseDataWriteStream::write_zero() { We can rewrite the function not to use `_curr_byte`. We work directly on `_buffer` and rename `_bit_pos` into `_used_bits`: _buffer[_position] >>= 1; if (++_used_bits == 8) { _position += 1; _used_bits = 0; } ------------- PR: https://git.openjdk.org/jdk/pull/10025 From eastigeevich at openjdk.org Tue Oct 25 10:00:28 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 25 Oct 2022 10:00:28 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 12:04:32 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > minor renaming. 
adding encoding examples table src/hotspot/share/code/compressedStream.cpp line 188: > 186: } > 187: > 188: void CompressedSparseDataWriteStream::write_byte_impl(uint8_t b) { _buffer[_position] |= (b >> _used_bits); _position += 1; _buffer[_position] = (b << (8 - _used_bits)); ------------- PR: https://git.openjdk.org/jdk/pull/10025 From tholenstein at openjdk.org Tue Oct 25 17:00:39 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 25 Oct 2022 17:00:39 GMT Subject: RFR: JDK-8265441: IGV: select block nodes by clicking on it Message-ID: In "Cluster nodes into blocks" mode, it is now possible to select all nodes in a block by simply double-clicking in the block. The attached images illustrate the new behavior, after double-clicking on block B10. Similarly, the current node selection can be extended with all nodes of a block when holding the Ctrl/Cmd-key and double-clicking on a block. ![select_block](https://user-images.githubusercontent.com/71546117/197827820-4edf3333-f0e8-4e77-849e-b8e09eaf67ef.png) # Overview selection new selection modes in **bold**. We refer to B4/B10 as _blocks_, and 86, 87, 88, ... as _nodes_ ## no key pressed + `click on single node` : select single node, unselect all other nodes + `click on edge` : select src/dest nodes, unselect all other nodes + **`double-click on block` : select all nodes in block, unselect all other nodes** + **`double-click outside of node/block` : unselect all nodes** ## holding down Ctrl/Cmd + `click on single node` : add node to current selection + `click on edge` : invert the selection of src/dest nodes + **`double-click on block` : add all nodes in block to current selection** + `draw selection rectangle` : **invert** the selection of all nodes in rectangle - select unselected nodes, **unselect selected nodes** # Implementation The main functionality was implemented by extending `BlockWidget` with `DoubleClickHandler` and adding methods `handleDoubleClick` / `addToSelection`. 
We also needed to replace `setSelectedNodes` with `clearSelectedNodes` and `addSelectedNodes` in `InputGraphProvider` and the corresponding methods in `EditorTopComponent`. All code that used `setSelectedNodes` needed to be adjusted accordingly. In order for the `DoubleClickHandler` in `BlockWidget` to work, we needed to extend `selectAction` in `DiagramScene` to _invert_ the selection of the nodes in the rectangle (the `symmetricDiff` set). ------------- Commit messages: - JDK-8265441: IGV: select block nodes by clicking on it Changes: https://git.openjdk.org/jdk/pull/10854/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10854&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8265441 Stats: 260 lines in 16 files changed: 72 ins; 104 del; 84 mod Patch: https://git.openjdk.org/jdk/pull/10854.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10854/head:pull/10854 PR: https://git.openjdk.org/jdk/pull/10854 From rcastanedalo at openjdk.org Tue Oct 25 18:09:49 2022 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 25 Oct 2022 18:09:49 GMT Subject: RFR: JDK-8265441: IGV: select block nodes by clicking on it In-Reply-To: References: Message-ID: On Tue, 25 Oct 2022 13:59:50 GMT, Tobias Holenstein wrote: > In "Cluster nodes into blocks" mode, it is now possible to select all nodes in a block by simply double-clicking in the block. The attached images illustrate the new behavior, after double-clicking on block B10. Similarly, the current node selection can be extended with all nodes of a block when holding the Ctrl/Cmd-key and double-clicking on a block. > > ![select_block](https://user-images.githubusercontent.com/71546117/197827820-4edf3333-f0e8-4e77-849e-b8e09eaf67ef.png) > > > # Overview selection > new selection modes in **bold**. We refer to B4/B10 as _blocks_, and 86, 87, 88, ...
as _nodes_ > > ## no key pressed > + `click on single node` : select single node, unselect all other nodes > + `click on edge` : select src/dest nodes, unselect all other nodes > + **`double-click on block` : select all nodes in block, unselect all other nodes** > + **`double-click outside of node/block` : unselect all nodes** > ## holding down Ctrl/Cmd > + `click on single node` : add node to current selection > + `click on edge` : invert the selection of src/dest nodes > + **`double-click on block` : add all nodes in block to current selection** > + `draw selection rectangle` : **invert** the selection of all nodes in rectangle > - select unselected nodes, **unselect selected nodes** > > > # Implementation > The main functionality was implemented by extending `BlockWidget` with `DoubleClickHandler` and adding methods `handleDoubleClick` / `addToSelection`. We also needed to replace `setSelectedNodes` with `clearSelectedNodes` and `addSelectedNodes` in `InputGraphProvider` and the corresponding methods in `EditorTopComponent`. All code that used `setSelectedNodes` needed to be adjusted accordingly. > > In order for the `DoubleClickHandler` in `BlockWidget` to work, we needed to extend `selectAction` in `DiagramScene` to _invert_ the selection of the nodes in the rectangle (the `symmetricDiff` set). Great functionality, thanks! ------------- Marked as reviewed by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/10854 From iveresov at openjdk.org Tue Oct 25 20:00:26 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Tue, 25 Oct 2022 20:00:26 GMT Subject: RFR: 8295066: Folding of loads is broken in C2 after JDK-8242115 Message-ID: The fix does two things: 1. Allow folding of pinned loads to constants with a straight line data flow (no phis). 2. Make scalarization aware of the new shape of the barriers so that pre-loads can be ignored. Testing is clean, Valhalla testing is clean too. 
------------- Commit messages: - Add test - Fix scalarization - Allow direct constant folding Changes: https://git.openjdk.org/jdk/pull/10861/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10861&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295066 Stats: 260 lines in 9 files changed: 178 ins; 46 del; 36 mod Patch: https://git.openjdk.org/jdk/pull/10861.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10861/head:pull/10861 PR: https://git.openjdk.org/jdk/pull/10861 From kvn at openjdk.org Tue Oct 25 22:06:17 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 25 Oct 2022 22:06:17 GMT Subject: RFR: 8295066: Folding of loads is broken in C2 after JDK-8242115 In-Reply-To: References: Message-ID: <5VWY6hlnoGyt8nqJMnX14qp7bpCvm4G1enchLM6NGT8=.f3a1b91d-12fb-4422-99ff-cc0dcbf669c5@github.com> On Tue, 25 Oct 2022 19:50:10 GMT, Igor Veresov wrote: > The fix does two things: > > 1. Allow folding of pinned loads to constants with a straight line data flow (no phis). > 2. Make scalarization aware of the new shape of the barriers so that pre-loads can be ignored. > > Testing is clean, Valhalla testing is clean too. Looks good. Please, test full first 3 tier1-3 (not just hs-tier*). ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10861 From jnimeh at openjdk.org Tue Oct 25 22:09:49 2022 From: jnimeh at openjdk.org (Jamil Nimeh) Date: Tue, 25 Oct 2022 22:09:49 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: On Mon, 24 Oct 2022 22:09:29 GMT, vpaprotsk wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. 
>> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > vpaprotsk has updated the pull request incrementally with one additional commit since the last revision: > > extra whitespace character src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 171: > 169: } > 170: > 171: if (len >= 1024) { Out of curiosity, do you have any perf numbers for the impact of this change on systems that do not support AVX512? Does this help or hurt (or make a negligible impact) on poly1305 updates when the input is 1K or larger? 
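(For reviewers who want an independent cross-check of tags produced by the intrinsic: the whole Poly1305 computation fits in a few lines. The sketch below follows RFC 8439 directly; the function name is invented and this is not the `com.sun.crypto.provider` code. Note the clamping of `r`, the same masking discussed elsewhere in this thread.)

```python
def poly1305_tag(key: bytes, msg: bytes) -> bytes:
    """RFC 8439 Poly1305 reference sketch (illustrative, not the JDK code)."""
    assert len(key) == 32
    # r = low 16 key bytes (clamped), s = high 16 key bytes, both little-endian
    r = int.from_bytes(key[:16], "little") & 0x0ffffffc0ffffffc0ffffffc0fffffff
    s = int.from_bytes(key[16:], "little")
    p = (1 << 130) - 5
    acc = 0
    for i in range(0, len(msg), 16):
        # Each (possibly short) block gets a 0x01 byte appended above its top byte.
        block = int.from_bytes(msg[i:i + 16] + b"\x01", "little")
        acc = (acc + block) * r % p
    return ((acc + s) & ((1 << 128) - 1)).to_bytes(16, "little")
```

Checked against the RFC 8439 section 2.5.2 test vector, a reference like this can serve as an oracle for fuzzing the intrinsic with random keys and message lengths.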
src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 296:

> 294: keyBytes[12] &= (byte)252;
> 295:
> 296: // This should be enabled, but Poly1305KAT would fail

I'm on the fence about this change. I have no problem with it in basic terms. If we ever decided to make this a general purpose Mac in JCE then this would definitely be good to do. As of right now, the only consumer is ChaCha20 and it would submit a key through the process in the RFC. Seems really unlikely to run afoul of these checks, but admittedly not impossible.

I would agree with @sviswa7 that we could examine this in a separate change and we could look at other approaches to getting around the KAT issue, perhaps some package-private based way to disable the check. As long as Poly1305 remains with package-private visibility, one could make another form of the constructor with a boolean that would disable this check and that is the constructor that the KAT would use. This is just an off-the-cuff idea, but one way we might get the best of both worlds. If we move this down the road then we should remove the commenting. We can refer back to this PR later.

-------------

PR: https://git.openjdk.org/jdk/pull/10582

From kvn at openjdk.org Tue Oct 25 22:33:28 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 25 Oct 2022 22:33:28 GMT
Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v3]
In-Reply-To: 
References: 
Message-ID: 

On Mon, 24 Oct 2022 08:13:06 GMT, Yi Yang wrote:

>> Hi can I have a review for this fix? LoadBNode::Ideal crashes after performing GVN right after EA.
The bad IR is as follows: >> >> ![image](https://user-images.githubusercontent.com/5010047/183106710-3a518e5e-0b59-4c3c-aba4-8b6fcade3519.png) >> >> The memory input of Load#971 is Phi#1109 and the address input of Load#971 is AddP whose object base is CheckCastPP#335: >> >> The type of Phi#1109 is `byte[int:>=0]:exact+any *` while `byte[int:8]:NotNull:exact+any *,iid=177` is the type of CheckCastPP#335 due to EA, they have different alias index, that's why we hit the assertion at L226: >> >> https://github.com/openjdk/jdk/blob/b17a745d7f55941f02b0bdde83866aa5d32cce07/src/hotspot/share/opto/memnode.cpp#L207-L226 >> (t is `byte[int:>=0]:exact+any *`, t_adr is `byte[int:8]:NotNull:exact+any *,iid=177`). >> >> There is a long story. In the beginning, LoadB#971 is generated at array_copy_forward, and GVN transformed it iteratively: >> >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) 
StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> ... >> >> In this case, we get alias index 5 from address input AddP#969, and step it through MergeMem#1046, we found Phi#1109 then, that's why LoadB->in(Mem) is changed from MergeMem#1046 to Phi#1109 (Which finally leads to crash). >> >> 1046 MergeMem === _ 1 160 389 389 1109 1 1 389 1 1 1 1 1 1 1 1 1 1 1 1 1 709 709 709 709 882 888 894 190 190 912 191 [[ 1025 1021 1017 1013 1009 1005 1002 1001 998 996 991 986 981 976 971 966 962 961 960 121 122 123 124 1027 ]] >> >> >> After applying this patch, some related nodes are pushed into the GVN worklist, before stepping through MergeMem#1046, the address input is already changed to AddP#473. i.e., we get alias index 32 from address input AddP#473, and step it through MergeMem#1046, we found StoreB#191 then,LoadB->in(Mem) is changed from MergeMem#1046 to StoreB#191. 
>> >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1046 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) 
StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 
Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22)
>> 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22)
>> 971 LoadB === 390 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22)
>> ...
>>
>> The well-formed IR looks like this:
>> ![image](https://user-images.githubusercontent.com/5010047/183239456-7096ea66-6fca-4c84-8f46-8c42d10b686a.png)
>>
>> Thanks for your patience.
>
> Yi Yang has updated the pull request incrementally with two additional commits since the last revision:
>
>  - fix
>  - always clone the Phi with address type

EA may be incorrectly processing LoadB#971 when it splits the unique memory slice for the allocation instance. LoadB#971's memory should not reference a Phi with a different type, and EA should look through MergeMem nodes when creating the unique memory slice. If LoadB#971 loads from a different object (or a merge of objects), then its AddP node should not be changed to an instance-specific one. It would be nice to see this subgraph just after EA.
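For readers without the reproducer at hand: the jvms chains in the dumps above all bottom out in `StringConcatHelper::simpleConcat` invoked from `Test::test`, i.e. a plain two-operand `String` concatenation. The thread does not show the original Test.java, so the following is only a hedged reconstruction of the triggering shape (names and iteration counts are illustrative):

```java
// Illustrative only: the original Test.java from JDK-8288204 is not shown in
// this thread. A two-operand String concatenation lowers to an invokedynamic
// that links to StringConcatHelper.simpleConcat, which reads String.coder --
// the LoadB node discussed in the IR dumps above.
public class ConcatShape {
    static String test(String a, String b) {
        return a + b;  // javac emits an invokedynamic string concat here
    }

    public static void main(String[] args) {
        String r = "";
        for (int i = 0; i < 20_000; i++) {  // enough iterations to reach C2
            r = test("foo", "bar");
        }
        System.out.println(r);  // prints "foobar"
    }
}
```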
-------------

PR: https://git.openjdk.org/jdk/pull/9777

From sviswanathan at openjdk.org Tue Oct 25 23:52:26 2022
From: sviswanathan at openjdk.org (Sandhya Viswanathan)
Date: Tue, 25 Oct 2022 23:52:26 GMT
Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5]
In-Reply-To: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com>
References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com>
Message-ID: 

On Mon, 24 Oct 2022 22:09:29 GMT, vpaprotsk wrote:

>> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`.
>>
>> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java.
>> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please.
>> - Added a JMH perf test.
>> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider.
>>
>> Perf before:
>>
>> Benchmark                  (dataSize) (provider)  Mode  Cnt        Score        Error  Units
>> Poly1305DigestBench.digest         64             thrpt   8  2961300.661 ± 110554.162  ops/s
>> Poly1305DigestBench.digest        256             thrpt   8  1791912.962 ±  86696.037  ops/s
>> Poly1305DigestBench.digest       1024             thrpt   8   637413.054 ±  14074.655  ops/s
>> Poly1305DigestBench.digest      16384             thrpt   8    48762.991 ±    390.921  ops/s
>> Poly1305DigestBench.digest    1048576             thrpt   8      769.872 ±      1.402  ops/s
>>
>> and after:
>>
>> Benchmark                  (dataSize) (provider)  Mode  Cnt        Score        Error  Units
>> Poly1305DigestBench.digest         64             thrpt   8  2841243.668 ± 154528.057  ops/s
>> Poly1305DigestBench.digest        256             thrpt   8  1662003.873 ±  95253.445  ops/s
>> Poly1305DigestBench.digest       1024             thrpt   8  1770028.718 ± 100847.766  ops/s
>> Poly1305DigestBench.digest      16384             thrpt   8   765547.287 ±  25883.825  ops/s
>> Poly1305DigestBench.digest    1048576             thrpt   8    14508.458 ±     56.147  ops/s
>
> vpaprotsk has updated the pull request incrementally with one additional commit since the last revision:
>
>   extra whitespace character

src/hotspot/cpu/x86/macroAssembler_x86_poly.cpp line 806:

> 804: evmovdquq(A0, Address(rsp, 64*0), Assembler::AVX_512bit);
> 805: evmovdquq(A0, Address(rsp, 64*1), Assembler::AVX_512bit);
> 806: evmovdquq(A0, Address(rsp, 64*2), Assembler::AVX_512bit);

This is a load from the stack into A0. Did you intend to store A0 (cleanup) into the stack local area here? I think the source and destination are mixed up.

-------------

PR: https://git.openjdk.org/jdk/pull/10582

From dlong at openjdk.org Tue Oct 25 23:57:22 2022
From: dlong at openjdk.org (Dean Long)
Date: Tue, 25 Oct 2022 23:57:22 GMT
Subject: RFR: 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes
In-Reply-To: 
References: 
Message-ID: 

On Mon, 24 Oct 2022 08:36:45 GMT, Doug Simon wrote:

>> 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes
>
> src/hotspot/share/asm/codeBuffer.hpp line 755:
>
>> 753: if (EnableJVMCI) {
>> 754: // Graal vectorization requires larger aligned constants
>> 755: return 64;
>
> This means all Graal installed code will pay a penalty even though most installed code does not include constants that need such large alignment. It would be preferable to allow a compiler to specify the alignment requirement per nmethod.

I agree. The alignment should probably be a field in CodeBuffer with a default value of 8.
------------- PR: https://git.openjdk.org/jdk/pull/10392 From iveresov at openjdk.org Wed Oct 26 04:19:23 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Wed, 26 Oct 2022 04:19:23 GMT Subject: RFR: 8295066: Folding of loads is broken in C2 after JDK-8242115 In-Reply-To: <5VWY6hlnoGyt8nqJMnX14qp7bpCvm4G1enchLM6NGT8=.f3a1b91d-12fb-4422-99ff-cc0dcbf669c5@github.com> References: <5VWY6hlnoGyt8nqJMnX14qp7bpCvm4G1enchLM6NGT8=.f3a1b91d-12fb-4422-99ff-cc0dcbf669c5@github.com> Message-ID: On Tue, 25 Oct 2022 22:02:54 GMT, Vladimir Kozlov wrote: > Please, test full first 3 tier1-3 (not just hs-tier*). Done. Looks good. ------------- PR: https://git.openjdk.org/jdk/pull/10861 From kvn at openjdk.org Wed Oct 26 04:47:24 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 26 Oct 2022 04:47:24 GMT Subject: RFR: 8295066: Folding of loads is broken in C2 after JDK-8242115 In-Reply-To: References: <5VWY6hlnoGyt8nqJMnX14qp7bpCvm4G1enchLM6NGT8=.f3a1b91d-12fb-4422-99ff-cc0dcbf669c5@github.com> Message-ID: On Wed, 26 Oct 2022 04:15:39 GMT, Igor Veresov wrote: > > Please, test full first 3 tier1-3 (not just hs-tier*). > > Done. Looks good. Thank you for running them. ------------- PR: https://git.openjdk.org/jdk/pull/10861 From xlinzheng at openjdk.org Wed Oct 26 05:04:54 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Wed, 26 Oct 2022 05:04:54 GMT Subject: RFR: 8295926: RISC-V: C1: Fix LIRGenerator::do_LibmIntrinsic Message-ID: The ported logic of LIRGenerator::do_LibmIntrinsic has a correctness problem, which will kill argument registers when the current libm intrinsic's operand is also a libm intrinsic, such as: (dpow val1 (dlog val2)) LIRItem walks operands, so the `value.load_item_force(cc->at(0));` should be moved below after the LIRItem, or the result of `cc->at(0)` would be killed. But we might as well keep aligning AArch64's style to reduce some maintenance work. 
Reproducer:

public class A {

    static int count = 0;

    public static void print(double var) {
        if (count % 10000 == 0) {
            System.out.println(var);
        }
        count++;
    }

    public static void a(double var1, double var2, double var3) {
        double var4 = Math.pow(var3, Math.log(var1 / var2));
        print(var4);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 50000; i++) {
            double var21 = 2.2250738585072014E-308D;
            double var15 = 1.1102230246251565E-16D;
            double d1 = 2.0D;
            A.a(var21, var15, d1);
        }
    }
}

The right answer is

6.461124611136231E-203
6.461124611136231E-203
6.461124611136231E-203
6.461124611136231E-203
6.461124611136231E-203

The current backend gives

6.461124611136231E-203
NaN
NaN
NaN
NaN

Testing a hotspot tier1~4 on qemu.

Thanks,
Xiaolin

-------------

Commit messages:
 - Fix simply

Changes: https://git.openjdk.org/jdk/pull/10867/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10867&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8295926
Stats: 20 lines in 1 file changed: 15 ins; 4 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/10867.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/10867/head:pull/10867

PR: https://git.openjdk.org/jdk/pull/10867

From thartmann at openjdk.org Wed Oct 26 05:24:23 2022
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Wed, 26 Oct 2022 05:24:23 GMT
Subject: RFR: 8295066: Folding of loads is broken in C2 after JDK-8242115
In-Reply-To: 
References: 
Message-ID: <8rFROVmvN4pO0mGVlXs48VNkJ1c0D7UpiBarJIz7QJg=.31693a6d-f908-473b-bedb-f7cd824efb63@github.com>

On Tue, 25 Oct 2022 19:50:10 GMT, Igor Veresov wrote:

> The fix does two things:
>
> 1. Allow folding of pinned loads to constants with a straight line data flow (no phis).
> 2. Make scalarization aware of the new shape of the barriers so that pre-loads can be ignored.
>
> Testing is clean, Valhalla testing is clean too.

That looks good to me too.

-------------

Marked as reviewed by thartmann (Reviewer).
PR: https://git.openjdk.org/jdk/pull/10861 From thartmann at openjdk.org Wed Oct 26 05:36:24 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 26 Oct 2022 05:36:24 GMT Subject: RFR: JDK-8265441: IGV: select block nodes by clicking on it In-Reply-To: References: Message-ID: On Tue, 25 Oct 2022 13:59:50 GMT, Tobias Holenstein wrote: > In "Cluster nodes into blocks" mode, it is now possible to select all nodes in a block by simply double-clicking in the block. The attached images illustrate the new behavior, after double-clicking on block B10. Similarly, the current node selection can be extended with all nodes of a block when holding the Ctrl/Cmd-key and double-clicking on a block. > > ![select_block](https://user-images.githubusercontent.com/71546117/197827820-4edf3333-f0e8-4e77-849e-b8e09eaf67ef.png) > > > # Overview selection > new selection modes in **bold**. We refer to B4/B10 as _blocks_, and 86, 87, 88, ... as _nodes_ > > ## no key pressed > + `click on single node` : select single node, unselect all other nodes > + `click on edge` : select src/dest nodes, unselect all other nodes > + **`double-click on block` : select all nodes in block, unselect all other nodes** > + **`double-click outside of node/block` : unselect all nodes** > ## holding down Ctrl/Cmd > + `click on single node` : add node to current selection > + `click on edge` : invert the selection of src/dest nodes > + **`double-click on block` : add all nodes in block to current selection** > + `draw selection rectangle` : **invert** the selection of all nodes in rectangle > - select unselected nodes, **unselect selected nodes** > > > # Implementation > The main functionality was implemented by extending `BlockWidget` with `DoubleClickHandler` and adding methods `handleDoubleClick` / `addToSelection`. We also needed to replace `setSelectedNodes` with `clearSelectedNodes` and `addSelectedNodes` in `InputGraphProvider` and the corresponding methods in `EditorTopComponent`. 
All code that used `setSelectedNodes` needed to be adjusted accordingly. > > In order for the `DoubleClickHandler` in `BlockWidget` to work, we needed to extend `selectAction` in `DiagramScene` to _invert_ the selection of the nodes in the rectangle (the `symmetricDiff` set). Should un-select by double clicking on a block that has selected nodes also work? ------------- PR: https://git.openjdk.org/jdk/pull/10854 From thartmann at openjdk.org Wed Oct 26 05:40:27 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 26 Oct 2022 05:40:27 GMT Subject: RFR: 8291336: Add ideal rule to convert floating point multiply by 2 into addition [v5] In-Reply-To: References: Message-ID: On Mon, 24 Oct 2022 13:41:12 GMT, SuperCoder79 wrote: >> Hello, >> I would like to propose an ideal transform that converts floating point multiply by 2 (`x * 2`) into an addition operation instead. This would allow for the elimination of the memory reference for the constant two, and keep the whole operation inside registers. My justifications for this optimization include: >> * As per [Agner Fog's instruction tables](https://www.agner.org/optimize/instruction_tables.pdf) many older systems, such as the sandy bridge and ivy bridge architectures, have different latencies for addition and multiplication meaning this change could have beneficial effects when in hot code. >> * The removal of the memory load would have a beneficial effect in cache bound situations. >> * Multiplication by 2 is relatively common construct so this change can apply to a wide range of Java code. >> >> As this is my first time looking into the c2 codebase, I have a few lingering questions about my implementation and how certain parts of the compiler work. Mainly, is this patch getting the type of the operands correctly? I saw some cases where code used `bottom_type()` and other cases where it used `phase->type(value)`. Similarly, are nodes able to be reused as is being done in the AddNode constructors? 
I saw some places where the clone method was being used, but other places where it wasn't. >> >> I have attached an IR test and a jmh benchmark. Tier 1 testing passes on my machine. >> >> Thanks for your time, >> Jasmine > > SuperCoder79 has updated the pull request incrementally with one additional commit since the last revision: > > Apply changes from code review > > - Added interpreter assert Still looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9642 From thartmann at openjdk.org Wed Oct 26 05:48:19 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 26 Oct 2022 05:48:19 GMT Subject: RFR: 8295926: RISC-V: C1: Fix LIRGenerator::do_LibmIntrinsic In-Reply-To: References: Message-ID: <_S-vGQQbZCTlBB4Y1yilMzoUJQhhTSGafo8S9UDjIqQ=.33612fa4-bd16-45e7-b6e0-3402fd354c39@github.com> On Wed, 26 Oct 2022 04:57:11 GMT, Xiaolin Zheng wrote: > The ported logic of LIRGenerator::do_LibmIntrinsic has a correctness problem, which will kill argument registers when the current libm intrinsic's operand is also a libm intrinsic, such as: > > (dpow val1 (dlog val2)) > > LIRItem walks operands, so the `value.load_item_force(cc->at(0));` should be moved below after the LIRItem, or the result of `cc->at(0)` would be killed. But we might as well keep aligning AArch64's style to reduce some maintenance work. 
> > > Reproducer: > > > public class A { > > static int count = 0; > > public static void print(double var) { > if (count % 10000 == 0) { > System.out.println(var); > } > count++; > } > > public static void a(double var1, double var2, double var3) { > double var4 = Math.pow(var3, Math.log(var1 / var2)); > print(var4); > } > > public static void main(String[] args) { > > for (int i = 0; i < 50000; i++) { > double var21 = 2.2250738585072014E-308D; > double var15 = 1.1102230246251565E-16D; > double d1 = 2.0D; > A.a(var21, var15, d1); > } > > } > > } > > > The right answer is > > 6.461124611136231E-203 > 6.461124611136231E-203 > 6.461124611136231E-203 > 6.461124611136231E-203 > 6.461124611136231E-203 > > > The current backend gives > > 6.461124611136231E-203 > NaN > NaN > NaN > NaN > > > Testing a hotspot tier1~4 on qemu. > > Thanks, > Xiaolin Wouldn't it make sense to add the reproducer as a test? ------------- PR: https://git.openjdk.org/jdk/pull/10867 From xlinzheng at openjdk.org Wed Oct 26 05:53:21 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Wed, 26 Oct 2022 05:53:21 GMT Subject: RFR: 8295926: RISC-V: C1: Fix LIRGenerator::do_LibmIntrinsic In-Reply-To: <_S-vGQQbZCTlBB4Y1yilMzoUJQhhTSGafo8S9UDjIqQ=.33612fa4-bd16-45e7-b6e0-3402fd354c39@github.com> References: <_S-vGQQbZCTlBB4Y1yilMzoUJQhhTSGafo8S9UDjIqQ=.33612fa4-bd16-45e7-b6e0-3402fd354c39@github.com> Message-ID: <6pPQ-jOXNgKww_drdjpQz9pjfconuDLWUzAIGu1uA44=.8aa34eeb-b924-4902-b21e-1fad688d954f@github.com> On Wed, 26 Oct 2022 05:46:03 GMT, Tobias Hartmann wrote: > Wouldn't it make sense to add the reproducer as a test? Thank you for the advice and will make one. 
------------- PR: https://git.openjdk.org/jdk/pull/10867 From thartmann at openjdk.org Wed Oct 26 06:53:26 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 26 Oct 2022 06:53:26 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v3] In-Reply-To: References: <5ff-r2RgTNzao-sZ4D1kKWOPHWwzaCZxDDxyxl1Y0Us=.ae799d57-29ab-42c5-9908-a5811a8db0bc@github.com> Message-ID: <1YuBVz76PEE6DZyMIMY3YmErHlzdVV7-wqZCPMZAi0g=.e7d746ec-663e-4039-ab88-65c7c8851ef6@github.com> On Wed, 24 Aug 2022 02:23:33 GMT, Yi Yang wrote: >> I think we should add an IR verification test for [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585). > >> I think we should add an IR verification test for [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585). > > Yes, we need a verification test for it. I'll do this later. Comment to keep this open. @kelthuzadx, any update on this? ------------- PR: https://git.openjdk.org/jdk/pull/9695 From xlinzheng at openjdk.org Wed Oct 26 07:29:52 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Wed, 26 Oct 2022 07:29:52 GMT Subject: RFR: 8295926: RISC-V: C1: Fix LIRGenerator::do_LibmIntrinsic [v2] In-Reply-To: References: Message-ID: <7kRcAX3MU3MBW2ZGRA9dw1dDXAIR4vXtijetM16e704=.db8a0b31-1b5d-4ebc-a76a-fc0f6ffeaecf@github.com> > The ported logic of LIRGenerator::do_LibmIntrinsic has a correctness problem, which will kill argument registers when the current libm intrinsic's operand is also a libm intrinsic, such as: > > (dpow val1 (dlog val2)) > > LIRItem walks operands, so the `value.load_item_force(cc->at(0));` should be moved below after the LIRItem, or the result of `cc->at(0)` would be killed. But we might as well keep aligning AArch64's style to reduce some maintenance work. 
> > > Reproducer: > > > public class A { > > static int count = 0; > > public static void print(double var) { > if (count % 10000 == 0) { > System.out.println(var); > } > count++; > } > > public static void a(double var1, double var2, double var3) { > double var4 = Math.pow(var3, Math.log(var1 / var2)); > print(var4); > } > > public static void main(String[] args) { > > for (int i = 0; i < 50000; i++) { > double var21 = 2.2250738585072014E-308D; > double var15 = 1.1102230246251565E-16D; > double d1 = 2.0D; > A.a(var21, var15, d1); > } > > } > > } > > > The right answer is > > 6.461124611136231E-203 > 6.461124611136231E-203 > 6.461124611136231E-203 > 6.461124611136231E-203 > 6.461124611136231E-203 > > > The current backend gives > > 6.461124611136231E-203 > NaN > NaN > NaN > NaN > > > Testing a hotspot tier1~4 on qemu. > > Thanks, > Xiaolin Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: Add one test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10867/files - new: https://git.openjdk.org/jdk/pull/10867/files/6706d1af..e686497c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10867&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10867&range=00-01 Stats: 88 lines in 1 file changed: 88 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10867.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10867/head:pull/10867 PR: https://git.openjdk.org/jdk/pull/10867 From xlinzheng at openjdk.org Wed Oct 26 07:29:52 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Wed, 26 Oct 2022 07:29:52 GMT Subject: RFR: 8295926: RISC-V: C1: Fix LIRGenerator::do_LibmIntrinsic In-Reply-To: References: Message-ID: On Wed, 26 Oct 2022 04:57:11 GMT, Xiaolin Zheng wrote: > The ported logic of LIRGenerator::do_LibmIntrinsic has a correctness problem, which will kill argument registers when the current libm intrinsic's operand is also a libm intrinsic, such as: > > (dpow val1 (dlog 
val2)) > > LIRItem walks operands, so the `value.load_item_force(cc->at(0));` should be moved below after the LIRItem, or the result of `cc->at(0)` would be killed. But we might as well keep aligning AArch64's style to reduce some maintenance work. > > > Reproducer: > > > public class A { > > static int count = 0; > > public static void print(double var) { > if (count % 10000 == 0) { > System.out.println(var); > } > count++; > } > > public static void a(double var1, double var2, double var3) { > double var4 = Math.pow(var3, Math.log(var1 / var2)); > print(var4); > } > > public static void main(String[] args) { > > for (int i = 0; i < 50000; i++) { > double var21 = 2.2250738585072014E-308D; > double var15 = 1.1102230246251565E-16D; > double d1 = 2.0D; > A.a(var21, var15, d1); > } > > } > > } > > > The right answer is > > 6.461124611136231E-203 > 6.461124611136231E-203 > 6.461124611136231E-203 > 6.461124611136231E-203 > 6.461124611136231E-203 > > > The current backend gives > > 6.461124611136231E-203 > NaN > NaN > NaN > NaN > > > Testing a hotspot tier1~4 on qemu. > > Thanks, > Xiaolin The test is copied and modified from Roland's `TestPow2.java` in the same folder. On x86, aarch64 and riscv with this patch it passes; riscv without this patch shows interpreter = 2.5355263553695413 c1 = 0.844936682323691 c2 = 2.5355263553695413 and fails. 
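The expected value can also be cross-checked on the host JVM without a RISC-V machine; below is a quick sanity-check sketch using `java.lang.Math`, under the assumption that its results agree with libm's to within a few ulps (which the tolerance absorbs). It is not part of the jtreg test itself.

```java
// Cross-check of the expected value from the failing run above: evaluate the
// same nested libm chain step by step and compare against 2.5355263553695413,
// the value reported by the interpreter and C2.
public class LibmChainCheck {
    static double chain(double v) {
        return Math.pow(v,
                Math.sin(Math.cos(Math.tan(Math.log(Math.log10(Math.exp(v)))))));
    }

    public static void main(String[] args) {
        double r = chain(3.1415926);
        System.out.println(r);  // should be ~2.5355263553695413
        if (Math.abs(r - 2.5355263553695413) > 1e-9) {
            throw new AssertionError("unexpected value: " + r);
        }
    }
}
```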
FYI python3

>>> math.exp(3.1415926)
23.14069139267437
>>> math.log10(math.exp(3.1415926))
1.3643763305680898
>>> math.log(math.log10(math.exp(3.1415926)))
0.3106974235432832
>>> math.tan(math.log(math.log10(math.exp(3.1415926))))
0.32109666300670675
>>> math.cos(math.tan(math.log(math.log10(math.exp(3.1415926)))))
0.94888987383311
>>> math.sin(math.cos(math.tan(math.log(math.log10(math.exp(3.1415926))))))
0.812769262085064
>>> math.pow(3.1415926, math.sin(math.cos(math.tan(math.log(math.log10(math.exp(3.1415926)))))))
2.5355263553695413

Thanks,
Xiaolin

-------------

PR: https://git.openjdk.org/jdk/pull/10867

From tholenstein at openjdk.org Wed Oct 26 07:56:26 2022
From: tholenstein at openjdk.org (Tobias Holenstein)
Date: Wed, 26 Oct 2022 07:56:26 GMT
Subject: RFR: JDK-8265441: IGV: select block nodes by clicking on it
In-Reply-To: 
References: 
Message-ID: 

On Wed, 26 Oct 2022 05:34:04 GMT, Tobias Hartmann wrote:

> Should un-select by double clicking on a block that has selected nodes also work?

Hi @TobiHartmann, thanks for the suggestion! I decided to leave this out for the following reasons:

- What happens if some of the nodes in e.g. B10 are already selected when the block is double-clicked? We could either unselect them all or select them all.
- If a user double-clicks on a block and then unselects its nodes one by one, should the next double-click select or unselect all of them? The user probably expects that all nodes get selected, since all are unselected, but the "select state" of the block would still be "selected" from the last double-click. Of course we could update the block state for every change in the selection that we make, but I think that makes things complicated.

Another option would be to always invert the selection (like with the rectangle selection), but I don't think this is very intuitive for the user.

My suggestion is to leave this out for the moment. If desired, it can still be introduced in the future.
------------- PR: https://git.openjdk.org/jdk/pull/10854 From thartmann at openjdk.org Wed Oct 26 08:10:20 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 26 Oct 2022 08:10:20 GMT Subject: RFR: JDK-8265441: IGV: select block nodes by clicking on it In-Reply-To: References: Message-ID: On Tue, 25 Oct 2022 13:59:50 GMT, Tobias Holenstein wrote: > In "Cluster nodes into blocks" mode, it is now possible to select all nodes in a block by simply double-clicking in the block. The attached images illustrate the new behavior, after double-clicking on block B10. Similarly, the current node selection can be extended with all nodes of a block when holding the Ctrl/Cmd-key and double-clicking on a block. > > ![select_block](https://user-images.githubusercontent.com/71546117/197827820-4edf3333-f0e8-4e77-849e-b8e09eaf67ef.png) > > > # Overview selection > new selection modes in **bold**. We refer to B4/B10 as _blocks_, and 86, 87, 88, ... as _nodes_ > > ## no key pressed > + `click on single node` : select single node, unselect all other nodes > + `click on edge` : select src/dest nodes, unselect all other nodes > + **`double-click on block` : select all nodes in block, unselect all other nodes** > + **`double-click outside of node/block` : unselect all nodes** > ## holding down Ctrl/Cmd > + `click on single node` : add node to current selection > + `click on edge` : invert the selection of src/dest nodes > + **`double-click on block` : add all nodes in block to current selection** > + `draw selection rectangle` : **invert** the selection of all nodes in rectangle > - select unselected nodes, **unselect selected nodes** > > > # Implementation > The main functionality was implemented by extending `BlockWidget` with `DoubleClickHandler` and adding methods `handleDoubleClick` / `addToSelection`. We also needed to replace `setSelectedNodes` with `clearSelectedNodes` and `addSelectedNodes` in `InputGraphProvider` and the corresponding methods in `EditorTopComponent`. 
All code that used `setSelectedNodes` needed to be adjusted accordingly. > > In order for the `DoubleClickHandler` in `BlockWidget` to work, we needed to extend `selectAction` in `DiagramScene` to _invert_ the selection of the nodes in the rectangle (the `symmetricDiff` set). Okay, makes sense to me. Thanks for the explanation. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10854 From fgao at openjdk.org Wed Oct 26 08:24:33 2022 From: fgao at openjdk.org (Fei Gao) Date: Wed, 26 Oct 2022 08:24:33 GMT Subject: RFR: 8291781: assert(!is_visited) failed: visit only once with -XX:+SuperWordRTDepCheck Message-ID: <4FLpNNSA_3fCYWK0-_5EUq35VUt3kKNu_tCw8v3l0co=.8df43322-a806-4a3c-bc3f-96e184b47914@github.com> `-XX:+SuperWordRTDepCheck` is a develop flag and misses proper implementation. But when enabled, it could change code path, resulting in the failures in [JDK-8291781](https://bugs.openjdk.org/browse/JDK-8291781) and [JDK-8291881](https://bugs.openjdk.org/browse/JDK-8291881). As @vnkozlov suggested in [JDK-8291781](https://bugs.openjdk.org/browse/JDK-8291781), the small patch converts the flag to pure debug code to avoid effect on code generation. 
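The principle behind the fix generalizes: a diagnostic controlled by a develop flag should observe and report, never feed back into what gets generated. A toy sketch in Python (illustrative only; the real flag gates C++ code in SuperWord):

```python
def select_packs(nodes, rt_dep_check=False):
    # stand-in for real pack selection; the diagnostic must not alter it
    packs = [n for n in nodes if n % 2 == 0]
    if rt_dep_check:
        # purely observational: may inspect and report, must not touch `packs`
        print(f"dep check inspected {len(list(nodes))} candidates")
    return packs

# enabling the diagnostic must not change the generated result
assert select_packs(range(8), True) == select_packs(range(8), False)
```

The bug pattern being fixed is the opposite shape: a check that, as a side effect, changed which code path was taken, so enabling the flag changed code generation.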
------------- Commit messages: - 8291781: assert(!is_visited) failed: visit only once with -XX:+SuperWordRTDepCheck Changes: https://git.openjdk.org/jdk/pull/10868/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10868&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8291781 Stats: 4 lines in 2 files changed: 1 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/10868.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10868/head:pull/10868 PR: https://git.openjdk.org/jdk/pull/10868 From xlinzheng at openjdk.org Wed Oct 26 09:04:48 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Wed, 26 Oct 2022 09:04:48 GMT Subject: RFR: 8295926: RISC-V: C1: Fix LIRGenerator::do_LibmIntrinsic [v3] In-Reply-To: References: Message-ID: > The ported logic of LIRGenerator::do_LibmIntrinsic has a correctness problem, which will kill argument registers when the current libm intrinsic's operand is also a libm intrinsic, such as: > > (dpow val1 (dlog val2)) > > LIRItem walks operands, so the `value.load_item_force(cc->at(0));` should be moved below after the LIRItem, or the result of `cc->at(0)` would be killed. But we might as well keep aligning AArch64's style to reduce some maintenance work. 
>
>
> Reproducer:
>
> public class A {
>
>     static int count = 0;
>
>     public static void print(double var) {
>         if (count % 10000 == 0) {
>             System.out.println(var);
>         }
>         count++;
>     }
>
>     public static void a(double var1, double var2, double var3) {
>         double var4 = Math.pow(var3, Math.log(var1 / var2));
>         print(var4);
>     }
>
>     public static void main(String[] args) {
>         for (int i = 0; i < 50000; i++) {
>             double var21 = 2.2250738585072014E-308D;
>             double var15 = 1.1102230246251565E-16D;
>             double d1 = 2.0D;
>             A.a(var21, var15, d1);
>         }
>     }
> }
>
> The right answer is
>
> 6.461124611136231E-203
> 6.461124611136231E-203
> 6.461124611136231E-203
> 6.461124611136231E-203
> 6.461124611136231E-203
>
> The current backend gives
>
> 6.461124611136231E-203
> NaN
> NaN
> NaN
> NaN
>
> Testing a hotspot tier1~4 on qemu.
>
> Thanks,
> Xiaolin

Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision:

  Maybe a license

------------- Changes:
- all: https://git.openjdk.org/jdk/pull/10867/files
- new: https://git.openjdk.org/jdk/pull/10867/files/e686497c..01e54b45

Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=10867&range=02
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=10867&range=01-02

Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/10867.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/10867/head:pull/10867

PR: https://git.openjdk.org/jdk/pull/10867

From yyang at openjdk.org Wed Oct 26 11:15:28 2022
From: yyang at openjdk.org (Yi Yang)
Date: Wed, 26 Oct 2022 11:15:28 GMT
Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v3]
In-Reply-To: References: Message-ID:

On Tue, 25 Oct 2022 22:29:53 GMT, Vladimir Kozlov wrote:

> EA may incorrectly process LoadB#971 when it splits a unique memory slice for the allocation instance.
> LoadB#971 memory should not reference a Phi with a different type, and EA should look through MergeMem nodes when creating a unique memory slice. If LoadB#971 is a load from a different object (or a merge of objects), then its AddP node should not be changed to an instance-specific one. Would be nice to see this subgraph just after EA.

@vnkozlov LoadB#971 is not processed by EA; it was expanded by ArrayCopyNode#338 (which is processed by EA) in arraycopy_forward:

    MergeMem#466
     |  |
     v
    ArrayCopy#337
     |
     v
    Proj#242
     |  |
     v
    ArrayCopy#338  (_src_type/_dest_type has precise types)

https://github.com/openjdk/jdk/blob/78454b69da1434da18193d32813c59126348c9ea/src/hotspot/share/opto/arraycopynode.cpp#L383-L408

mem is Proj#242, mm (`960 MergeMem === _ 1 242 1 1 967 [[]]`) is based on mem, and _src_type and _dest_type are precise types (`byte[int:8]:NotNull:exact+any *,iid=177`) after [JDK-8233164](https://bugs.openjdk.org/browse/JDK-8233164). When generating LoadB#971, we cannot find the _src_type alias index (32) in mm, so it uses the base memory (Proj#242) as its memory input:

    242 Proj  === 337 [[ ... ]] #2 Memory: @BotPTR *+bot, idx=Bot;
    971 LoadB === 958 242 969

------------- PR: https://git.openjdk.org/jdk/pull/9777

From tholenstein at openjdk.org Wed Oct 26 14:06:48 2022
From: tholenstein at openjdk.org (Tobias Holenstein)
Date: Wed, 26 Oct 2022 14:06:48 GMT
Subject: RFR: JDK-8265441: IGV: select block nodes by clicking on it
In-Reply-To: References: Message-ID:

On Wed, 26 Oct 2022 08:08:05 GMT, Tobias Hartmann wrote:

>> In "Cluster nodes into blocks" mode, it is now possible to select all nodes in a block by simply double-clicking in the block. The attached images illustrate the new behavior, after double-clicking on block B10. Similarly, the current node selection can be extended with all nodes of a block when holding the Ctrl/Cmd-key and double-clicking on a block.
>> >> ![select_block](https://user-images.githubusercontent.com/71546117/197827820-4edf3333-f0e8-4e77-849e-b8e09eaf67ef.png) >> >> >> # Overview selection >> new selection modes in **bold**. We refer to B4/B10 as _blocks_, and 86, 87, 88, ... as _nodes_ >> >> ## no key pressed >> + `click on single node` : select single node, unselect all other nodes >> + `click on edge` : select src/dest nodes, unselect all other nodes >> + **`double-click on block` : select all nodes in block, unselect all other nodes** >> + **`double-click outside of node/block` : unselect all nodes** >> ## holding down Ctrl/Cmd >> + `click on single node` : add node to current selection >> + `click on edge` : invert the selection of src/dest nodes >> + **`double-click on block` : add all nodes in block to current selection** >> + `draw selection rectangle` : **invert** the selection of all nodes in rectangle >> - select unselected nodes, **unselect selected nodes** >> >> >> # Implementation >> The main functionality was implemented by extending `BlockWidget` with `DoubleClickHandler` and adding methods `handleDoubleClick` / `addToSelection`. We also needed to replace `setSelectedNodes` with `clearSelectedNodes` and `addSelectedNodes` in `InputGraphProvider` and the corresponding methods in `EditorTopComponent`. All code that used `setSelectedNodes` needed to be adjusted accordingly. >> >> In order for the `DoubleClickHandler` in `BlockWidget` to work, we needed to extend `selectAction` in `DiagramScene` to _invert_ the selection of the nodes in the rectangle (the `symmetricDiff` set). > > Okay, makes sense to me. Thanks for the explanation. Thank you @TobiHartmann and @robcasloz for the reviews! 
------------- PR: https://git.openjdk.org/jdk/pull/10854 From tholenstein at openjdk.org Wed Oct 26 14:08:27 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 26 Oct 2022 14:08:27 GMT Subject: Integrated: JDK-8265441: IGV: select block nodes by clicking on it In-Reply-To: References: Message-ID: On Tue, 25 Oct 2022 13:59:50 GMT, Tobias Holenstein wrote: > In "Cluster nodes into blocks" mode, it is now possible to select all nodes in a block by simply double-clicking in the block. The attached images illustrate the new behavior, after double-clicking on block B10. Similarly, the current node selection can be extended with all nodes of a block when holding the Ctrl/Cmd-key and double-clicking on a block. > > ![select_block](https://user-images.githubusercontent.com/71546117/197827820-4edf3333-f0e8-4e77-849e-b8e09eaf67ef.png) > > > # Overview selection > new selection modes in **bold**. We refer to B4/B10 as _blocks_, and 86, 87, 88, ... as _nodes_ > > ## no key pressed > + `click on single node` : select single node, unselect all other nodes > + `click on edge` : select src/dest nodes, unselect all other nodes > + **`double-click on block` : select all nodes in block, unselect all other nodes** > + **`double-click outside of node/block` : unselect all nodes** > ## holding down Ctrl/Cmd > + `click on single node` : add node to current selection > + `click on edge` : invert the selection of src/dest nodes > + **`double-click on block` : add all nodes in block to current selection** > + `draw selection rectangle` : **invert** the selection of all nodes in rectangle > - select unselected nodes, **unselect selected nodes** > > > # Implementation > The main functionality was implemented by extending `BlockWidget` with `DoubleClickHandler` and adding methods `handleDoubleClick` / `addToSelection`. 
We also needed to replace `setSelectedNodes` with `clearSelectedNodes` and `addSelectedNodes` in `InputGraphProvider` and the corresponding methods in `EditorTopComponent`. All code that used `setSelectedNodes` needed to be adjusted accordingly. > > In order for the `DoubleClickHandler` in `BlockWidget` to work, we needed to extend `selectAction` in `DiagramScene` to _invert_ the selection of the nodes in the rectangle (the `symmetricDiff` set). This pull request has now been integrated. Changeset: 31359143 Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/313591436202d6259c1f9ae6d50ff7c59b5b0710 Stats: 260 lines in 16 files changed: 72 ins; 104 del; 84 mod 8265441: IGV: select block nodes by clicking on it Reviewed-by: rcastanedalo, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/10854 From kvn at openjdk.org Wed Oct 26 15:15:26 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 26 Oct 2022 15:15:26 GMT Subject: RFR: 8291781: assert(!is_visited) failed: visit only once with -XX:+SuperWordRTDepCheck In-Reply-To: <4FLpNNSA_3fCYWK0-_5EUq35VUt3kKNu_tCw8v3l0co=.8df43322-a806-4a3c-bc3f-96e184b47914@github.com> References: <4FLpNNSA_3fCYWK0-_5EUq35VUt3kKNu_tCw8v3l0co=.8df43322-a806-4a3c-bc3f-96e184b47914@github.com> Message-ID: On Wed, 26 Oct 2022 08:17:00 GMT, Fei Gao wrote: > `-XX:+SuperWordRTDepCheck` is a develop flag and misses proper implementation. But when enabled, it could change code path, resulting in the failures in [JDK-8291781](https://bugs.openjdk.org/browse/JDK-8291781) and [JDK-8291881](https://bugs.openjdk.org/browse/JDK-8291881). As @vnkozlov suggested in [JDK-8291781](https://bugs.openjdk.org/browse/JDK-8291781), the small patch converts the flag to pure debug code to avoid effect on code generation. Looks good. Please, also verify fix with `compiler/codegen/Test*Vect.java` tests which failed according to [JDK-8291881](https://bugs.openjdk.org/browse/JDK-8291881) ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10868

From duke at openjdk.org Wed Oct 26 15:30:23 2022
From: duke at openjdk.org (vpaprotsk)
Date: Wed, 26 Oct 2022 15:30:23 GMT
Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5]
In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com>
Message-ID:

On Tue, 25 Oct 2022 21:57:34 GMT, Jamil Nimeh wrote:

>> vpaprotsk has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   extra whitespace character
>
> src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 296:
>
>> 294:         keyBytes[12] &= (byte)252;
>> 295:
>> 296:         // This should be enabled, but Poly1305KAT would fail
>
> I'm on the fence about this change. I have no problem with it in basic terms. If we ever decided to make this a general-purpose Mac in JCE then this would definitely be good to do. As of right now, the only consumer is ChaCha20 and it would submit a key through the process in the RFC. Seems really unlikely to run afoul of these checks, but admittedly not impossible.
>
> I would agree with @sviswa7 that we could examine this in a separate change and we could look at other approaches to getting around the KAT issue, perhaps some package-private based way to disable the check. As long as Poly1305 remains with package-private visibility, one could make another form of the constructor with a boolean that would disable this check and that is the constructor that the KAT would use. This is just an off-the-cuff idea, but one way we might get the best of both worlds.
>
> If we move this down the road then we should remove the commenting. We can refer back to this PR later.

I think I will remove the check for now; I don't want to hold up reviews. I wasn't sure how to 'inject a backdoor' to the commented-out check either, or at least how to do it in an acceptable way.
Your ideas do sound plausible, and if anyone does want this check, I can implement one of the ideas (package-private boolean flag? turn it on in the test) while waiting for more reviews to come in. The comment about ChaCha being the only way in is also relevant, thanks; i.e. this is a private class today.

------------- PR: https://git.openjdk.org/jdk/pull/10582

From duke at openjdk.org Wed Oct 26 15:51:22 2022
From: duke at openjdk.org (vpaprotsk)
Date: Wed, 26 Oct 2022 15:51:22 GMT
Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5]
In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com>
Message-ID: <4FY4SEodgFcdxFXvGWFJWHYCr1GD4nAktLa5SiyPcxM=.384b2818-b6c5-4523-8682-5b730d9ad036@github.com>

On Tue, 25 Oct 2022 23:48:49 GMT, Sandhya Viswanathan wrote:

>> vpaprotsk has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   extra whitespace character
>
> src/hotspot/cpu/x86/macroAssembler_x86_poly.cpp line 806:
>
>> 804:   evmovdquq(A0, Address(rsp, 64*0), Assembler::AVX_512bit);
>> 805:   evmovdquq(A0, Address(rsp, 64*1), Assembler::AVX_512bit);
>> 806:   evmovdquq(A0, Address(rsp, 64*2), Assembler::AVX_512bit);
>
> This is a load from the stack into A0. Did you intend to store A0 (cleanup) into the stack local area here? I think the source and destination are mixed up here.

Wow!
Thank you for spotting this

------------- PR: https://git.openjdk.org/jdk/pull/10582

From duke at openjdk.org Wed Oct 26 15:51:23 2022
From: duke at openjdk.org (vpaprotsk)
Date: Wed, 26 Oct 2022 15:51:23 GMT
Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5]
In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com>
Message-ID:

On Tue, 25 Oct 2022 21:48:47 GMT, Jamil Nimeh wrote:

>> vpaprotsk has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   extra whitespace character
>
> src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 171:
>
>> 169:         }
>> 170:
>> 171:         if (len >= 1024) {
>
> Out of curiosity, do you have any perf numbers for the impact of this change on systems that do not support AVX512? Does this help or hurt (or make a negligible impact) on poly1305 updates when the input is 1K or larger?

(The first commit in this PR actually has the code without the check if anyone wants to measure.. well, it's also trivial to edit..)

I measured about 50% slowdown on 64-byte payloads. One could argue that 64 bytes is not all that representative, but we don't get much out of assembler at that load either, so it didn't seem worth it to figure out some sort of platform check.

AVX512 needs at least 256 bytes = 16 blocks.. there is also the overhead of pre-calculating powers of R that needs to be amortized. Assembler does fall back to 64-bit multiplies for <256, while the Java version will have to use the 32-bit multiplies.
Most/all of the slowdown comes from this need of copying limbs out/in.. I am looking at perhaps copying limbs out in the intrinsic instead. Not very 'pretty'.. limbs are hidden in a nested private class behind an interface.. I would be breaking what is a good design with neat encapsulation. (I accidentally forced-pushed that earlier, if you are curious; non-working). The current version of this code seems more robust in the long term? ------------- PR: https://git.openjdk.org/jdk/pull/10582 From jnimeh at openjdk.org Wed Oct 26 20:48:24 2022 From: jnimeh at openjdk.org (Jamil Nimeh) Date: Wed, 26 Oct 2022 20:48:24 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: On Wed, 26 Oct 2022 15:47:08 GMT, vpaprotsk wrote: >> src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 171: >> >>> 169: } >>> 170: >>> 171: if (len >= 1024) { >> >> Out of curiosity, do you have any perf numbers for the impact of this change on systems that do not support AVX512? Does this help or hurt (or make a negligible impact) on poly1305 updates when the input is 1K or larger? > > (The first commit in this PR actually has the code without the check if anyone wants to measure.. well its also trivial to edit..) > > I measured about 50% slowdown on 64 byte payloads. One could argue that 64 bytes is not all that representative, but we don't get much out of assembler at that load either so it didn't seem worth it to figure out some sort of platform check. > > AVX512 needs at least 256 = 16 blocks.. there is overhead also pre-calculating powers of R that needs to be amortized. Assembler does fall back to 64-bit multiplies for <256, while the Java version will have to use the 32-bit multiplies. 
<256, purely scalar, non-vector, 64 vs 32 is not _that_ big an issue though; the algorithm is plenty happy with 26-bit limbs, and whatever the benefit of 64, it gets erased by the interface-matching code copying limbs in and out.. > > Right now, I measured 1k with `-XX:-UsePolyIntrinsics` to be about 10% slower. I think its acceptable, in order to get 18x? > > Most/all of the slowdown comes from this need of copying limbs out/in.. I am looking at perhaps copying limbs out in the intrinsic instead. Not very 'pretty'.. limbs are hidden in a nested private class behind an interface.. I would be breaking what is a good design with neat encapsulation. (I accidentally forced-pushed that earlier, if you are curious; non-working). The current version of this code seems more robust in the long term? 10% is not a negligible impact. I see your point about AVX512 reaping the rewards of this change, but there are plenty of x86_64 systems without AVX512 that will be impacted, not to mention other platforms like aarch64 which (for this change at least) will never see the benefits from the intrinsic. I don't have any suggestions right at this moment for how this could be streamlined at all to help reduce the pain for non-AVX512 systems. Worth looking into though. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From iveresov at openjdk.org Wed Oct 26 20:49:33 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Wed, 26 Oct 2022 20:49:33 GMT Subject: RFR: 8295066: Folding of loads is broken in C2 after JDK-8242115 In-Reply-To: References: Message-ID: <3judUFx-evWUXwoahXsErqBiA8XwbQpBLuJQU4HqnSE=.a7e1d5e8-e18c-4d1e-9d8b-f69bfc6f045e@github.com> On Tue, 25 Oct 2022 19:50:10 GMT, Igor Veresov wrote: > The fix does two things: > > 1. Allow folding of pinned loads to constants with a straight line data flow (no phis). > 2. Make scalarization aware of the new shape of the barriers so that pre-loads can be ignored. > > Testing is clean, Valhalla testing is clean too. 
Thanks for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/10861 From iveresov at openjdk.org Wed Oct 26 20:49:34 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Wed, 26 Oct 2022 20:49:34 GMT Subject: Integrated: 8295066: Folding of loads is broken in C2 after JDK-8242115 In-Reply-To: References: Message-ID: On Tue, 25 Oct 2022 19:50:10 GMT, Igor Veresov wrote: > The fix does two things: > > 1. Allow folding of pinned loads to constants with a straight line data flow (no phis). > 2. Make scalarization aware of the new shape of the barriers so that pre-loads can be ignored. > > Testing is clean, Valhalla testing is clean too. This pull request has now been integrated. Changeset: 58a7141a Author: Igor Veresov URL: https://git.openjdk.org/jdk/commit/58a7141a0dea5d1b4bfe6d56a95d860c854b3461 Stats: 260 lines in 9 files changed: 178 ins; 46 del; 36 mod 8295066: Folding of loads is broken in C2 after JDK-8242115 Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/10861 From jnimeh at openjdk.org Wed Oct 26 21:15:25 2022 From: jnimeh at openjdk.org (Jamil Nimeh) Date: Wed, 26 Oct 2022 21:15:25 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: On Wed, 26 Oct 2022 20:45:57 GMT, Jamil Nimeh wrote: >> (The first commit in this PR actually has the code without the check if anyone wants to measure.. well its also trivial to edit..) >> >> I measured about 50% slowdown on 64 byte payloads. One could argue that 64 bytes is not all that representative, but we don't get much out of assembler at that load either so it didn't seem worth it to figure out some sort of platform check. >> >> AVX512 needs at least 256 = 16 blocks.. there is overhead also pre-calculating powers of R that needs to be amortized. 
Assembler does fall back to 64-bit multiplies for <256, while the Java version will have to use the 32-bit multiplies. <256, purely scalar, non-vector, 64 vs 32 is not _that_ big an issue though; the algorithm is plenty happy with 26-bit limbs, and whatever the benefit of 64, it gets erased by the interface-matching code copying limbs in and out.. >> >> Right now, I measured 1k with `-XX:-UsePolyIntrinsics` to be about 10% slower. I think its acceptable, in order to get 18x? >> >> Most/all of the slowdown comes from this need of copying limbs out/in.. I am looking at perhaps copying limbs out in the intrinsic instead. Not very 'pretty'.. limbs are hidden in a nested private class behind an interface.. I would be breaking what is a good design with neat encapsulation. (I accidentally forced-pushed that earlier, if you are curious; non-working). The current version of this code seems more robust in the long term? > > 10% is not a negligible impact. I see your point about AVX512 reaping the rewards of this change, but there are plenty of x86_64 systems without AVX512 that will be impacted, not to mention other platforms like aarch64 which (for this change at least) will never see the benefits from the intrinsic. > > I don't have any suggestions right at this moment for how this could be streamlined at all to help reduce the pain for non-AVX512 systems. Worth looking into though. One small thing maybe: It doesn't look like R in `processMultipleBlocks` and `rbytes` ever changes, so maybe there's no need to repeatedly serialize/deserialize them on every call to engineUpdate? There is already an `r` that is attached to the object that is an IntegerModuloP. Could that be used in `processMultipleBlocks` and perhaps a private byte[] for a serialized r is also a field in Poly1305 that can be passed into the intrinsic method rather than creating it every time? It could be set in `setRSVals`. Perhaps we can recover a little performance there? 
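To ground the discussion: r is the clamped low half of the 32-byte key and never changes after it is set, which is why its serialized form is a candidate for caching. A compact pure-Python sketch of the RFC 8439 computation (names are illustrative, not the JDK's) that reproduces the RFC's test vector:

```python
def poly1305_tag(key: bytes, msg: bytes) -> bytes:
    # r = low 16 key bytes, clamped; the mask below is where the
    # `keyBytes[12] &= (byte)252` seen earlier in the thread comes from
    r = int.from_bytes(key[:16], "little") & 0x0ffffffc0ffffffc0ffffffc0fffffff
    s = int.from_bytes(key[16:], "little")
    p = (1 << 130) - 5
    acc = 0
    for i in range(0, len(msg), 16):
        # every (possibly short) block gets one extra high byte of 0x01
        n = int.from_bytes(msg[i:i + 16] + b"\x01", "little")
        acc = ((acc + n) * r) % p
    return ((acc + s) & ((1 << 128) - 1)).to_bytes(16, "little")

# RFC 8439 section 2.5.2 test vector
key = bytes.fromhex("85d6be7857556d337f4452fe42d506a8"
                    "0103808afb0db2fd4abff6af4149f51b")
tag = poly1305_tag(key, b"Cryptographic Forum Research Group")
assert tag.hex() == "a8061dc1305136c6c22b8baf0c0127a9"
```

An optimized implementation splits `acc` into 26-bit limbs and precomputes powers of r to process many blocks per iteration, but the per-block recurrence is exactly the one above.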
------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Wed Oct 26 23:07:01 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 26 Oct 2022 23:07:01 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v13] In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 16:50:28 GMT, Cesar Soares Lucas wrote: >> Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? >> >> The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: >> 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). >> 2) Scalar Replace the incoming allocations to the RAM node. >> 3) Scalar Replace the RAM node itself. >> >> There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: >> >> - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. >> >> These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: >> >> - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. >> - The way I check if there is an incoming Allocate node to the original Phi node. >> - The way I check if there is no store to the merged objects after they are merged. >> >> Testing: >> - Windows/Linux/MAC fastdebug/release >> - hotspot_all >> - tier1 >> - Renaissance >> - dacapo >> - new IR-based tests > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Fix x86 tests. Thanks for the clarifications, Cesar. 
The concept of RAM node still looks like a hack to me, and I'd like the patch to better fit the overall design of EA rather than trying to work around different peculiarities of the current implementation. I'll try to elaborate on why I see RAMs as redundant. As of now, a RAM serves a dual purpose: it (1) marks a merge point as safe to be untangled during SR; and (2) caches information about field values.

I believe you can solve it in a cleaner manner without introducing placeholder nodes and connection graph adjustments. IMO it's all about keeping escape status and properly handling "safe" merges in split_unique_types.

One possible way to handle merge points is:

* Handle merge points in adjust_scalar_replaceable_state and refrain from marking relevant bases as NSR when possible.
* After adjust_scalar_replaceable_state is over, every merge point should have all its inputs as either NSR or SR.
* split_unique_types incrementally builds value phis to eventually replace the base phi at the merge point while processing SR allocations one by one.
* After split_unique_types is done, there are no merge points anymore; each allocation has a dedicated memory graph and allocation elimination can proceed as before.

Do you see any problems with such an approach?

One thing still confuses me though: the patch mentions that RAMs can merge both eliminated and not-yet-eliminated allocations. What's the intended use case? I believe it's still required that all merged allocations eventually be eliminated. Do you try to handle the case during allocation elimination when part of the inputs are already eliminated and the rest are pending their turn?
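The split_unique_types step sketched above, replacing the base phi over object identities with one value phi per field, can be pictured with a toy model in Python (purely illustrative; the real transformation rewrites C2 Ideal nodes):

```python
def build_value_phis(merge_inputs, fields):
    # merge_inputs: one dict per predecessor, mapping each field of a
    # scalar-replaceable allocation to its last stored value.
    # result: one "phi" of plain values per field, so the base phi over
    # merged object identities is no longer needed
    return {f: tuple(alloc[f] for alloc in merge_inputs) for f in fields}

left = {"x": 1, "y": 2}    # allocation reaching the merge from one branch
right = {"x": 3, "y": 4}   # allocation reaching it from the other branch
phis = build_value_phis([left, right], ["x", "y"])
# {'x': (1, 3), 'y': (2, 4)}
```

Once every load downstream of the merge reads from the per-field phis, the merged allocations each have a dedicated memory graph and can be eliminated one by one.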
------------- PR: https://git.openjdk.org/jdk/pull/9073

From vlivanov at openjdk.org Wed Oct 26 23:13:12 2022
From: vlivanov at openjdk.org (Vladimir Ivanov)
Date: Wed, 26 Oct 2022 23:13:12 GMT
Subject: RFR: 8289943: Simplify some object allocation merges [v13]
In-Reply-To: <2oRk_zhIijFndoBhzxjsMFizcXEdjIF2Iryi5DSstCA=.1e9a06f9-1717-4bd3-918a-4fde78a50d02@github.com>
References: <2oRk_zhIijFndoBhzxjsMFizcXEdjIF2Iryi5DSstCA=.1e9a06f9-1717-4bd3-918a-4fde78a50d02@github.com>
Message-ID:

On Sat, 22 Oct 2022 00:32:31 GMT, Cesar Soares Lucas wrote:

>> Also, I believe you face some ideal graph inconsistencies because you capture information too early (before split_unique_types and following IGVN pass; and previous allocation eliminations during eliminate_macro_nodes() may contribute to that).
> Can you please elaborate on that?

There are places in the patch where you check for unintended graph modifications, like in `PhaseMacroExpand::eliminate_macro_nodes()`:

    // In some cases the region controlling the RAM might go away due to some simplification
    // of the IR graph. For now, we'll just bail out if this happens.
    if (n->in(0) == NULL || !n->in(0)->is_Region()) {
      C->record_failure(C2Compiler::retry_no_reduce_allocation_merges());
      return;
    }

I consider that an implementation peculiarity contributed by RAMs rather than an inherent complication coming from the transformation being performed.

------------- PR: https://git.openjdk.org/jdk/pull/9073

From kvn at openjdk.org Thu Oct 27 00:40:09 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Thu, 27 Oct 2022 00:40:09 GMT
Subject: RFR: 8289943: Simplify some object allocation merges [v13]
In-Reply-To: References: Message-ID:

On Wed, 26 Oct 2022 23:03:21 GMT, Vladimir Ivanov wrote:

> > As of now, it serves dual purpose. It (1) marks a merge point as safe to be untangled during SR; and (2) caches information about field values.
Another important purpose of RAM is to have information at SafePoints after the merge point for reallocation during deoptimization. You need Klass information. I don't think having only Phis for values is enough.

> > I believe you can solve it in a cleaner manner without introducing placeholder nodes and connection graph adjustments. IMO it's all about keeping escape status and properly handling "safe" merges in `split_unique_types`.
> >
> > One possible way to handle merge points is:
> >
> > * Handle merge points in `adjust_scalar_replaceable_state` and refrain from marking relevant bases as NSR when possible.
> > * After `adjust_scalar_replaceable_state` is over, every merge point should have all its inputs as either NSR or SR.
> > * `split_unique_types` incrementally builds value phis to eventually replace the base phi at merge point while processing SR allocations one by one.
> > * After `split_unique_types` is done, there are no merge points anymore, each allocation has a dedicated memory graph and allocation elimination can proceed as before.

I am not sure how this could be possible. Currently EA relies on IGVN to propagate field values based on a unique memory slice. What do you do with memory Load or Store nodes after the merge point? Which memory slice will you use for them?

> > Do you see any problems with such an approach?
> >
> > One thing still confuses me though: the patch mentions that RAMs can merge both eliminated and not-yet-eliminated allocations. What's the intended use case? I believe it's still required to have all merged allocations to be eventually eliminated. Do you try to handle the case during allocation elimination when part of the inputs are already eliminated and the rest is pending their turn?

There is a check for it in `ConnectionGraph::can_reduce_this_phi()`. The only supported case is when there is no deoptimization point (SFP or UNCT) after the merge point. It allows eliminating SR allocations even if they merge with NSR allocations. That was the idea.
------------- PR: https://git.openjdk.org/jdk/pull/9073 From dzhang at openjdk.org Thu Oct 27 02:30:47 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 27 Oct 2022 02:30:47 GMT Subject: RFR: 8295968: RISC-V: Rename some assembler intrinsic functions for RVV 1.0. Message-ID: Hi, Some instructions previously had old assembler notation, but were renamed in RVV1.0[1][2] to be consistent with scalar instructions. We'd better keep the name the same as the new assembler mnemonics. [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#152-vector-count-population-in-mask-vcpopm [2] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#151-vector-mask-register-logical-instructions Please take a look and have some reviews. Thanks a lot. ## Testing: - hotspot and jdk tier1 on unmatched board without new failures ------------- Commit messages: - Rename some assembler intrinsic functions for RVV 1.0 Changes: https://git.openjdk.org/jdk/pull/10878/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10878&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295968 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/10878.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10878/head:pull/10878 PR: https://git.openjdk.org/jdk/pull/10878 From fgao at openjdk.org Thu Oct 27 03:25:25 2022 From: fgao at openjdk.org (Fei Gao) Date: Thu, 27 Oct 2022 03:25:25 GMT Subject: RFR: 8291781: assert(!is_visited) failed: visit only once with -XX:+SuperWordRTDepCheck In-Reply-To: References: <4FLpNNSA_3fCYWK0-_5EUq35VUt3kKNu_tCw8v3l0co=.8df43322-a806-4a3c-bc3f-96e184b47914@github.com> Message-ID: <42-ePvNFFmt-Q1u13v1J6MyT_Qhi-bV1giOYXNh_nW4=.2e6ec671-29b5-46f4-982f-127d40d3a066@github.com> On Wed, 26 Oct 2022 15:11:36 GMT, Vladimir Kozlov wrote: > Please, also verify fix with `compiler/codegen/Test*Vect.java` tests which failed according to [JDK-8291881](https://bugs.openjdk.org/browse/JDK-8291881) Thanks for your review 
@vnkozlov. Yes, I verified it on our internal aarch64 and x86 platforms enabling `-XX:+SuperWordRTDepCheck`. Without the fix, some `compiler/codegen/Test*Vect.java` tests failed, while with the fix, all these tests passed on both platforms. Do I need to update these testcase files with the option? Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10868 From kvn at openjdk.org Thu Oct 27 04:04:54 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 27 Oct 2022 04:04:54 GMT Subject: RFR: 8291781: assert(!is_visited) failed: visit only once with -XX:+SuperWordRTDepCheck In-Reply-To: <4FLpNNSA_3fCYWK0-_5EUq35VUt3kKNu_tCw8v3l0co=.8df43322-a806-4a3c-bc3f-96e184b47914@github.com> References: <4FLpNNSA_3fCYWK0-_5EUq35VUt3kKNu_tCw8v3l0co=.8df43322-a806-4a3c-bc3f-96e184b47914@github.com> Message-ID: On Wed, 26 Oct 2022 08:17:00 GMT, Fei Gao wrote: > `-XX:+SuperWordRTDepCheck` is a develop flag and misses proper implementation. But when enabled, it could change code path, resulting in the failures in [JDK-8291781](https://bugs.openjdk.org/browse/JDK-8291781) and [JDK-8291881](https://bugs.openjdk.org/browse/JDK-8291881). As @vnkozlov suggested in [JDK-8291781](https://bugs.openjdk.org/browse/JDK-8291781), the small patch converts the flag to pure debug code to avoid effect on code generation. Let me test it (with this flag enabled by default) before you push. 
------------- PR: https://git.openjdk.org/jdk/pull/10868 From kvn at openjdk.org Thu Oct 27 04:04:54 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 27 Oct 2022 04:04:54 GMT Subject: RFR: 8291781: assert(!is_visited) failed: visit only once with -XX:+SuperWordRTDepCheck In-Reply-To: <42-ePvNFFmt-Q1u13v1J6MyT_Qhi-bV1giOYXNh_nW4=.2e6ec671-29b5-46f4-982f-127d40d3a066@github.com> References: <4FLpNNSA_3fCYWK0-_5EUq35VUt3kKNu_tCw8v3l0co=.8df43322-a806-4a3c-bc3f-96e184b47914@github.com> <42-ePvNFFmt-Q1u13v1J6MyT_Qhi-bV1giOYXNh_nW4=.2e6ec671-29b5-46f4-982f-127d40d3a066@github.com> Message-ID: On Thu, 27 Oct 2022 03:23:15 GMT, Fei Gao wrote: > Do I need to update these testcase files with the option? No need to update them. ------------- PR: https://git.openjdk.org/jdk/pull/10868 From kvn at openjdk.org Thu Oct 27 05:11:35 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 27 Oct 2022 05:11:35 GMT Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v3] In-Reply-To: References: Message-ID: On Mon, 24 Oct 2022 08:13:06 GMT, Yi Yang wrote: >> Hi can I have a review for this fix? LoadBNode::Ideal crashes after performing GVN right after EA. The bad IR is as follows: >> >> ![image](https://user-images.githubusercontent.com/5010047/183106710-3a518e5e-0b59-4c3c-aba4-8b6fcade3519.png) >> >> The memory input of Load#971 is Phi#1109 and the address input of Load#971 is AddP whose object base is CheckCastPP#335: >> >> The type of Phi#1109 is `byte[int:>=0]:exact+any *` while `byte[int:8]:NotNull:exact+any *,iid=177` is the type of CheckCastPP#335 due to EA, they have different alias index, that's why we hit the assertion at L226: >> >> https://github.com/openjdk/jdk/blob/b17a745d7f55941f02b0bdde83866aa5d32cce07/src/hotspot/share/opto/memnode.cpp#L207-L226 >> (t is `byte[int:>=0]:exact+any *`, t_adr is `byte[int:8]:NotNull:exact+any *,iid=177`). >> >> There is a long story. 
In the beginning, LoadB#971 is generated at array_copy_forward, and GVN transformed it iteratively: >> >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> ... 
>> >> In this case, we get alias index 5 from address input AddP#969, and step it through MergeMem#1046, we found Phi#1109 then, that's why LoadB->in(Mem) is changed from MergeMem#1046 to Phi#1109 (Which finally leads to crash). >> >> 1046 MergeMem === _ 1 160 389 389 1109 1 1 389 1 1 1 1 1 1 1 1 1 1 1 1 1 709 709 709 709 882 888 894 190 190 912 191 [[ 1025 1021 1017 1013 1009 1005 1002 1001 998 996 991 986 981 976 971 966 962 961 960 121 122 123 124 1027 ]] >> >> >> After applying this patch, some related nodes are pushed into the GVN worklist, before stepping through MergeMem#1046, the address input is already changed to AddP#473. i.e., we get alias index 32 from address input AddP#473, and step it through MergeMem#1046, we found StoreB#191 then,LoadB->in(Mem) is changed from MergeMem#1046 to StoreB#191. >> >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1046 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) 
DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 
191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 390 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> ... >> >> The well-formed IR looks like this: >> ![image](https://user-images.githubusercontent.com/5010047/183239456-7096ea66-6fca-4c84-8f46-8c42d10b686a.png) >> >> Thanks for your patience. 
> > Yi Yang has updated the pull request incrementally with two additional commits since the last revision: > > - fix > - always clone the Phi with address type Thank you for providing additional information. I need to look on this more. ------------- PR: https://git.openjdk.org/jdk/pull/9777 From thartmann at openjdk.org Thu Oct 27 05:25:02 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 27 Oct 2022 05:25:02 GMT Subject: RFR: 8291781: assert(!is_visited) failed: visit only once with -XX:+SuperWordRTDepCheck In-Reply-To: <4FLpNNSA_3fCYWK0-_5EUq35VUt3kKNu_tCw8v3l0co=.8df43322-a806-4a3c-bc3f-96e184b47914@github.com> References: <4FLpNNSA_3fCYWK0-_5EUq35VUt3kKNu_tCw8v3l0co=.8df43322-a806-4a3c-bc3f-96e184b47914@github.com> Message-ID: On Wed, 26 Oct 2022 08:17:00 GMT, Fei Gao wrote: > `-XX:+SuperWordRTDepCheck` is a develop flag and misses proper implementation. But when enabled, it could change code path, resulting in the failures in [JDK-8291781](https://bugs.openjdk.org/browse/JDK-8291781) and [JDK-8291881](https://bugs.openjdk.org/browse/JDK-8291881). As @vnkozlov suggested in [JDK-8291781](https://bugs.openjdk.org/browse/JDK-8291781), the small patch converts the flag to pure debug code to avoid effect on code generation. Looks good to me too. Please wait until Vladimir's testing passed. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10868 From jiefu at openjdk.org Thu Oct 27 06:04:32 2022 From: jiefu at openjdk.org (Jie Fu) Date: Thu, 27 Oct 2022 06:04:32 GMT Subject: RFR: 8295762: [Vector API] Update generate_iota_indices for x86_32 after JDK-8293409 In-Reply-To: References: Message-ID: On Fri, 21 Oct 2022 04:31:09 GMT, Vladimir Kozlov wrote: > I can't comment on change. I assume it is copy from 64-bit code. But I am starting to concern about Vector API changes causing issues which were not caught during pre-integration testing. 
Unfortunately these tests run in [jdk_tier3](https://github.com/openjdk/jdk/blob/master/test/jdk/TEST.groups#L73) only, and as a result they are not part of GitHub Actions testing. And in Oracle we don't test 32-bit.

> > May I suggest in addition to currently run `tier1_part*` in GHA add `jdk_vector` to it. I looked on our internal testing times and all 3 `tier1_part*` and `jdk_vector` took about 5 min to run.

Here is the PR which adds jdk_vector to GHA: https://github.com/openjdk/jdk/pull/10879 . Thanks.

------------- PR: https://git.openjdk.org/jdk/pull/10807

From vlivanov at openjdk.org Thu Oct 27 07:02:28 2022
From: vlivanov at openjdk.org (Vladimir Ivanov)
Date: Thu, 27 Oct 2022 07:02:28 GMT
Subject: RFR: 8289943: Simplify some object allocation merges [v13]
In-Reply-To: References: Message-ID:

On Thu, 27 Oct 2022 00:36:07 GMT, Vladimir Kozlov wrote:

> An other important purpose of RAM is to have information at SafePoints after merge point for reallocation during deoptimization. You need Klass information. I don't think having only Phis for values is enough.

Klass information is available either from the Allocation node in `split_unique_types` or from the ConnectionGraph instance the Phi is part of.

> I am not sure how this could be possible. Currently EA rely on IGVN to propagate fields values based on unique memory slice. What you do with memory Load or Store nodes after merge point? Which memory slice you will use for them?

My understanding of how the proposed approach is expected to work: merge points have to be simple enough to still allow splitting unique types for individual allocations. For example, `eliminate_ram_addp_use()` replaces `Load (AddP (Phi base1 ... basen) off) mem` with `Phi (val1 ... valn)` and `eliminate_reduced_allocation_merge()` performs a similar transformation for `SafePoint`s. Alternatively, the corresponding `Phi`s can be built incrementally while processing each individual `base` by `split_unique_types`.
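The split-through-phi rewrite being discussed here can be mimicked on a toy IR. The following is a minimal sketch with hypothetical data structures, not HotSpot's actual Node classes; the folding of loads from scalar-replaceable bases to constants stands in for what IGVN would do on a unique memory slice:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy IR: a field load whose base is a phi of allocation bases is split
// into a phi of per-base loads; loads from scalar-replaced (SR) bases
// then fold to the known field values, while loads from non-scalar-
// replaceable (NSR) bases survive, mirroring
//   Load (AddP (Phi base_1 ... base_n) off) mem
//   ==> Phi ((Load base_1 off) ... (Load base_n off))
//   ==> Phi (... val_i ... (Load base_NSR off) ...)
public class SplitThroughPhiSketch {
    interface Node {}
    record Alloc(Map<String, Integer> fields, boolean scalarReplaceable) implements Node {}
    record Phi(List<Node> inputs) implements Node {}
    record Load(Node base, String field) implements Node {}
    record Con(int value) implements Node {}

    // Split a load through a phi base, folding loads from SR allocations.
    static Node splitLoadThroughPhi(Load load) {
        if (!(load.base() instanceof Phi phi)) return load;
        List<Node> vals = new ArrayList<>();
        for (Node base : phi.inputs()) {
            if (base instanceof Alloc a && a.scalarReplaceable()) {
                vals.add(new Con(a.fields().get(load.field()))); // folded value
            } else {
                vals.add(new Load(base, load.field())); // NSR base keeps its load
            }
        }
        return new Phi(vals);
    }

    public static void main(String[] args) {
        Alloc sr  = new Alloc(Map.of("x", 1), true);   // SR allocation
        Alloc nsr = new Alloc(Map.of("x", 3), false);  // NSR allocation
        Node split = splitLoadThroughPhi(new Load(new Phi(List.of(sr, nsr)), "x"));
        System.out.println(split); // a Phi of Con(1) and a residual Load from the NSR base
    }
}
```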
Or, just by splitting `Load`s through `Phi`:

```
Load (AddP (Phi base_1 ... base_n) off) mem
  == split-through-phi ==>
Phi ((Load (AddP base_1 off) mem) ... (Load (AddP base_n off) mem))
  == split_unique_types ==>
Phi ((Load (AddP base_1 off) mem_1) ... (Load (AddP base_n off) mem_n))
  == IGVN ==>
Phi (val_1 ... val_n)
```

> There is check for it in ConnectionGraph::can_reduce_this_phi(). The only supported cases is when no deoptimization point (SFP or UNCT) after merge point. It allow eliminate SR allocations even if they merge with NSR allocations. This was idea.

That's nice! Now I see `has_call_as_user`-related code. It means that only `Load (AddP (Phi base_1 ... base_n) off) mem` shapes are allowed now. I believe the aforementioned split-through-phi transformation should handle it well:

```
Load (AddP (Phi base_1 ... base_n) off) mem
  == split-through-phi ==>
Phi ((Load (AddP base_1 off) mem) ... (Load (AddP base_n off) mem))
  == split_unique_types ==>
Phi (... (Load (AddP base_SR_i off) mem_i) ... (Load (AddP base_NSR_n off) mem) ...)
  == IGVN ==>
Phi (... val_i ... (Load (AddP base_NSR_n off) mem) ... )
```

------------- PR: https://git.openjdk.org/jdk/pull/9073

From zcai at openjdk.org Thu Oct 27 07:07:20 2022
From: zcai at openjdk.org (Zixian Cai)
Date: Thu, 27 Oct 2022 07:07:20 GMT
Subject: RFR: 8295968: RISC-V: Rename some assembler intrinsic functions for RVV 1.0.
In-Reply-To: References: Message-ID:

On Thu, 27 Oct 2022 02:23:13 GMT, Dingli Zhang wrote:

> Hi,
>
> Some instructions previously had old assembler notation, but were renamed in RVV1.0[1][2] to be consistent with scalar instructions. We'd better keep the name the same as the new assembler mnemonics.
>
> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#152-vector-count-population-in-mask-vcpopm
> [2] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#151-vector-mask-register-logical-instructions
>
> Please take a look and have some reviews. Thanks a lot.
> > ## Testing:
> >
> > - hotspot and jdk tier1 on unmatched board without new failures

Not a reviewer. The changes match the spec. But perhaps it's a good idea to still keep the old names for compatibility per the spec. See line comments.

src/hotspot/cpu/riscv/assembler_riscv.hpp line 1139:

> 1137:
> 1138: // Vector Mask
> 1139: INSN(vcpop_m, 0b1010111, 0b010, 0b10000, 0b010000);

The assembler instruction alias `vpopc.m` is being retained for software compatibility.

src/hotspot/cpu/riscv/assembler_riscv.hpp line 1495:

> 1493: // Vector Mask-Register Logical Instructions
> 1494: INSN(vmxnor_mm, 0b1010111, 0b010, 0b1, 0b011111);
> 1495: INSN(vmorn_mm, 0b1010111, 0b010, 0b1, 0b011100);

The old `vmandnot` and `vmornot` mnemonics can be retained as assembler aliases for compatibility.

------------- PR: https://git.openjdk.org/jdk/pull/10878

From dzhang at openjdk.org Thu Oct 27 07:47:52 2022
From: dzhang at openjdk.org (Dingli Zhang)
Date: Thu, 27 Oct 2022 07:47:52 GMT
Subject: RFR: 8295968: RISC-V: Rename some assembler intrinsic functions for RVV 1.0.
In-Reply-To: References: Message-ID:

On Thu, 27 Oct 2022 07:03:53 GMT, Zixian Cai wrote:

> Not a reviewer. The changes match the spec. But perhaps it's a good idea to still keep the old names for compatibility per the spec. See line comments.

Hi @caizixian, thanks for the review! In a compiler (e.g. LLVM) these aliases need to be preserved because the assembly file only has instruction names, but do they also need to be preserved in a virtual machine like OpenJDK? If these older assembly mnemonics need to be retained as aliases, I think we can add them inside the macro assembler.

------------- PR: https://git.openjdk.org/jdk/pull/10878

From dzhang at openjdk.org Thu Oct 27 08:34:35 2022
From: dzhang at openjdk.org (Dingli Zhang)
Date: Thu, 27 Oct 2022 08:34:35 GMT
Subject: RFR: 8295968: RISC-V: Rename some assembler intrinsic functions for RVV 1.0.
[v2] In-Reply-To: References: Message-ID: <4MobhZVSEFDjnzhfa1KJ4P8SVLObW5j1qFKLDGgSlD4=.dc342bd5-f401-486d-af44-f476c4cb0d2f@github.com> > Hi, > > Some instructions previously had old assembler notation, but were renamed in RVV1.0[1][2] to be consistent with scalar instructions. We'd better keep the name the same as the new assembler mnemonics. > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#152-vector-count-population-in-mask-vcpopm > [2] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#151-vector-mask-register-logical-instructions > > Please take a look and have some reviews. Thanks a lot. > > ## Testing: > > - hotspot and jdk tier1 on unmatched board without new failures Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Rename more intrinsic functions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10878/files - new: https://git.openjdk.org/jdk/pull/10878/files/8ff95140..b7b971c1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10878&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10878&range=00-01 Stats: 7 lines in 4 files changed: 0 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/10878.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10878/head:pull/10878 PR: https://git.openjdk.org/jdk/pull/10878 From dzhang at openjdk.org Thu Oct 27 08:39:53 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 27 Oct 2022 08:39:53 GMT Subject: RFR: 8295968: RISC-V: Rename some assembler intrinsic functions for RVV 1.0. [v3] In-Reply-To: References: Message-ID: <6TA2jQl7mrwAh7K5aTG0cUIn4e1lOq0O-XnxK4DIXb4=.ad37b83d-e53d-4ffd-8eb9-0926d3d69dd7@github.com> > Hi, > > Some instructions previously had old assembler notation, but were renamed in RVV1.0[1][2] to be consistent with scalar instructions. We'd better keep the name the same as the new assembler mnemonics. 
> > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#152-vector-count-population-in-mask-vcpopm > [2] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#151-vector-mask-register-logical-instructions > > Please take a look and have some reviews. Thanks a lot. > > ## Testing: > > - hotspot and jdk tier1 on unmatched board without new failures Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Fix alignment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10878/files - new: https://git.openjdk.org/jdk/pull/10878/files/b7b971c1..2a51d342 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10878&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10878&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10878.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10878/head:pull/10878 PR: https://git.openjdk.org/jdk/pull/10878 From jbhateja at openjdk.org Thu Oct 27 09:39:32 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 27 Oct 2022 09:39:32 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: On Mon, 24 Oct 2022 22:09:29 GMT, vpaprotsk wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. 
>> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider.
>>
>> Perf before:
>>
>> Benchmark (dataSize) (provider) Mode Cnt Score Error Units
>> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ± 110554.162 ops/s
>> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ± 86696.037 ops/s
>> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ± 14074.655 ops/s
>> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ± 390.921 ops/s
>> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ± 1.402 ops/s
>>
>> and after:
>>
>> Benchmark (dataSize) (provider) Mode Cnt Score Error Units
>> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ± 154528.057 ops/s
>> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ± 95253.445 ops/s
>> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ± 100847.766 ops/s
>> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ± 25883.825 ops/s
>> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ± 56.147 ops/s
>
> vpaprotsk has updated the pull request incrementally with one additional commit since the last revision:
>
> extra whitespace character

A few other non-algorithmic changeset comments.

src/hotspot/cpu/x86/macroAssembler_x86_poly.cpp line 22:

> 20: * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
> 21: * or visit www.oracle.com if you need additional information or have any
> 22: * questions.

The stub code has recently been re-organized; to comply with it, you may want to remove this file and merge the macro-assembly code into a new file, stubGenerator_x86_64_poly.cpp, along the lines of src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp.

src/hotspot/cpu/x86/macroAssembler_x86_poly.cpp line 849:

> 847: jcc(Assembler::less, L_process16Loop);
> 848:
> 849: poly1305_process_blocks_avx512(input, length,

Since the entire code is based on 512-bit encoding, the misalignment penalty may be costly here.
A scalar peel handling (as done in the tail) for the input portion before a 64-byte-aligned address could further improve performance for large block sizes.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2040:

> 2038:
> 2039: address StubGenerator::generate_poly1305_processBlocks() {
> 2040: __ align64();

This can be replaced by __ align(CodeEntryAlignment);

src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 175:

> 173: // Choice of 1024 is arbitrary, need enough data blocks to amortize conversion overhead
> 174: // and not affect platforms without intrinsic support
> 175: int blockMultipleLength = (len/BLOCK_LENGTH) * BLOCK_LENGTH;

Since Poly1305 processes 16-byte chunks, a strength-reduced version of the above expression could be `len & ~(BLOCK_LENGTH - 1)`.

test/micro/org/openjdk/bench/javax/crypto/full/Poly1305DigestBench.java line 94:

> 92: throw new RuntimeException(ex);
> 93: }
> 94: }

On CLX the patch shows a performance regression of about 10% for block sizes 1024-2048+.

CLX (Non-IFMA target) Baseline (JDK-20):

Benchmark (dataSize) (provider) Mode Cnt Score Error Units
Poly1305DigestBench.digest 64 thrpt 2 3128928.978 ops/s
Poly1305DigestBench.digest 256 thrpt 2 1526452.083 ops/s
Poly1305DigestBench.digest 1024 thrpt 2 509267.401 ops/s
Poly1305DigestBench.digest 2048 thrpt 2 305784.922 ops/s
Poly1305DigestBench.digest 4096 thrpt 2 142175.885 ops/s
Poly1305DigestBench.digest 8192 thrpt 2 72142.906 ops/s
Poly1305DigestBench.digest 16384 thrpt 2 36357.000 ops/s
Poly1305DigestBench.digest 1048576 thrpt 2 676.142 ops/s

Withopt:

Benchmark (dataSize) (provider) Mode Cnt Score Error Units
Poly1305DigestBench.digest 64 thrpt 2 3136204.416 ops/s
Poly1305DigestBench.digest 256 thrpt 2 1683221.124 ops/s
Poly1305DigestBench.digest 1024 thrpt 2 457432.172 ops/s
Poly1305DigestBench.digest 2048 thrpt 2 277563.817 ops/s
Poly1305DigestBench.digest 4096 thrpt 2 149393.357 ops/s
Poly1305DigestBench.digest 8192 thrpt 2 79463.734 ops/s
Poly1305DigestBench.digest 16384 thrpt 2
41083.730 ops/s Poly1305DigestBench.digest 1048576 thrpt 2 705.419 ops/s ------------- PR: https://git.openjdk.org/jdk/pull/10582 From jbhateja at openjdk.org Thu Oct 27 09:39:33 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 27 Oct 2022 09:39:33 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: On Wed, 26 Oct 2022 21:11:33 GMT, Jamil Nimeh wrote: >> 10% is not a negligible impact. I see your point about AVX512 reaping the rewards of this change, but there are plenty of x86_64 systems without AVX512 that will be impacted, not to mention other platforms like aarch64 which (for this change at least) will never see the benefits from the intrinsic. >> >> I don't have any suggestions right at this moment for how this could be streamlined at all to help reduce the pain for non-AVX512 systems. Worth looking into though. > > One small thing maybe: It doesn't look like R in `processMultipleBlocks` and `rbytes` ever changes, so maybe there's no need to repeatedly serialize/deserialize them on every call to engineUpdate? There is already an `r` that is attached to the object that is an IntegerModuloP. Could that be used in `processMultipleBlocks` and perhaps a private byte[] for a serialized r is also a field in Poly1305 that can be passed into the intrinsic method rather than creating it every time? It could be set in `setRSVals`. Perhaps we can recover a little performance there? > 10% is not a negligible impact. I see your point about AVX512 reaping the rewards of this change, but there are plenty of x86_64 systems without AVX512 that will be impacted, not to mention other platforms like aarch64 which (for this change at least) will never see the benefits from the intrinsic. 
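The strength-reduced form of `blockMultipleLength` suggested in the review comments above, replacing `(len/BLOCK_LENGTH) * BLOCK_LENGTH` with a bitmask (valid because the block length is a power of two), can be sanity-checked with a throwaway sketch; this is not part of the patch:

```java
// Quick check: for a power-of-two block length,
//   (len / BLOCK_LENGTH) * BLOCK_LENGTH == len & ~(BLOCK_LENGTH - 1)
// for all non-negative lengths.
public class BlockMultipleCheck {
    static final int BLOCK_LENGTH = 16; // Poly1305 processes 16-byte blocks

    static int divMul(int len) { return (len / BLOCK_LENGTH) * BLOCK_LENGTH; }
    static int masked(int len) { return len & ~(BLOCK_LENGTH - 1); }

    public static void main(String[] args) {
        for (int len = 0; len <= 1 << 20; len++) {
            if (divMul(len) != masked(len)) {
                throw new AssertionError("mismatch at len=" + len);
            }
        }
        System.out.println("equivalent for all len in [0, 2^20]");
    }
}
```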
> > I don't have any suggestions right at this moment for how this could be streamlined at all to help reduce the pain for non-AVX512 systems. Worth looking into though.

Are you suggesting using WhiteBox APIs for the CPU feature query during Poly1305 static initialization, performing multi-block partitioning only for relevant platforms, and keeping the original implementation sacrosanct for other targets? The VM does offer native WhiteBox primitives, and they are currently used by the test infrastructure.

-------------

PR: https://git.openjdk.org/jdk/pull/10582

From zcai at openjdk.org  Thu Oct 27 10:26:32 2022
From: zcai at openjdk.org (Zixian Cai)
Date: Thu, 27 Oct 2022 10:26:32 GMT
Subject: RFR: 8295968: RISC-V: Rename some assembler intrinsic functions for RVV 1.0. [v3]
In-Reply-To: 
References: 
Message-ID: 

On Thu, 27 Oct 2022 07:44:28 GMT, Dingli Zhang wrote:

> > Not a reviewer. The changes match the spec. But perhaps it's a good idea to still keep the old names for compatibility per the spec. See line comments.
> 
> Hi @caizixian, thanks for the review! In a compiler (e.g. llvm) these aliases need to be preserved because the assembly file only has instruction names, but do they also need to be preserved in a virtual machine like openjdk? If these older assembly mnemonics need to be retained as aliases, I think we can add them inside the macro assembler.

@DingliZhang Good question. I can think of some possible use cases.

1. Someone has a fork and has existing modifications that use the vector instructions. When they merge the upstream into their fork, even though there are no merge conflicts, the code won't compile. Though you can argue that if they have a fork and maintain non-trivial changes, they should probably stay up-to-date with upstream changes.
2. When we interact with the rest of the ecosystem, for example, binutils via hsdis, old mnemonics might still be shown by other tools, so keeping the old name might help when someone does a text search of the OpenJDK code.
3.
Somewhat related to 2, when someone tries to port existing assembly snippets (either handwritten or disassembled from object files produced by gcc or LLVM), it's easier if the old mnemonics still exist.

Just to be clear, I don't have any strong opinion regarding this. But based on my recent experience with porting some GC assembly code to RISC-V, I thought it would be nice if we could make the assembler a bit more friendly to help people transition from older RISC-V specs to newer ones.

-------------

PR: https://git.openjdk.org/jdk/pull/10878

From fyang at openjdk.org  Thu Oct 27 12:35:23 2022
From: fyang at openjdk.org (Fei Yang)
Date: Thu, 27 Oct 2022 12:35:23 GMT
Subject: RFR: 8295926: RISC-V: C1: Fix LIRGenerator::do_LibmIntrinsic [v3]
In-Reply-To: 
References: 
Message-ID: <7K4kGCRTv5T4SgrFANTOboNHQZNWdk-TqGOt_Eiz0XM=.ff2db0c4-11cf-4afe-9ec9-9d1799838849@github.com>

On Wed, 26 Oct 2022 09:04:48 GMT, Xiaolin Zheng wrote:

>> The ported logic of LIRGenerator::do_LibmIntrinsic has a correctness problem, which will kill argument registers when the current libm intrinsic's operand is also a libm intrinsic, such as:
>> 
>> (dpow val1 (dlog val2))
>> 
>> LIRItem walks operands, so the `value.load_item_force(cc->at(0));` should be moved below after the LIRItem, or the result of `cc->at(0)` would be killed. But we might as well keep aligning AArch64's style to reduce some maintenance work.
>> >> >> Reproducer: >> >> >> public class A { >> >> static int count = 0; >> >> public static void print(double var) { >> if (count % 10000 == 0) { >> System.out.println(var); >> } >> count++; >> } >> >> public static void a(double var1, double var2, double var3) { >> double var4 = Math.pow(var3, Math.log(var1 / var2)); >> print(var4); >> } >> >> public static void main(String[] args) { >> >> for (int i = 0; i < 50000; i++) { >> double var21 = 2.2250738585072014E-308D; >> double var15 = 1.1102230246251565E-16D; >> double d1 = 2.0D; >> A.a(var21, var15, d1); >> } >> >> } >> >> } >> >> >> The right answer is >> >> 6.461124611136231E-203 >> 6.461124611136231E-203 >> 6.461124611136231E-203 >> 6.461124611136231E-203 >> 6.461124611136231E-203 >> >> >> The current backend gives >> >> 6.461124611136231E-203 >> NaN >> NaN >> NaN >> NaN >> >> >> Testing a hotspot tier1~4 on qemu. >> >> Thanks, >> Xiaolin > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Maybe a license test/hotspot/jtreg/compiler/floatingpoint/TestLibmIntrinsics.java line 57: > 55: > 56: static void proofread(double ans) { > 57: if (Math.abs(ans - expected) > 1e8) { If we want to do equivalence check here, shouldn't we compare with 1e-8 instead of 1e8? That said, do we really need this "proofread" check? I am assuming the check at line #80 is sufficient for this issue. ------------- PR: https://git.openjdk.org/jdk/pull/10867 From gcao at openjdk.org Thu Oct 27 13:08:04 2022 From: gcao at openjdk.org (Gui Cao) Date: Thu, 27 Oct 2022 13:08:04 GMT Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API Message-ID: Currently, certain vector-specific instructions in c2 are not implemented in RISC-V. This patch will add support of `AndReductionV`, `OrReductionV`, `XorReductionV` for RISC-V. This patch was implemented by referring to the sve version of aarch64 and riscv-v-spec v1.0 [1]. 
For example, AndReductionV is implemented as follows: diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad index 0ef36fdb292..c04962993c0 100644 --- a/src/hotspot/cpu/riscv/riscv_v.ad +++ b/src/hotspot/cpu/riscv/riscv_v.ad @@ -63,7 +63,6 @@ source %{ case Op_ExtractS: case Op_ExtractUB: // Vector API specific - case Op_AndReductionV: case Op_OrReductionV: case Op_XorReductionV: case Op_LoadVectorGather: @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{ ins_pipe(pipe_slow); %} +// vector and reduction + +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); + match(Set dst (AndReductionV src1 src2)); + effect(TEMP tmp); + ins_cost(VEC_COST); + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t" + "vredand.vs $tmp, $src2, $tmp\n\t" + "vmv.x.s $dst, $tmp" %} + ins_encode %{ + __ vsetvli(t0, x0, Assembler::e32); + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), + as_VectorRegister($tmp$$reg)); + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); + %} + ins_pipe(pipe_slow); +%} + +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{ + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG); + match(Set dst (AndReductionV src1 src2)); + effect(TEMP tmp); + ins_cost(VEC_COST); + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t" + "vredand.vs $tmp, $src2, $tmp\n\t" + "vmv.x.s $dst, $tmp" %} + ins_encode %{ + __ vsetvli(t0, x0, Assembler::e64); + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), + as_VectorRegister($tmp$$reg)); + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); + %} After this patch, Vector API can use RVV with the `-XX:+UseRVV` parameter when executing java programs on the RISC-V RVV 1.0 
platform. Tests [2] and [3] can be used to test the implementation of this node and it passes the tests properly. By adding the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing the test case, hsdis is currently unable to decompile rvv's assembly instructions. The relevant OptoAssembly log output in the compilation log is as follows: 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null 2b8 ld R30, [R14, #40] # class, #@loadKlass 2bc li R7, #-1 # int, #@loadConI 2c0 vmv.s.x V1, R7 #@reduce_andI vredand.vs V1, V2, V1 vmv.x.s R28, V1 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000 There is no hardware implementation of RISC-V RVV 1.0, so the tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. The execution of `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu, with `-XX:+UseRVV` turned on, can reduce the execution time of this method by about 50.7% compared to the RVV version without this node implemented. 
After implementing this node, a comparison of the C2 assembly with and without the `-XX:+UseRVV` parameter shows that enabling `-XX:+UseRVV` reduces the number of assembly instructions by about 50% [4].

[1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations
[2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests
[3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests
[4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md

## Testing:
- hotspot and jdk tier1 on unmatched board without new failures
- test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu
- test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu

-------------

Commit messages:
 - Add Reduction C2 instructions for Vector api

Changes: https://git.openjdk.org/jdk/pull/10691/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10691&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8295261
Stats: 117 lines in 1 file changed: 114 ins; 3 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/10691.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/10691/head:pull/10691

PR: https://git.openjdk.org/jdk/pull/10691

From jwaters at openjdk.org  Thu Oct 27 13:08:05 2022
From: jwaters at openjdk.org (Julian Waters)
Date: Thu, 27 Oct 2022 13:08:05 GMT
Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API
In-Reply-To: 
References: 
Message-ID: <6LfsG-iNOVac_mgEj0agsbL1uyHG94AhFIC9YEiF-FE=.b7153816-3f79-4e51-8463-ac09382d0fe3@github.com>

On Thu, 13 Oct 2022 07:54:47 GMT, Gui Cao wrote:

> Currently, certain vector-specific instructions in c2 are not implemented in RISC-V. This patch will add support of `AndReductionV`, `OrReductionV`, `XorReductionV` for RISC-V.
This patch was implemented by referring to the sve version of aarch64 and riscv-v-spec v1.0 [1]. > > For example, AndReductionV is implemented as follows: > > > diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad > index 0ef36fdb292..c04962993c0 100644 > --- a/src/hotspot/cpu/riscv/riscv_v.ad > +++ b/src/hotspot/cpu/riscv/riscv_v.ad > @@ -63,7 +63,6 @@ source %{ > case Op_ExtractS: > case Op_ExtractUB: > // Vector API specific > - case Op_AndReductionV: > case Op_OrReductionV: > case Op_XorReductionV: > case Op_LoadVectorGather: > @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{ > ins_pipe(pipe_slow); > %} > > +// vector and reduction > + > +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e32); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > + ins_pipe(pipe_slow); > +%} > + > +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e64); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + 
as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > > > > After this patch, Vector API can use RVV with the `-XX:+UseRVV` parameter when executing java programs on the RISC-V RVV 1.0 platform. Tests [2] and [3] can be used to test the implementation of this node and it passes the tests properly. > > By adding the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing the test case, hsdis is currently unable to decompile rvv's assembly instructions. The relevant OptoAssembly log output in the compilation log is as follows: > > > 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131 > 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass > 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null > 2b8 ld R30, [R14, #40] # class, #@loadKlass > 2bc li R7, #-1 # int, #@loadConI > 2c0 vmv.s.x V1, R7 #@reduce_andI > vredand.vs V1, V2, V1 > vmv.x.s R28, V1 > 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP > 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000 > > > There is no hardware implementation of RISC-V RVV 1.0, so the tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. The execution of `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu, with `-XX:+UseRVV` turned on, can reduce the execution time of this method by about 50.7% compared to the RVV version without this node implemented. 
After implementing this node, by comparing the influence of the number of C2 assembly instructions before and after the -XX:+UseRVV parameter is enabled, after enabling -XX:+UseRVV, the number of assembly instructions is reduced by about 50% [4] > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests > [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests > [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md > > ## Testing: > - hotspot and jdk tier1 on unmatched board without new failures > - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu > - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu @robilad You help may be needed here :P ------------- PR: https://git.openjdk.org/jdk/pull/10691 From kvn at openjdk.org Thu Oct 27 15:32:24 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 27 Oct 2022 15:32:24 GMT Subject: RFR: 8291781: assert(!is_visited) failed: visit only once with -XX:+SuperWordRTDepCheck In-Reply-To: <42-ePvNFFmt-Q1u13v1J6MyT_Qhi-bV1giOYXNh_nW4=.2e6ec671-29b5-46f4-982f-127d40d3a066@github.com> References: <4FLpNNSA_3fCYWK0-_5EUq35VUt3kKNu_tCw8v3l0co=.8df43322-a806-4a3c-bc3f-96e184b47914@github.com> <42-ePvNFFmt-Q1u13v1J6MyT_Qhi-bV1giOYXNh_nW4=.2e6ec671-29b5-46f4-982f-127d40d3a066@github.com> Message-ID: On Thu, 27 Oct 2022 03:23:15 GMT, Fei Gao wrote: >> Please, also verify fix with `compiler/codegen/Test*Vect.java` tests which failed according to [JDK-8291881](https://bugs.openjdk.org/browse/JDK-8291881) > >> Please, also verify fix with `compiler/codegen/Test*Vect.java` tests which failed according to [JDK-8291881](https://bugs.openjdk.org/browse/JDK-8291881) > > Thanks for your review @vnkozlov. 
Yes, I verified it on our internal aarch64 and x86 platforms enabling `-XX:+SuperWordRTDepCheck`. Without the fix, some `compiler/codegen/Test*Vect.java` tests failed, while with the fix, all these tests passed on both platforms. Do I need to update these testcase files with the option? Thanks. @fg1417 my testing passed. You can integrate. ------------- PR: https://git.openjdk.org/jdk/pull/10868 From duke at openjdk.org Thu Oct 27 19:52:36 2022 From: duke at openjdk.org (SuperCoder79) Date: Thu, 27 Oct 2022 19:52:36 GMT Subject: RFR: 8291336: Add ideal rule to convert floating point multiply by 2 into addition [v5] In-Reply-To: References: Message-ID: On Mon, 24 Oct 2022 13:41:12 GMT, SuperCoder79 wrote: >> Hello, >> I would like to propose an ideal transform that converts floating point multiply by 2 (`x * 2`) into an addition operation instead. This would allow for the elimination of the memory reference for the constant two, and keep the whole operation inside registers. My justifications for this optimization include: >> * As per [Agner Fog's instruction tables](https://www.agner.org/optimize/instruction_tables.pdf) many older systems, such as the sandy bridge and ivy bridge architectures, have different latencies for addition and multiplication meaning this change could have beneficial effects when in hot code. >> * The removal of the memory load would have a beneficial effect in cache bound situations. >> * Multiplication by 2 is relatively common construct so this change can apply to a wide range of Java code. >> >> As this is my first time looking into the c2 codebase, I have a few lingering questions about my implementation and how certain parts of the compiler work. Mainly, is this patch getting the type of the operands correctly? I saw some cases where code used `bottom_type()` and other cases where it used `phase->type(value)`. Similarly, are nodes able to be reused as is being done in the AddNode constructors? 
I saw some places where the clone method was being used, but other places where it wasn't. >> >> I have attached an IR test and a jmh benchmark. Tier 1 testing passes on my machine. >> >> Thanks for your time, >> Jasmine > > SuperCoder79 has updated the pull request incrementally with one additional commit since the last revision: > > Apply changes from code review > > - Added interpreter assert Thank you! ------------- PR: https://git.openjdk.org/jdk/pull/9642 From jnimeh at openjdk.org Thu Oct 27 21:21:33 2022 From: jnimeh at openjdk.org (Jamil Nimeh) Date: Thu, 27 Oct 2022 21:21:33 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: On Thu, 27 Oct 2022 09:22:03 GMT, Jatin Bhateja wrote: >> One small thing maybe: It doesn't look like R in `processMultipleBlocks` and `rbytes` ever changes, so maybe there's no need to repeatedly serialize/deserialize them on every call to engineUpdate? There is already an `r` that is attached to the object that is an IntegerModuloP. Could that be used in `processMultipleBlocks` and perhaps a private byte[] for a serialized r is also a field in Poly1305 that can be passed into the intrinsic method rather than creating it every time? It could be set in `setRSVals`. Perhaps we can recover a little performance there? > >> 10% is not a negligible impact. I see your point about AVX512 reaping the rewards of this change, but there are plenty of x86_64 systems without AVX512 that will be impacted, not to mention other platforms like aarch64 which (for this change at least) will never see the benefits from the intrinsic. >> >> I don't have any suggestions right at this moment for how this could be streamlined at all to help reduce the pain for non-AVX512 systems. Worth looking into though. 
> Are you suggesting using WhiteBox APIs for the CPU feature query during Poly1305 static initialization, performing multi-block processing only for relevant platforms, and keeping the original implementation sacrosanct for other targets? The VM does offer native WhiteBox primitives, and they are currently used by the test infrastructure.

No, going the WhiteBox route was not something I was thinking of. I sought feedback from a couple of hotspot-knowledgeable people about the use of WhiteBox APIs and both felt that it was not the right way to go. One said that WhiteBox is really for VM testing and not for these kinds of Java classes.

-------------

PR: https://git.openjdk.org/jdk/pull/10582

From kvn at openjdk.org  Thu Oct 27 23:20:34 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Thu, 27 Oct 2022 23:20:34 GMT
Subject: RFR: 8285835: SIGSEGV in PhaseIdealLoop::build_loop_late_post_work
Message-ID: 

EA does not adjust the NSR (not_scalar_replaceable) state for referenced allocations. In the test case, object A is NSR because it merges with a NULL object, but this state is not propagated to the allocations it references. As a result, other allocations are marked scalar replaceable and a related Load node is moved above the guarding condition (where object A is checked for NULL). EA should propagate the NSR state.

Thanks to @rwestrel, who provided the reproducer test case.

Testing tier1-4, xcomp, stress.
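A minimal sketch of the shape described above (hypothetical, not the actual reproducer from the test case): one allocation merges with null at a phi and is therefore non-scalar-replaceable, and the allocation it references must inherit that state, otherwise the field load could be scheduled above the null check.

```java
// Hypothetical sketch of the pattern described above (not the actual
// reproducer): 'box' merges with null at a phi, so it is NSR; the Point it
// references must also be treated as NSR, otherwise the 'box.p.x' load could
// be scheduled above the 'box == null' guard.
public class NsrSketch {
    static class Point { int x; Point(int x) { this.x = x; } }
    static class Box { Point p; Box(Point p) { this.p = p; } }

    static int test(boolean flag) {
        Box box = flag ? new Box(new Point(42)) : null; // merges with null -> NSR
        if (box == null) {
            return 0;       // guarding condition
        }
        return box.p.x;     // load must stay below the null check
    }

    public static void main(String[] args) {
        System.out.println(test(true)); // prints 42
    }
}
```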
-------------

Commit messages:
 - 8285835: SIGSEGV in PhaseIdealLoop::build_loop_late_post_work

Changes: https://git.openjdk.org/jdk/pull/10894/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10894&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8285835
Stats: 145 lines in 3 files changed: 135 ins; 0 del; 10 mod
Patch: https://git.openjdk.org/jdk/pull/10894.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/10894/head:pull/10894

PR: https://git.openjdk.org/jdk/pull/10894

From duke at openjdk.org  Fri Oct 28 03:20:07 2022
From: duke at openjdk.org (xpbob)
Date: Fri, 28 Oct 2022 03:20:07 GMT
Subject: RFR: 8293785: Add a jtreg test for TraceOptoParse
Message-ID: 

Add a jtreg test for TraceOptoParse after [JDK-8293774](https://bugs.openjdk.org/browse/JDK-8293774)

-------------

Commit messages:
 - 8293785: Add a jtreg test for TraceOptoParse

Changes: https://git.openjdk.org/jdk/pull/10898/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10898&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8293785
Stats: 39 lines in 1 file changed: 39 ins; 0 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/10898.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/10898/head:pull/10898

PR: https://git.openjdk.org/jdk/pull/10898

From duke at openjdk.org  Fri Oct 28 03:20:07 2022
From: duke at openjdk.org (xpbob)
Date: Fri, 28 Oct 2022 03:20:07 GMT
Subject: RFR: 8293785: Add a jtreg test for TraceOptoParse
In-Reply-To: 
References: 
Message-ID: 

On Fri, 28 Oct 2022 03:11:49 GMT, xpbob wrote:

> Add a jtreg test for TraceOptoParse after [JDK-8293774](https://bugs.openjdk.org/browse/JDK-8293774)

Sorry for the late reply. Is this the jtreg test discussed here: https://github.com/openjdk/jdk/pull/10262#pullrequestreview-1106778703 ? @chhagedorn Thanks.
Best regards, Bob ------------- PR: https://git.openjdk.org/jdk/pull/10898 From yadongwang at openjdk.org Fri Oct 28 03:20:29 2022 From: yadongwang at openjdk.org (Yadong Wang) Date: Fri, 28 Oct 2022 03:20:29 GMT Subject: RFR: 8295926: RISC-V: C1: Fix LIRGenerator::do_LibmIntrinsic [v3] In-Reply-To: References: Message-ID: On Wed, 26 Oct 2022 09:04:48 GMT, Xiaolin Zheng wrote: >> The ported logic of LIRGenerator::do_LibmIntrinsic has a correctness problem, which will kill argument registers when the current libm intrinsic's operand is also a libm intrinsic, such as: >> >> (dpow val1 (dlog val2)) >> >> LIRItem walks operands, so the `value.load_item_force(cc->at(0));` should be moved below after the LIRItem, or the result of `cc->at(0)` would be killed. But we might as well keep aligning AArch64's style to reduce some maintenance work. >> >> >> Reproducer: >> >> >> public class A { >> >> static int count = 0; >> >> public static void print(double var) { >> if (count % 10000 == 0) { >> System.out.println(var); >> } >> count++; >> } >> >> public static void a(double var1, double var2, double var3) { >> double var4 = Math.pow(var3, Math.log(var1 / var2)); >> print(var4); >> } >> >> public static void main(String[] args) { >> >> for (int i = 0; i < 50000; i++) { >> double var21 = 2.2250738585072014E-308D; >> double var15 = 1.1102230246251565E-16D; >> double d1 = 2.0D; >> A.a(var21, var15, d1); >> } >> >> } >> >> } >> >> >> The right answer is >> >> 6.461124611136231E-203 >> 6.461124611136231E-203 >> 6.461124611136231E-203 >> 6.461124611136231E-203 >> 6.461124611136231E-203 >> >> >> The current backend gives >> >> 6.461124611136231E-203 >> NaN >> NaN >> NaN >> NaN >> >> >> Testing a hotspot tier1~4 on qemu. >> >> Thanks, >> Xiaolin > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Maybe a license @zhengxiaolinX Nice catch. LGTM. ------------- Marked as reviewed by yadongwang (Author). 
PR: https://git.openjdk.org/jdk/pull/10867 From xlinzheng at openjdk.org Fri Oct 28 03:53:27 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Fri, 28 Oct 2022 03:53:27 GMT Subject: RFR: 8295926: RISC-V: C1: Fix LIRGenerator::do_LibmIntrinsic [v4] In-Reply-To: References: Message-ID: > The ported logic of LIRGenerator::do_LibmIntrinsic has a correctness problem, which will kill argument registers when the current libm intrinsic's operand is also a libm intrinsic, such as: > > (dpow val1 (dlog val2)) > > LIRItem walks operands, so the `value.load_item_force(cc->at(0));` should be moved below after the LIRItem, or the result of `cc->at(0)` would be killed. But we might as well keep aligning AArch64's style to reduce some maintenance work. > > > Reproducer: > > > public class A { > > static int count = 0; > > public static void print(double var) { > if (count % 10000 == 0) { > System.out.println(var); > } > count++; > } > > public static void a(double var1, double var2, double var3) { > double var4 = Math.pow(var3, Math.log(var1 / var2)); > print(var4); > } > > public static void main(String[] args) { > > for (int i = 0; i < 50000; i++) { > double var21 = 2.2250738585072014E-308D; > double var15 = 1.1102230246251565E-16D; > double d1 = 2.0D; > A.a(var21, var15, d1); > } > > } > > } > > > The right answer is > > 6.461124611136231E-203 > 6.461124611136231E-203 > 6.461124611136231E-203 > 6.461124611136231E-203 > 6.461124611136231E-203 > > > The current backend gives > > 6.461124611136231E-203 > NaN > NaN > NaN > NaN > > > Testing a hotspot tier1~4 on qemu. 
> > Thanks, > Xiaolin Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: Fix the newly-added test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10867/files - new: https://git.openjdk.org/jdk/pull/10867/files/01e54b45..9334ce71 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10867&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10867&range=02-03 Stats: 9 lines in 1 file changed: 0 ins; 9 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10867.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10867/head:pull/10867 PR: https://git.openjdk.org/jdk/pull/10867 From xlinzheng at openjdk.org Fri Oct 28 03:53:29 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Fri, 28 Oct 2022 03:53:29 GMT Subject: RFR: 8295926: RISC-V: C1: Fix LIRGenerator::do_LibmIntrinsic [v3] In-Reply-To: <7K4kGCRTv5T4SgrFANTOboNHQZNWdk-TqGOt_Eiz0XM=.ff2db0c4-11cf-4afe-9ec9-9d1799838849@github.com> References: <7K4kGCRTv5T4SgrFANTOboNHQZNWdk-TqGOt_Eiz0XM=.ff2db0c4-11cf-4afe-9ec9-9d1799838849@github.com> Message-ID: On Thu, 27 Oct 2022 12:33:13 GMT, Fei Yang wrote: >> Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: >> >> Maybe a license > > test/hotspot/jtreg/compiler/floatingpoint/TestLibmIntrinsics.java line 57: > >> 55: >> 56: static void proofread(double ans) { >> 57: if (Math.abs(ans - expected) > 1e8) { > > If we want to do equivalence check here, shouldn't we compare with 1e-8 instead of 1e8? > That said, do we really need this "proofread" check? I am assuming the check at line #80 is sufficient for this issue. Sorry for the error - yes, it should have been `1e-8`, but somehow the `-` ran away when it was written :-). And yes again: only the former check matters, so it is removed. 
------------- PR: https://git.openjdk.org/jdk/pull/10867 From dzhang at openjdk.org Fri Oct 28 05:47:20 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 28 Oct 2022 05:47:20 GMT Subject: RFR: 8295968: RISC-V: Rename some assembler intrinsic functions for RVV 1.0. [v3] In-Reply-To: References: Message-ID: On Thu, 27 Oct 2022 10:24:15 GMT, Zixian Cai wrote: >>> Not a reviewer. The changes match the spec. But perhaps it's a good idea to still keep the old names for compatibility per the spec. See line comments. >> >> Hi @caizixian, thanks for review! >> In a compiler (e.g. llvm) these alias need to be preserved because the assembly file only has instruction names, but does it also need to be preserved in a virtual machine like openjdk? >> If these older assembly mnemonics need to be retained as aliases, I think we can add it inside the macro assembler. > >> > Not a reviewer. The changes match the spec. But perhaps it's a good idea to still keep the old names for compatibility per the spec. See line comments. >> >> Hi @caizixian, thanks for review! In a compiler (e.g. llvm) these alias need to be preserved because the assembly file only has instruction names, but does it also need to be preserved in a virtual machine like openjdk? If these older assembly mnemonics need to be retained as aliases, I think we can add it inside the macro assembler. > > @DingliZhang Good question. I can think of some possible use cases. > > 1. Someone has a fork and has existing modifications that use the vector instructions. When they merge the upstream into their fork, even though there's no merge conflicts, the code won't compile. Though you can argue that if they have a fork and maintain non-trivial changes, they should probably stay up-to-date with upstream changes. > 2. 
When we interact with the rest of the ecosystem, for example, binutils via hsdis, old mnemonics might still be shown by other tools, so keeping the old name might help when someone does a text search of the OpenJDK code. > 3. Somewhat related to 2, when someone tries to port existing assembly snippets (either handwritten or disassembling object files produced by gcc or LLVM), it's easier if the old mnemonics still exist. > > Just to be clear, I don't have any strong opinion regarding this. But based on my recent experience with porting some GC assembly code to RISC-V, I thought it would be nice if we can make the assembler is bit more friendly to help people transition from older RISC-V specs to newer ones. @caizixian Thank you for your cases! I got your point. Let's wait for reviewers a bit more. ------------- PR: https://git.openjdk.org/jdk/pull/10878 From fgao at openjdk.org Fri Oct 28 06:10:25 2022 From: fgao at openjdk.org (Fei Gao) Date: Fri, 28 Oct 2022 06:10:25 GMT Subject: RFR: 8291781: assert(!is_visited) failed: visit only once with -XX:+SuperWordRTDepCheck In-Reply-To: <42-ePvNFFmt-Q1u13v1J6MyT_Qhi-bV1giOYXNh_nW4=.2e6ec671-29b5-46f4-982f-127d40d3a066@github.com> References: <4FLpNNSA_3fCYWK0-_5EUq35VUt3kKNu_tCw8v3l0co=.8df43322-a806-4a3c-bc3f-96e184b47914@github.com> <42-ePvNFFmt-Q1u13v1J6MyT_Qhi-bV1giOYXNh_nW4=.2e6ec671-29b5-46f4-982f-127d40d3a066@github.com> Message-ID: On Thu, 27 Oct 2022 03:23:15 GMT, Fei Gao wrote: >> Please, also verify fix with `compiler/codegen/Test*Vect.java` tests which failed according to [JDK-8291881](https://bugs.openjdk.org/browse/JDK-8291881) > >> Please, also verify fix with `compiler/codegen/Test*Vect.java` tests which failed according to [JDK-8291881](https://bugs.openjdk.org/browse/JDK-8291881) > > Thanks for your review @vnkozlov. Yes, I verified it on our internal aarch64 and x86 platforms enabling `-XX:+SuperWordRTDepCheck`. 
Without the fix, some `compiler/codegen/Test*Vect.java` tests failed, while with the fix, all these tests passed on both platforms. Do I need to update these testcase files with the option? Thanks. > @fg1417 my testing passed. You can integrate. Thanks for your review and test work, @vnkozlov @TobiHartmann. ------------- PR: https://git.openjdk.org/jdk/pull/10868 From fyang at openjdk.org Fri Oct 28 06:27:24 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 28 Oct 2022 06:27:24 GMT Subject: RFR: 8295926: RISC-V: C1: Fix LIRGenerator::do_LibmIntrinsic [v4] In-Reply-To: References: Message-ID: <8rREBS9qf8XZOk9RhyQMPpR_-9JZ2c69THgtsLD9MII=.67091c46-44f2-4b95-b2ea-c37ae3ce8c6b@github.com> On Fri, 28 Oct 2022 03:53:27 GMT, Xiaolin Zheng wrote: >> The ported logic of LIRGenerator::do_LibmIntrinsic has a correctness problem, which will kill argument registers when the current libm intrinsic's operand is also a libm intrinsic, such as: >> >> (dpow val1 (dlog val2)) >> >> LIRItem walks operands, so the `value.load_item_force(cc->at(0));` should be moved to after the LIRItem, or the result of `cc->at(0)` would be killed. But we might as well keep aligning with AArch64's style to reduce some maintenance work.
>> >> >> Reproducer: >> >> >> public class A { >> >> static int count = 0; >> >> public static void print(double var) { >> if (count % 10000 == 0) { >> System.out.println(var); >> } >> count++; >> } >> >> public static void a(double var1, double var2, double var3) { >> double var4 = Math.pow(var3, Math.log(var1 / var2)); >> print(var4); >> } >> >> public static void main(String[] args) { >> >> for (int i = 0; i < 50000; i++) { >> double var21 = 2.2250738585072014E-308D; >> double var15 = 1.1102230246251565E-16D; >> double d1 = 2.0D; >> A.a(var21, var15, d1); >> } >> >> } >> >> } >> >> >> The right answer is >> >> 6.461124611136231E-203 >> 6.461124611136231E-203 >> 6.461124611136231E-203 >> 6.461124611136231E-203 >> 6.461124611136231E-203 >> >> >> The current backend gives >> >> 6.461124611136231E-203 >> NaN >> NaN >> NaN >> NaN >> >> >> Testing a hotspot tier1~4 on qemu. >> >> Thanks, >> Xiaolin > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Fix the newly-added test Updated change looks good. Thanks for finding and fixing this. ------------- Marked as reviewed by fyang (Reviewer). 
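For readers who want to see the failure mode without the full reproducer, the pattern boils down to keeping a nested libm call hot. The sketch below is my own reduced variant (the class and method names are hypothetical, and this is not the test that was committed); the key ingredient is a libm intrinsic whose operand is itself a libm intrinsic, i.e. `(dpow val1 (dlog val2))`:

```java
// Reduced sketch of the reproducer quoted above (hypothetical names, not the
// committed test). A broken LIRGenerator::do_LibmIntrinsic clobbers the
// register holding the inner intrinsic's result, so once the method is
// compiled it starts returning NaN instead of ~6.46e-203 (the value reported
// in the bug as the correct answer).
public class NestedLibm {
    static double f(double a, double b, double c) {
        // Outer intrinsic (pow) consumes the result of the inner one (log).
        return Math.pow(c, Math.log(a / b));
    }

    public static void main(String[] args) {
        double expected = f(2.2250738585072014E-308, 1.1102230246251565E-16, 2.0);
        // Keep the method hot so C1 compiles it; on a broken backend the
        // result flips to NaN after compiled code is installed.
        for (int i = 0; i < 50_000; i++) {
            double r = f(2.2250738585072014E-308, 1.1102230246251565E-16, 2.0);
            if (Double.isNaN(r)) {
                throw new AssertionError("result became NaN after compilation");
            }
        }
        System.out.println(expected);
    }
}
```

On a fixed build every iteration returns the same value as the interpreter.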
PR: https://git.openjdk.org/jdk/pull/10867 From bulasevich at openjdk.org Fri Oct 28 06:32:45 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 28 Oct 2022 06:32:45 GMT Subject: RFR: 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes [v2] In-Reply-To: References: Message-ID: > 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: adding CodeSection._const_section_alignment field instead of 64 bytes alignment for all JVMCI nmethods ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10392/files - new: https://git.openjdk.org/jdk/pull/10392/files/80693ef8..f743a421 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10392&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10392&range=00-01 Stats: 68 lines in 5 files changed: 25 ins; 36 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/10392.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10392/head:pull/10392 PR: https://git.openjdk.org/jdk/pull/10392 From bulasevich at openjdk.org Fri Oct 28 06:50:44 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 28 Oct 2022 06:50:44 GMT Subject: RFR: 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes [v3] In-Reply-To: References: Message-ID: <344HV9b4AE7T8qNZktGvimF3r0XQ9PHnd8h0HqMoeJk=.256fd4a9-2598-48bf-bc91-8ce5a7d21ac1@github.com> > 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains two commits: - adding CodeSection._const_section_alignment field instead of 64 bytes alignment for all JVMCI nmethods - 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes ------------- Changes: https://git.openjdk.org/jdk/pull/10392/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10392&range=02 Stats: 53 lines in 4 files changed: 26 ins; 20 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/10392.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10392/head:pull/10392 PR: https://git.openjdk.org/jdk/pull/10392 From xlinzheng at openjdk.org Fri Oct 28 06:52:23 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Fri, 28 Oct 2022 06:52:23 GMT Subject: RFR: 8295926: RISC-V: C1: Fix LIRGenerator::do_LibmIntrinsic [v4] In-Reply-To: References: Message-ID: On Fri, 28 Oct 2022 03:53:27 GMT, Xiaolin Zheng wrote: >> The ported logic of LIRGenerator::do_LibmIntrinsic has a correctness problem, which will kill argument registers when the current libm intrinsic's operand is also a libm intrinsic, such as: >> >> (dpow val1 (dlog val2)) >> >> LIRItem walks operands, so the `value.load_item_force(cc->at(0));` should be moved below after the LIRItem, or the result of `cc->at(0)` would be killed. But we might as well keep aligning AArch64's style to reduce some maintenance work. 
>> >> >> Reproducer: >> >> >> public class A { >> >> static int count = 0; >> >> public static void print(double var) { >> if (count % 10000 == 0) { >> System.out.println(var); >> } >> count++; >> } >> >> public static void a(double var1, double var2, double var3) { >> double var4 = Math.pow(var3, Math.log(var1 / var2)); >> print(var4); >> } >> >> public static void main(String[] args) { >> >> for (int i = 0; i < 50000; i++) { >> double var21 = 2.2250738585072014E-308D; >> double var15 = 1.1102230246251565E-16D; >> double d1 = 2.0D; >> A.a(var21, var15, d1); >> } >> >> } >> >> } >> >> >> The right answer is >> >> 6.461124611136231E-203 >> 6.461124611136231E-203 >> 6.461124611136231E-203 >> 6.461124611136231E-203 >> 6.461124611136231E-203 >> >> >> The current backend gives >> >> 6.461124611136231E-203 >> NaN >> NaN >> NaN >> NaN >> >> >> Testing a hotspot tier1~4 on qemu. >> >> Thanks, >> Xiaolin > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Fix the newly-added test Thank you all for your reviews and comments! ------------- PR: https://git.openjdk.org/jdk/pull/10867 From yadongwang at openjdk.org Fri Oct 28 07:03:25 2022 From: yadongwang at openjdk.org (Yadong Wang) Date: Fri, 28 Oct 2022 07:03:25 GMT Subject: RFR: 8295968: RISC-V: Rename some assembler intrinsic functions for RVV 1.0. [v3] In-Reply-To: References: Message-ID: On Thu, 27 Oct 2022 10:24:15 GMT, Zixian Cai wrote: > > > Not a reviewer. The changes match the spec. But perhaps it's a good idea to still keep the old names for compatibility per the spec. See line comments. > > > > > > Hi @caizixian, thanks for review! In a compiler (e.g. llvm) these alias need to be preserved because the assembly file only has instruction names, but does it also need to be preserved in a virtual machine like openjdk? If these older assembly mnemonics need to be retained as aliases, I think we can add it inside the macro assembler. 
> > @DingliZhang Good question. I can think of some possible use cases. > > 1. Someone has a fork and has existing modifications that use the vector instructions. When they merge the upstream into their fork, even though there are no merge conflicts, the code won't compile. Though you can argue that if they have a fork and maintain non-trivial changes, they should probably stay up-to-date with upstream changes. > 2. When we interact with the rest of the ecosystem, for example, binutils via hsdis, old mnemonics might still be shown by other tools, so keeping the old name might help when someone does a text search of the OpenJDK code. > 3. Somewhat related to 2, when someone tries to port existing assembly snippets (either handwritten or disassembling object files produced by gcc or LLVM), it's easier if the old mnemonics still exist. > > Just to be clear, I don't have any strong opinion regarding this. But based on my recent experience with porting some GC assembly code to RISC-V, I thought it would be nice if we could make the assembler a bit more friendly to help people transition from older RISC-V specs to newer ones. Good thinking. But I think it's better to be consistent with RVV 1.0, the first release version we claim to support in the riscv port. (Although we know that rvv has unfortunately fragmented in the real world.)
------------- PR: https://git.openjdk.org/jdk/pull/10878 From dlong at openjdk.org Fri Oct 28 07:18:28 2022 From: dlong at openjdk.org (Dean Long) Date: Fri, 28 Oct 2022 07:18:28 GMT Subject: RFR: 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes [v3] In-Reply-To: <344HV9b4AE7T8qNZktGvimF3r0XQ9PHnd8h0HqMoeJk=.256fd4a9-2598-48bf-bc91-8ce5a7d21ac1@github.com> References: <344HV9b4AE7T8qNZktGvimF3r0XQ9PHnd8h0HqMoeJk=.256fd4a9-2598-48bf-bc91-8ce5a7d21ac1@github.com> Message-ID: On Fri, 28 Oct 2022 06:50:44 GMT, Boris Ulasevich wrote: >> 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes > > Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: > > - adding CodeSection._const_section_alignment field instead of 64 bytes alignment for all JVMCI nmethods > - 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes Looks good. ------------- Marked as reviewed by dlong (Reviewer). PR: https://git.openjdk.org/jdk/pull/10392 From roland at openjdk.org Fri Oct 28 07:25:26 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 28 Oct 2022 07:25:26 GMT Subject: RFR: 8285835: SIGSEGV in PhaseIdealLoop::build_loop_late_post_work In-Reply-To: References: Message-ID: On Thu, 27 Oct 2022 23:14:11 GMT, Vladimir Kozlov wrote: > EA does not adjust NSR (not_scalar_replaceable) state for referenced allocations. > In the test case, object A is NSR because it merges with a NULL object. But this state is not propagated to allocations it references. As a result, other allocations are marked scalar replaceable and the related Load node is moved above the guarding condition (where the A object is checked for NULL). > EA should propagate NSR state. > > Thanks to @rwestrel who provided the reproducer test case. > > Testing tier1-4, xcomp, stress. Looks reasonable to me.
Have you verified that the failure with the replay file and jars from the initial crash doesn't reproduce either? ------------- Marked as reviewed by roland (Reviewer). PR: https://git.openjdk.org/jdk/pull/10894 From fgao at openjdk.org Fri Oct 28 07:31:18 2022 From: fgao at openjdk.org (Fei Gao) Date: Fri, 28 Oct 2022 07:31:18 GMT Subject: Integrated: 8291781: assert(!is_visited) failed: visit only once with -XX:+SuperWordRTDepCheck In-Reply-To: <4FLpNNSA_3fCYWK0-_5EUq35VUt3kKNu_tCw8v3l0co=.8df43322-a806-4a3c-bc3f-96e184b47914@github.com> References: <4FLpNNSA_3fCYWK0-_5EUq35VUt3kKNu_tCw8v3l0co=.8df43322-a806-4a3c-bc3f-96e184b47914@github.com> Message-ID: On Wed, 26 Oct 2022 08:17:00 GMT, Fei Gao wrote: > `-XX:+SuperWordRTDepCheck` is a develop flag and misses proper implementation. But when enabled, it could change code path, resulting in the failures in [JDK-8291781](https://bugs.openjdk.org/browse/JDK-8291781) and [JDK-8291881](https://bugs.openjdk.org/browse/JDK-8291881). As @vnkozlov suggested in [JDK-8291781](https://bugs.openjdk.org/browse/JDK-8291781), the small patch converts the flag to pure debug code to avoid effect on code generation. This pull request has now been integrated. Changeset: 4b89fce0 Author: Fei Gao Committer: Pengfei Li URL: https://git.openjdk.org/jdk/commit/4b89fce0831f990d4b6af5e6e208342f68aed614 Stats: 4 lines in 2 files changed: 1 ins; 0 del; 3 mod 8291781: assert(!is_visited) failed: visit only once with -XX:+SuperWordRTDepCheck Reviewed-by: thartmann ------------- PR: https://git.openjdk.org/jdk/pull/10868 From uschindler at openjdk.org Fri Oct 28 07:38:05 2022 From: uschindler at openjdk.org (Uwe Schindler) Date: Fri, 28 Oct 2022 07:38:05 GMT Subject: RFR: 8285835: SIGSEGV in PhaseIdealLoop::build_loop_late_post_work In-Reply-To: References: Message-ID: On Fri, 28 Oct 2022 07:23:12 GMT, Roland Westrelin wrote: > Looks reasonable to me. 
Have you verified that the failure with the replay file and jars from the initial crash doesn't reproduce either? I had the same question. I commented on the issue. It would be good to understand how the test code relates to the code that fails in Lucene and possibly Ben Manes' Caffeine. Maybe a short explanation of what A, B, and C are in Lucene's code and which loop is affected would help. ------------- PR: https://git.openjdk.org/jdk/pull/10894 From fgao at openjdk.org Fri Oct 28 07:50:18 2022 From: fgao at openjdk.org (Fei Gao) Date: Fri, 28 Oct 2022 07:50:18 GMT Subject: RFR: 8291781: assert(!is_visited) failed: visit only once with -XX:+SuperWordRTDepCheck In-Reply-To: <4FLpNNSA_3fCYWK0-_5EUq35VUt3kKNu_tCw8v3l0co=.8df43322-a806-4a3c-bc3f-96e184b47914@github.com> References: <4FLpNNSA_3fCYWK0-_5EUq35VUt3kKNu_tCw8v3l0co=.8df43322-a806-4a3c-bc3f-96e184b47914@github.com> Message-ID: <7fSf0LKxzUuYfLIVaK-r5qDBEkoaKCD_VhTYLzttcFw=.1886eebf-5ca1-41cc-bf4b-db74a1d04db8@github.com> On Wed, 26 Oct 2022 08:17:00 GMT, Fei Gao wrote: > `-XX:+SuperWordRTDepCheck` is a develop flag and misses a proper implementation. But when enabled, it could change the code path, resulting in the failures in [JDK-8291781](https://bugs.openjdk.org/browse/JDK-8291781) and [JDK-8291881](https://bugs.openjdk.org/browse/JDK-8291881). As @vnkozlov suggested in [JDK-8291781](https://bugs.openjdk.org/browse/JDK-8291781), the small patch converts the flag to pure debug code to avoid any effect on code generation. The commit message edited by the bot shows only one of the reviewers. That's weird.
------------- PR: https://git.openjdk.org/jdk/pull/10868 From chagedorn at openjdk.org Fri Oct 28 08:32:33 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 28 Oct 2022 08:32:33 GMT Subject: RFR: 8293785: Add a jtreg test for TraceOptoParse In-Reply-To: References: Message-ID: On Fri, 28 Oct 2022 03:11:49 GMT, xpbob wrote: > Add a jtreg test for TraceOptoParse after [JDK-8293774](https://bugs.openjdk.org/browse/JDK-8293774) Hi Bob > Is this the jtreg test discussed here: https://github.com/openjdk/jdk/pull/10262#pullrequestreview-1106778703 ? @chhagedorn Yes, exactly. Thanks for adding such a sanity test! test/hotspot/jtreg/compiler/print/TestTraceOptoParse.java line 28: > 26: * @bug 8293785 > 27: * @summary test for -XX:+TraceOptoParse > 28: * @requires vm.debug You should also add `vm.compiler2.enabled` as the flag is C2 specific: Suggestion: * @requires vm.debug & vm.compiler2.enabled ------------- Changes requested by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10898 From roland at openjdk.org Fri Oct 28 08:33:07 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 28 Oct 2022 08:33:07 GMT Subject: RFR: 8285835: SIGSEGV in PhaseIdealLoop::build_loop_late_post_work In-Reply-To: References: Message-ID: On Fri, 28 Oct 2022 07:34:53 GMT, Uwe Schindler wrote: > Looks reasonable to me. Have you verified that the failure with the replay file and jars from the initial crash doesn't reproduce either? I tried the replay from the bug and the crash is gone with this fix. 
------------- PR: https://git.openjdk.org/jdk/pull/10894 From chagedorn at openjdk.org Fri Oct 28 08:34:36 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 28 Oct 2022 08:34:36 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases [v5] In-Reply-To: References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> Message-ID: On Wed, 19 Oct 2022 08:19:16 GMT, Christian Hagedorn wrote: >> This patch extends the IR framework with the capability to perform IR matching not only on the `PrintIdeal` and `PrintOptoAssembly` flag outputs but also on the `PrintIdeal` output of the different compile phases defined by the `COMPILER_PHASES` macro in `phasetype.hpp`: >> >> https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/opto/phasetype.hpp#L28-L29 >> >> The IR framework uses the same compile phases with the same names (it only drops those which are part of a normal execution like `PHASE_DEBUG` or `PHASE_FAILURE`). A matching on the `PrintIdeal` and `PrintOptoAssembly` flags is still possible. The IR framework creates an own `CompilePhase` enum entry for them to simplify the implementation. >> >> ## How does it work? >> >> ### Basic idea >> There is a new `phase` attribute for `@IR` annotations that allows the user to specify a list of compile phases on which the IR rule should be applied on: >> >> >> int iFld; >> >> @Test >> @IR(counts = {IRNode.STORE_I, "1"}, >> phase = {CompilePhase.AFTER_PARSING, // Fails >> CompilePhase.ITER_GVN1}) // Works >> public void optimizeStores() { >> iFld = 42; >> iFld = 42 + iFld; // Removed in first IGVN iteration and replaced by iFld = 84 >> } >> >> In this example, we apply the IR rule on the compile phases `AFTER_PARSING` and `ITER_GVN1`. 
Since the first store to `iFld` is only removed in the first IGVN iteration, the IR rule fails for `AFTER_PARSING` while it passes for `ITER_GVN1`: >> >> 1) Method "public void ir_framework.examples.IRExample.optimizeStores()" - [Failed IR rules: 1]: >> * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING, ITER_GVN1}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#STORE_I#_", "1"}, failOn={}, applyIfAnd={}, applyIfOr={}, applyIfNot={})" >> > Phase "After Parsing": >> - counts: Graph contains wrong number of nodes: >> * Constraint 1: "(\d+(\s){2}(StoreI.*)+(\s){2}===.*)" >> - Failed comparison: [found] 2 = 1 [given] >> - Matched nodes (2): >> * 25 StoreI === 5 7 24 21 [[ 31 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:3 (line 123) >> * 31 StoreI === 5 25 24 29 [[ 16 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:14 (line 124) >> >> >> More examples are shown in `IRExample.java` and also in the new test `TestPhaseIRMatching.java`. >> >> ### CompilePhase.DEFAULT - default compile phase >> The existing IR tests either match on the `PrintIdeal` and/or `PrintOptoAssembly` flag. Looking closer at the individual `@IR` rules, we can see that a single regex only matches **either** on `PrintIdeal` **or** on `PrintOptoAssembly` but never on both. Most of these regexes are taken from the class `IRNode` which currently provides default regexes for various C2 IR nodes (a default regex either matches on `PrintIdeal` or `PrintOptoAssembly`). >> >> Not only the existing IR tests but also the majority of future tests do not need this new flexibility - they simply want to default match on the `PrintIdeal` flag.
To avoid having to specify `phase = CompilePhase.PRINT_IDEAL` for each new rule (and also to avoid updating the large number of existing IR tests), we introduce a new compile phase `CompilePhase.DEFAULT` which is used by default if the user does not specify the `phase` attribute. >> >> Each entry in class `IRNode` now needs to define a default phase such that the IR framework knows which compile phase an `IRNode` should be matched on if `CompilePhase.DEFAULT` is selected. >> >> ### Different regexes for the same IRNode entry >> A regex for an IR node might look different for certain compile phases. For example, we can directly match the node name `Allocate` before macro expansion for an `Allocate` node. But we are only able to match allocations again on the `PrintOptoAssembly` output (the current option) which requires a different regex. Therefore, the `IRNode` class is redesigned in the following way: >> >> - `IRNode` entries which currently represent direct regexes are replaced by "IR node placeholder strings". These strings are just encodings recognized by the IR framework as IR nodes: >> >> public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; // Normal IR node >> public static final String ALLOC_OF = COMPOSITE_PREFIX + "ALLOC_OF" + POSTFIX; // Composite IR node >> >> - For each _IR node placeholder string_, we need to define which regex should be used for which compile phase. Additionally, we need to specify the default compile phase for this IR node. 
This is done in a static block immediately following the IR node placeholder string where we add a mapping to `IRNode.IR_NODE_MAPPINGS` (by using helper methods such as `allocNodes()`): >> >> public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; >> static { >> String idealIndependentRegex = START + "Allocate" + MID + END; >> String optoRegex = "(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java" + END; >> allocNodes(ALLOC, idealIndependentRegex, optoRegex); >> } >> >> **Thus, adding a new IR node now requires two steps: Defining an IR node placeholder string and an associated regex-compile-phase-mapping in the static block immediately following the IR node placeholder string. This mapping should reflect in which compile phases the new IR node can be found.** >> >> ### Using the IRNode entries correctly >> The IR framework enforces the correct usage of the new IR node placeholder strings. It reports format violations if: >> - An `IRNode` entry is used for a compile phase which is not supported (e.g. trying to match a `LoopNode` on the output of `CompilePhase.AFTER_PARSING`). >> - An IR node entry is used without specifying a mapping in a static block (e.g. when creating a new entry and forgetting to set up the mapping). >> - Using a user-defined regex in `@IR` without specifying a non-default compile phase in `phase`. `CompilePhase.DEFAULT` only works with IR node placeholder strings. In all other cases, the user must explicitly specify one or more non-default compile phases. >> >> ## General Changes >> The patch became larger than I had intended it to be. I've tried to split it into smaller patches but that was quite difficult and I eventually gave up. I therefore try to summarize the changes to make reviewing this simpler: >> >> - Added more packages to better group related classes together.
>> - Removed unused IGV phases in `phasetype.hpp` and sorted them in the order in which they are used throughout a compilation. >> - Updated existing, now failing, IR tests that use user-defined regexes. These now require a non-default phase. I've fixed that by replacing the user-defined regexes with `IRNode` entries (required some new `IRNode` entries for missing IR nodes). >> - Introduced a new interface `Matchable.java` (for all classes representing a part on which IR matching can be done such as `IRMethod` or `IRRule`) to use it throughout the code instead of concrete class references. This allows better code sharing together with the already added `MatchResult.java` interface. Using interface types also simplifies testing as everything is substitutable. Each `Matchable`/`MatchResult` object now just contains a list of `Matchable`/`MatchResult` objects (e.g. an `IRMethod`/`IRMethodResult` contains a list of `IRRule`/`IRRuleMatchResult` objects to represent all IR rules/IR rule match results of this method etc.) >> - Cleaned up and refactored a lot of code to use this new design. >> - Using the visitor pattern to visit the `MatchResult` objects. This simplifies the failure message and compilation output printing of the different compile phases. It allows us to perform specific operations at each level (i.e. on a method level, an IR rule level etc.). Visitors are also used in testing. >> - New compile phase related classes in package `phase`. The idea is that each `IRRule`/`IRRuleMatchResult` class contains a list of `CompilePhaseIRRule`/`CompilePhaseIRRuleMatchResult` objects for each compile phase. When `CompilePhase.DEFAULT` is used, we create or re-use `CompilePhaseIRRule` objects which represent the specified default phases of each `IRNode` entry. >> - The `FlagVM` now collects all the needed compile phases and creates a compiler directives file for the `TestVM` accordingly. We now only print the output which is also used for IR matching.
Before this change, we always emitted `PrintIdeal` and `PrintOptoAssembly`. Even when not using the new matching on compile phases, we still get a performance benefit when only using `IRNode` entries which match on either `PrintIdeal` or `PrintOptoAssembly`. >> - The failure message and compilation output are sorted alphabetically by method names and by the enum definition order of `CompilerPhase`. >> - Replaced implementation inheritance by interfaces. >> - Improved encapsulation of object data. >> - Updated README and many comments/class descriptions to reflect this new feature. >> - Added new IR framework tests >> >> ## Testing >> - Normal tier testing. >> - Applying the patch to Valhalla to perform tier testing. >> - Testing with a ZGC branch by @robcasloz which does IR matching on mach graph compile phases. This requirement was the original motivation for this RFE. Thanks Roberto for helping to test this! >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 83 commits: > > - Fix TestVectorConditionalMove > - Merge branch 'master' into JDK-8280378 > - Hao's patch to address review comments > - Roberto's review comments > - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/NonIRTestClass.java > > Co-authored-by: Roberto Castañeda Lozano > - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/constraint/raw/RawConstraint.java > > Co-authored-by: Roberto Castañeda Lozano > - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/phase/CompilePhaseIRRuleBuilder.java > > Co-authored-by: Roberto Castañeda Lozano > - Merge branch 'master' into JDK-8280378 > - Fix missing counts indentation in failure messages > - Update comments > - ... and 73 more: https://git.openjdk.org/jdk/compare/f502ab85...ae7190c4 Thanks Roberto for your review!
I'll merge master again on Monday and start some last testing before integration. ------------- PR: https://git.openjdk.org/jdk/pull/10695 From duke at openjdk.org Fri Oct 28 08:52:41 2022 From: duke at openjdk.org (xpbob) Date: Fri, 28 Oct 2022 08:52:41 GMT Subject: RFR: 8293785: Add a jtreg test for TraceOptoParse [v2] In-Reply-To: References: Message-ID: > Add a jtreg test for TraceOptoParse after [JDK-8293774](https://bugs.openjdk.org/browse/JDK-8293774) xpbob has updated the pull request incrementally with one additional commit since the last revision: add c2 flag ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10898/files - new: https://git.openjdk.org/jdk/pull/10898/files/82bd1d93..c23a321a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10898&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10898&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10898.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10898/head:pull/10898 PR: https://git.openjdk.org/jdk/pull/10898 From duke at openjdk.org Fri Oct 28 08:55:49 2022 From: duke at openjdk.org (xpbob) Date: Fri, 28 Oct 2022 08:55:49 GMT Subject: RFR: 8293785: Add a jtreg test for TraceOptoParse [v2] In-Reply-To: References: Message-ID: On Fri, 28 Oct 2022 08:52:41 GMT, xpbob wrote: >> Add a jtreg test for TraceOptoParse after [JDK-8293774](https://bugs.openjdk.org/browse/JDK-8293774) > > xpbob has updated the pull request incrementally with one additional commit since the last revision: > > add c2 flag Thanks for review. 
@chhagedorn The code has been updated ------------- PR: https://git.openjdk.org/jdk/pull/10898 From uschindler at openjdk.org Fri Oct 28 08:57:25 2022 From: uschindler at openjdk.org (Uwe Schindler) Date: Fri, 28 Oct 2022 08:57:25 GMT Subject: RFR: 8285835: SIGSEGV in PhaseIdealLoop::build_loop_late_post_work In-Reply-To: References: Message-ID: On Fri, 28 Oct 2022 08:30:56 GMT, Roland Westrelin wrote: > > Looks reasonable to me. Have you verified that the failure with the replay file and jars from the initial crash doesn't reproduce either? > > I tried the replay from the bug and the crash is gone with this fix. Thanks for confirmation. See also my comments here: https://bugs.openjdk.org/browse/JDK-8285835?focusedCommentId=14533010&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14533010 The test code in this PR looks like our Lucene code (A wraps/refers B wraps/refers C; all some nested Lucene DocValues wrappers). ------------- PR: https://git.openjdk.org/jdk/pull/10894 From chagedorn at openjdk.org Fri Oct 28 09:01:09 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 28 Oct 2022 09:01:09 GMT Subject: RFR: 8291336: Add ideal rule to convert floating point multiply by 2 into addition [v5] In-Reply-To: References: Message-ID: On Mon, 24 Oct 2022 13:41:12 GMT, SuperCoder79 wrote: >> Hello, >> I would like to propose an ideal transform that converts floating point multiply by 2 (`x * 2`) into an addition operation instead. This would allow for the elimination of the memory reference for the constant two, and keep the whole operation inside registers. My justifications for this optimization include: >> * As per [Agner Fog's instruction tables](https://www.agner.org/optimize/instruction_tables.pdf) many older systems, such as the sandy bridge and ivy bridge architectures, have different latencies for addition and multiplication meaning this change could have beneficial effects when in hot code. 
>> * The removal of the memory load would have a beneficial effect in cache-bound situations. >> * Multiplication by 2 is a relatively common construct, so this change can apply to a wide range of Java code. >> >> As this is my first time looking into the c2 codebase, I have a few lingering questions about my implementation and how certain parts of the compiler work. Mainly, is this patch getting the type of the operands correctly? I saw some cases where code used `bottom_type()` and other cases where it used `phase->type(value)`. Similarly, are nodes able to be reused as is being done in the AddNode constructors? I saw some places where the clone method was being used, but other places where it wasn't. >> >> I have attached an IR test and a jmh benchmark. Tier 1 testing passes on my machine. >> >> Thanks for your time, >> Jasmine > > SuperCoder79 has updated the pull request incrementally with one additional commit since the last revision: > > Apply changes from code review > > - Added interpreter assert Looks good! I'll quickly run some sanity testing again and then sponsor it. ------------- Marked as reviewed by chagedorn (Reviewer).
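A quick way to convince oneself that the `x * 2` to `x + x` rewrite is value-preserving for floating point (this is my own sanity sketch, not part of the patch or its IR test): under IEEE 754 arithmetic, multiplying by two is exact, so `x * 2.0` and `x + x` produce bit-identical results for every double, including signed zeros, denormals, and infinities.

```java
// Sanity sketch (not from the patch): x * 2.0 and x + x are bit-identical for
// every double under IEEE 754, which is what makes the MulD -> AddD rewrite
// semantics-preserving. Double.doubleToLongBits canonicalizes NaNs, so the
// NaN case compares equal as well.
public class MulByTwoCheck {
    static boolean sameBits(double x) {
        return Double.doubleToLongBits(x * 2.0) == Double.doubleToLongBits(x + x);
    }

    public static void main(String[] args) {
        double[] samples = {
            0.0, -0.0, 1.5, -3.25, Double.MIN_VALUE, Double.MAX_VALUE,
            Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NaN
        };
        for (double x : samples) {
            if (!sameBits(x)) {
                throw new AssertionError("mismatch for " + x);
            }
        }
        System.out.println("x * 2.0 == x + x (bit-for-bit) on all samples");
    }
}
```

Note that `-0.0` works because the sum of two negative zeros is again `-0.0`, and `Double.MAX_VALUE` works because both forms overflow to the same infinity.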
PR: https://git.openjdk.org/jdk/pull/9642 From bulasevich at openjdk.org Fri Oct 28 10:03:46 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 28 Oct 2022 10:03:46 GMT Subject: RFR: 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes [v4] In-Reply-To: References: Message-ID: > 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: arm32 fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10392/files - new: https://git.openjdk.org/jdk/pull/10392/files/59ff8c44..102e33c7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10392&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10392&range=02-03 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10392.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10392/head:pull/10392 PR: https://git.openjdk.org/jdk/pull/10392 From chagedorn at openjdk.org Fri Oct 28 10:08:09 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 28 Oct 2022 10:08:09 GMT Subject: RFR: 8291336: Add ideal rule to convert floating point multiply by 2 into addition [v5] In-Reply-To: References: Message-ID: <_6GQEtva2x1vj8ojrJ-mCaNKCf9aimB4GtifPxdvtWo=.812b3415-0f8a-40ec-83c2-9cc2451ac608@github.com> On Mon, 24 Oct 2022 13:41:12 GMT, SuperCoder79 wrote: >> Hello, >> I would like to propose an ideal transform that converts floating point multiply by 2 (`x * 2`) into an addition operation instead. This would allow for the elimination of the memory reference for the constant two, and keep the whole operation inside registers. 
My justifications for this optimization include: >> * As per [Agner Fog's instruction tables](https://www.agner.org/optimize/instruction_tables.pdf), many older systems, such as the sandy bridge and ivy bridge architectures, have different latencies for addition and multiplication, meaning this change could have beneficial effects when in hot code. >> * The removal of the memory load would have a beneficial effect in cache bound situations. >> * Multiplication by 2 is a relatively common construct, so this change can apply to a wide range of Java code. >> >> As this is my first time looking into the c2 codebase, I have a few lingering questions about my implementation and how certain parts of the compiler work. Mainly, is this patch getting the type of the operands correctly? I saw some cases where code used `bottom_type()` and other cases where it used `phase->type(value)`. Similarly, are nodes able to be reused as is being done in the AddNode constructors? I saw some places where the clone method was being used, but other places where it wasn't. >> >> I have attached an IR test and a jmh benchmark. Tier 1 testing passes on my machine. >> >> Thanks for your time, >> Jasmine > > SuperCoder79 has updated the pull request incrementally with one additional commit since the last revision: > > Apply changes from code review > > - Added interpreter assert Testing looked good! ------------- PR: https://git.openjdk.org/jdk/pull/9642 From duke at openjdk.org Fri Oct 28 10:08:11 2022 From: duke at openjdk.org (SuperCoder79) Date: Fri, 28 Oct 2022 10:08:11 GMT Subject: Integrated: 8291336: Add ideal rule to convert floating point multiply by 2 into addition In-Reply-To: References: Message-ID: <9iUP2UM-DSlgwc2QiNd1A45GQGzonjmGA06T3lxcb2I=.b658677f-4155-4bd3-9d40-157ec735beb6@github.com> On Tue, 26 Jul 2022 15:39:42 GMT, SuperCoder79 wrote: > Hello, > I would like to propose an ideal transform that converts floating point multiply by 2 (`x * 2`) into an addition operation instead. 
This would allow for the elimination of the memory reference for the constant two, and keep the whole operation inside registers. My justifications for this optimization include: > * As per [Agner Fog's instruction tables](https://www.agner.org/optimize/instruction_tables.pdf), many older systems, such as the sandy bridge and ivy bridge architectures, have different latencies for addition and multiplication, meaning this change could have beneficial effects when in hot code. > * The removal of the memory load would have a beneficial effect in cache bound situations. > * Multiplication by 2 is a relatively common construct, so this change can apply to a wide range of Java code. > > As this is my first time looking into the c2 codebase, I have a few lingering questions about my implementation and how certain parts of the compiler work. Mainly, is this patch getting the type of the operands correctly? I saw some cases where code used `bottom_type()` and other cases where it used `phase->type(value)`. Similarly, are nodes able to be reused as is being done in the AddNode constructors? I saw some places where the clone method was being used, but other places where it wasn't. > > I have attached an IR test and a jmh benchmark. Tier 1 testing passes on my machine. > > Thanks for your time, > Jasmine This pull request has now been integrated. 
Changeset: cf5546b3 Author: SuperCoder79 <25208576+SuperCoder7979 at users.noreply.github.com> Committer: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/cf5546b3ac63e305c0b9d040353503fb33d6ad7a Stats: 154 lines in 4 files changed: 154 ins; 0 del; 0 mod 8291336: Add ideal rule to convert floating point multiply by 2 into addition Reviewed-by: qamai, thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/9642 From dnsimon at openjdk.org Fri Oct 28 10:16:19 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 28 Oct 2022 10:16:19 GMT Subject: RFR: 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes [v4] In-Reply-To: References: Message-ID: <_owwqtFSHNCLByB8hIE_DfOFgmfkaykOr0MTjcxlnKA=.66a9f7cc-597c-4493-b71b-f6ae1492c5aa@github.com> On Fri, 28 Oct 2022 10:03:46 GMT, Boris Ulasevich wrote: >> 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > arm32 fix Marked as reviewed by dnsimon (Committer). src/hotspot/share/asm/codeBuffer.hpp line 457: > 455: _stubs.initialize_outer(this, SECT_STUBS); > 456: > 457: // default value. should be changed if vectorization requires large aligned constants I would not make this comment so vectorization specific. Just say something like: // Default is to align on 8 bytes. A compiler can change this // if larger alignment (e.g., 32-byte vector masks) is required. 
------------- PR: https://git.openjdk.org/jdk/pull/10392 From jiefu at openjdk.org Fri Oct 28 11:30:03 2022 From: jiefu at openjdk.org (Jie Fu) Date: Fri, 28 Oct 2022 11:30:03 GMT Subject: RFR: 8296030: compiler/c2/irTests/TestVectorizeTypeConversion.java fails with release VMs after JDK-8291781 Message-ID: Hi all, compiler/c2/irTests/TestVectorizeTypeConversion.java fails with release VMs due to 'SuperWordRTDepCheck' is develop and is available only in debug version of VM. To fix it, `-XX:+IgnoreUnrecognizedVMOptions` is added in the test. Thanks. Best regards, Jie ------------- Commit messages: - 8296030: compiler/c2/irTests/TestVectorizeTypeConversion.java fails with release VMs after JDK-8291781 Changes: https://git.openjdk.org/jdk/pull/10900/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10900&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296030 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10900.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10900/head:pull/10900 PR: https://git.openjdk.org/jdk/pull/10900 From xlinzheng at openjdk.org Fri Oct 28 12:00:41 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Fri, 28 Oct 2022 12:00:41 GMT Subject: Integrated: 8295926: RISC-V: C1: Fix LIRGenerator::do_LibmIntrinsic In-Reply-To: References: Message-ID: On Wed, 26 Oct 2022 04:57:11 GMT, Xiaolin Zheng wrote: > The ported logic of LIRGenerator::do_LibmIntrinsic has a correctness problem, which will kill argument registers when the current libm intrinsic's operand is also a libm intrinsic, such as: > > (dpow val1 (dlog val2)) > > LIRItem walks operands, so the `value.load_item_force(cc->at(0));` should be moved below after the LIRItem, or the result of `cc->at(0)` would be killed. But we might as well keep aligning AArch64's style to reduce some maintenance work. 
> > > Reproducer: > > > public class A { > > static int count = 0; > > public static void print(double var) { > if (count % 10000 == 0) { > System.out.println(var); > } > count++; > } > > public static void a(double var1, double var2, double var3) { > double var4 = Math.pow(var3, Math.log(var1 / var2)); > print(var4); > } > > public static void main(String[] args) { > > for (int i = 0; i < 50000; i++) { > double var21 = 2.2250738585072014E-308D; > double var15 = 1.1102230246251565E-16D; > double d1 = 2.0D; > A.a(var21, var15, d1); > } > > } > > } > > > The right answer is > > 6.461124611136231E-203 > 6.461124611136231E-203 > 6.461124611136231E-203 > 6.461124611136231E-203 > 6.461124611136231E-203 > > > The current backend gives > > 6.461124611136231E-203 > NaN > NaN > NaN > NaN > > > Testing a hotspot tier1~4 on qemu. > > Thanks, > Xiaolin This pull request has now been integrated. Changeset: 1fdbb1ba Author: Xiaolin Zheng Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/1fdbb1ba337b07dbcfb3c28c4fdeba74fee113dc Stats: 100 lines in 2 files changed: 95 ins; 4 del; 1 mod 8295926: RISC-V: C1: Fix LIRGenerator::do_LibmIntrinsic Reviewed-by: yadongwang, fyang ------------- PR: https://git.openjdk.org/jdk/pull/10867 From chagedorn at openjdk.org Fri Oct 28 12:26:31 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 28 Oct 2022 12:26:31 GMT Subject: RFR: 8293785: Add a jtreg test for TraceOptoParse [v2] In-Reply-To: References: Message-ID: On Fri, 28 Oct 2022 08:52:41 GMT, xpbob wrote: >> Add a jtreg test for TraceOptoParse after [JDK-8293774](https://bugs.openjdk.org/browse/JDK-8293774) > > xpbob has updated the pull request incrementally with one additional commit since the last revision: > > add c2 flag That looks good! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10898 From chagedorn at openjdk.org Fri Oct 28 12:28:28 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 28 Oct 2022 12:28:28 GMT Subject: RFR: 8296030: compiler/c2/irTests/TestVectorizeTypeConversion.java fails with release VMs after JDK-8291781 In-Reply-To: References: Message-ID: On Fri, 28 Oct 2022 11:21:07 GMT, Jie Fu wrote: > Hi all, > > compiler/c2/irTests/TestVectorizeTypeConversion.java fails with release VMs due to 'SuperWordRTDepCheck' is develop and is available only in debug version of VM. > To fix it, `-XX:+IgnoreUnrecognizedVMOptions` is added in the test. > > Thanks. > Best regards, > Jie Looks good and trivial! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10900 From roland at openjdk.org Fri Oct 28 12:42:29 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 28 Oct 2022 12:42:29 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization Message-ID: This change is mostly the same I sent for review 3 years ago but was never integrated: https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2019-May/033803.html The main difference is that, in the meantime, I submitted a couple of refactoring changes extracted from the 2019 patch: 8266550: C2: mirror TypeOopPtr/TypeInstPtr/TypeAryPtr with TypeKlassPtr/TypeInstKlassPtr/TypeAryKlassPtr 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses As a result, the current patch is much smaller (but still not small). The implementation is otherwise largely the same as in the 2019 patch. I tried to remove some of the code duplication between the TypeOopPtr and TypeKlassPtr hierarchies by having some of the logic shared in template methods. In the 2019 patch, interfaces were trusted when types were constructed and I had added code to drop interfaces from a type where they couldn't be trusted. 
This new patch proceeds the other way around: interfaces are not trusted when a type is constructed and code that uses the type must explicitly request that they are included (this was suggested as an improvement by Vladimir Ivanov I think). ------------- Commit messages: - whitespaces - interfaces Changes: https://git.openjdk.org/jdk/pull/10901/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10901&range=00 Issue: https://bugs.openjdk.org/browse/JDK-6312651 Stats: 1542 lines in 20 files changed: 702 ins; 491 del; 349 mod Patch: https://git.openjdk.org/jdk/pull/10901.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10901/head:pull/10901 PR: https://git.openjdk.org/jdk/pull/10901 From jiefu at openjdk.org Fri Oct 28 12:50:23 2022 From: jiefu at openjdk.org (Jie Fu) Date: Fri, 28 Oct 2022 12:50:23 GMT Subject: RFR: 8296030: compiler/c2/irTests/TestVectorizeTypeConversion.java fails with release VMs after JDK-8291781 In-Reply-To: References: Message-ID: On Fri, 28 Oct 2022 12:26:19 GMT, Christian Hagedorn wrote: > Looks good and trivial! Thanks @chhagedorn for the review. ------------- PR: https://git.openjdk.org/jdk/pull/10900 From jiefu at openjdk.org Fri Oct 28 12:50:23 2022 From: jiefu at openjdk.org (Jie Fu) Date: Fri, 28 Oct 2022 12:50:23 GMT Subject: Integrated: 8296030: compiler/c2/irTests/TestVectorizeTypeConversion.java fails with release VMs after JDK-8291781 In-Reply-To: References: Message-ID: On Fri, 28 Oct 2022 11:21:07 GMT, Jie Fu wrote: > Hi all, > > compiler/c2/irTests/TestVectorizeTypeConversion.java fails with release VMs due to 'SuperWordRTDepCheck' is develop and is available only in debug version of VM. > To fix it, `-XX:+IgnoreUnrecognizedVMOptions` is added in the test. > > Thanks. > Best regards, > Jie This pull request has now been integrated. 
Changeset: 754bd531 Author: Jie Fu URL: https://git.openjdk.org/jdk/commit/754bd53137a1c596e6f1a7debb847cd563d95699 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8296030: compiler/c2/irTests/TestVectorizeTypeConversion.java fails with release VMs after JDK-8291781 Reviewed-by: chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/10900 From jiefu at openjdk.org Fri Oct 28 13:16:48 2022 From: jiefu at openjdk.org (Jie Fu) Date: Fri, 28 Oct 2022 13:16:48 GMT Subject: RFR: 8293785: Add a jtreg test for TraceOptoParse [v2] In-Reply-To: References: Message-ID: On Fri, 28 Oct 2022 08:52:41 GMT, xpbob wrote: >> Add a jtreg test for TraceOptoParse after [JDK-8293774](https://bugs.openjdk.org/browse/JDK-8293774) > > xpbob has updated the pull request incrementally with one additional commit since the last revision: > > add c2 flag Looks good to me too. Thanks for adding this test. ------------- Marked as reviewed by jiefu (Reviewer). PR: https://git.openjdk.org/jdk/pull/10898 From duke at openjdk.org Fri Oct 28 13:16:48 2022 From: duke at openjdk.org (xpbob) Date: Fri, 28 Oct 2022 13:16:48 GMT Subject: RFR: 8293785: Add a jtreg test for TraceOptoParse [v2] In-Reply-To: References: Message-ID: On Fri, 28 Oct 2022 12:22:28 GMT, Christian Hagedorn wrote: >> xpbob has updated the pull request incrementally with one additional commit since the last revision: >> >> add c2 flag > > That looks good! @chhagedorn @DamonFool Thanks for review. ------------- PR: https://git.openjdk.org/jdk/pull/10898 From duke at openjdk.org Fri Oct 28 13:21:31 2022 From: duke at openjdk.org (xpbob) Date: Fri, 28 Oct 2022 13:21:31 GMT Subject: Integrated: 8293785: Add a jtreg test for TraceOptoParse In-Reply-To: References: Message-ID: On Fri, 28 Oct 2022 03:11:49 GMT, xpbob wrote: > Add a jtreg test for TraceOptoParse after [JDK-8293774](https://bugs.openjdk.org/browse/JDK-8293774) This pull request has now been integrated. 
Changeset: 823fd4a9 Author: bobpengxie Committer: Jie Fu URL: https://git.openjdk.org/jdk/commit/823fd4a9dff52e8072b032ae6ddcab74d118185a Stats: 39 lines in 1 file changed: 39 ins; 0 del; 0 mod 8293785: Add a jtreg test for TraceOptoParse Reviewed-by: chagedorn, jiefu ------------- PR: https://git.openjdk.org/jdk/pull/10898 From tholenstein at openjdk.org Fri Oct 28 13:25:58 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 28 Oct 2022 13:25:58 GMT Subject: RFR: JDK-8290063: IGV: Give the graphs a unique number in the outline Message-ID: Some graphs may have the same name more than once in IGV. To make it clearer which graph is currently open, all graphs within a group are now enumerated with `1.`, `2.`, `3.`, etc. Similarly, groups are enumerated with `1 -`, `2 -`, etc. overview To make it even easier to distinguish different graphs and groups, we can now rename them. To rename them: 1. click on a graph or group once to select it (does not have to be opened). 2. click a second time on the selected graph and wait 1-2 seconds. 3. now you can rename the graph rename_group rename_graph The numbering always starts at 1 and is continuous from 1 to N for N graphs. When a graph is deleted, the numbering of the following graphs changes. This implementation allows keeping the XML format unchanged, because the numbering is only local and not part of the name. However, if a graph/group is renamed, the name in the XML file will also change when it is saved. # Implementation The renaming is simply enabled by overriding `canRename() {return true;}` in `FolderNode` and `GraphNode`. The numbering is implemented in `getDisplayName()` in `Group` and `InputGraph` by concatenating the index of the group/graph with the name. When a group/graph is deleted, we trigger an update of the index in `FolderNode` -> `destroyNodes(Node[] nodes)` by calling `node.setDisplayName(node.getDisplayName())` for all nodes. 
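The numbering scheme described above can be modeled with a small standalone sketch (plain Java, not the actual NetBeans-based IGV classes; all names are invented): the display name is derived from the current position, so deleting an entry renumbers the ones after it, and renaming only changes the stored name, never the number.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a group holding graphs: displayName() concatenates
// the 1-based index with the stored name, in the spirit of the
// getDisplayName() scheme described above. Because the index is computed
// on the fly, the numbering stays continuous from 1 to N after a deletion.
public class GroupModel {
    private final List<String> graphNames = new ArrayList<>();

    void addGraph(String name)               { graphNames.add(name); }
    void removeGraph(int index)              { graphNames.remove(index); }
    void renameGraph(int index, String name) { graphNames.set(index, name); }

    // Display name = "<1-based index>. <stored name>"
    String displayName(int index) {
        return (index + 1) + ". " + graphNames.get(index);
    }
}
```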
Refreshing the group/graph name in the `EditorTopComponent` of the opened graph is a bit tricky. It is implemented by adding a `Listener` to the `getDisplayNameChangedEvent()` of the currently opened `InputGraph` in `DiagramViewModel`. `getDisplayNameChangedEvent()` is fired whenever the name or the group name of the corresponding `InputGraph` is changed. ------------- Commit messages: - start number at 1 (instead of 0) - addTitleCallback() - refactor - fix renaming - getDisplayNameChangedEvent() added - rename groups and graphs - JDK-8290063: IGV: Give the graphs a unique number in the outline Changes: https://git.openjdk.org/jdk/pull/10873/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10873&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290063 Stats: 207 lines in 9 files changed: 146 ins; 31 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/10873.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10873/head:pull/10873 PR: https://git.openjdk.org/jdk/pull/10873 From bkilambi at openjdk.org Fri Oct 28 13:35:32 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Fri, 28 Oct 2022 13:35:32 GMT Subject: RFR: 8288107: Auto-vectorization for integer min/max [v2] In-Reply-To: References: Message-ID: On Fri, 15 Jul 2022 10:44:52 GMT, Bhavana Kilambi wrote: >> When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop (if vectorizable and relevant ISA is available) using the vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated. 
On x86-64, vectorization support for min/max operations is available only in SSE4 (where pmaxsd/pminsd are generated) and AVX version >= 1 (where vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized. In cases where the loop is not vectorizable or when the Math.max/min operations are called outside of the loop, cmp-cmove instructions are generated (tested on aarch64, x86-64 machines which have cmp-cmove instructions defined for the scalar MaxI/MinI nodes). Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below : >> >>
Before this patch >> >> **aarch64:** >> ``` >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 1593.510 ± 1.488 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 1593.123 ± 1.365 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1593.112 ± 0.985 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1593.290 ± 1.219 ns/op >> >> >> **x86-64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 2084.717 ± 4.780 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 2087.322 ± 4.158 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2084.568 ± 4.838 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2086.595 ± 4.025 ns/op >> >>
>> >>
After this patch >> >> **aarch64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 323.911 ± 0.206 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 324.084 ± 0.231 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 323.892 ± 0.234 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 323.990 ± 0.295 ns/op >> >> >> **x86-64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 387.639 ± 0.512 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 387.999 ± 0.740 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 387.605 ± 0.376 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 387.765 ± 0.498 ns/op >> >>
>> >> With auto-vectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without this patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. >> >>
Performance numbers >> **aarch64:** >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 1449.792 ± 1.072 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 1450.636 ± 1.057 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1450.214 ± 1.093 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1450.615 ± 1.098 ns/op >> >> >> **x86-64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 2059.673 ± 4.726 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 2059.853 ± 4.754 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2059.920 ± 4.658 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2059.622 ± 4.768 ns/op >> >>
>> >> There is no degradation when vectorization is disabled. > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > 8288107: Auto-vectorization for integer min/max > > When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop (if vectorizable and relevant ISA is available) using the vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated. > A test to assess the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max operations is available only in SSE4 (where pmaxsd/pminsd are generated) and AVX version >= 1 (where vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized. In cases where the loop is not vectorizable or when the Math.max/min operations are called outside of the loop, cmp-cmove instructions are generated (tested on aarch64, x86-64 machines which have cmp-cmove instructions defined for the scalar MaxI/MinI nodes). Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below: > > Before this patch: > aarch64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 1593.510 ± 1.488 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 1593.123 ± 1.365 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1593.112 ± 0.985 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1593.290 ± 1.219 ns/op > > x86-64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 2084.717 ± 4.780 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 2087.322 ± 4.158 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2084.568 ± 4.838 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2086.595 ± 4.025 ns/op > > After this patch: > aarch64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 323.911 ± 0.206 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 324.084 ± 0.231 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 323.892 ± 0.234 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 323.990 ± 0.295 ns/op > > x86-64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 387.639 ± 0.512 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 387.999 ± 0.740 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 387.605 ± 0.376 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 387.765 ± 0.498 ns/op > > With auto-vectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without this patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. The performance numbers are shown below: > aarch64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 1449.792 ± 1.072 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 1450.636 ± 1.057 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1450.214 ± 1.093 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1450.615 ± 1.098 ns/op > > x86-64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 2059.673 ± 4.726 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 2059.853 ± 4.754 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2059.920 ± 4.658 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2059.622 ± 4.768 ns/op > There is no degradation when vectorization is disabled. 
Hello, I have debugged the problem and the issue seems to be with the wrong datatype being considered by the MinV/MaxV nodes during autovectorization. The integer inputs to the min/max operation are being downconverted to "short" (thereby losing the information present in higher order bits), leading to incorrect results. Plus, the Java Math API for min/max operations does not support short/byte subword types. I am working on a patch to fix this and will raise a PR soon. Thank you. ------------- PR: https://git.openjdk.org/jdk/pull/9466 From jiefu at openjdk.org Fri Oct 28 13:57:36 2022 From: jiefu at openjdk.org (Jie Fu) Date: Fri, 28 Oct 2022 13:57:36 GMT Subject: RFR: 8295970: Add jdk_vector tests in GHA [v2] In-Reply-To: References: <8uMsheGwjIBbgf1nJeYkCR4Pt_ddbnq4WVKMRPwS7C0=.c59b10eb-67e1-439c-936e-857c8430c520@github.com> Message-ID: On Fri, 28 Oct 2022 13:41:28 GMT, Erik Joelsson wrote: > I think you need to add at least one label other than `build` to this now to make sure the right people can have a say in the change. Done. Thanks @erikj79. 
------------- PR: https://git.openjdk.org/jdk/pull/10879 From bulasevich at openjdk.org Fri Oct 28 14:38:56 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 28 Oct 2022 14:38:56 GMT Subject: RFR: 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes [v5] In-Reply-To: References: Message-ID: > 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: comment update ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10392/files - new: https://git.openjdk.org/jdk/pull/10392/files/102e33c7..19be053f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10392&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10392&range=03-04 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10392.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10392/head:pull/10392 PR: https://git.openjdk.org/jdk/pull/10392 From roland at openjdk.org Fri Oct 28 14:44:03 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 28 Oct 2022 14:44:03 GMT Subject: RFR: 8294217: Assertion failure: parsing found no loops but there are some Message-ID: <8wgYaLn82fk_CgKacAQsEygK63k6KDDKYXf8m4cv_OM=.d04ae5d7-b3f8-47b3-bec8-aaaa4234f018@github.com> This was reported on JDK 11 and is not reproducible with the current jdk. The reason is that the PhaseIdealLoop invocation before EA was changed from LoopOptsNone to LoopOptsMaxUnroll. In the absence of loops, LoopOptsMaxUnroll exits earlier than LoopOptsNone. That wasn't intended and this patch makes sure they behave the same. Once that's changed, the crash reproduces with the current jdk. The assert fires because PhaseIdealLoop::only_has_infinite_loops() returns false even though the IR only has infinite loops. There's a single loop nest and the innermost loop is an infinite loop. 
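A hypothetical Java shape with that structure (invented for illustration; this is not the reproducer attached to the bug) looks like the following:

```java
// Hypothetical shapes showing what only_has_infinite_loops() has to
// distinguish. In infiniteInner(), the only exit condition belongs to the
// outer counted loop; the inner while (true) has no exit edge at all, so
// C2 caps it with a NeverBranch. A check that only inspects loops that
// are direct children of the loop-tree root sees the outer loop, misses
// the infinite inner one, and can report "real" loops incorrectly.
public class LoopShapes {
    // Never call this method: it does not terminate.
    static void infiniteInner() {
        for (int i = 0; i < 10; i++) {   // outer loop, direct child of the root
            while (true) { }             // infinite inner loop, no exit edge
        }
    }

    // The same nest with a genuine branch out of the inner loop; here
    // reporting that the method has (finite) loops is correct.
    static int finiteInner() {
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            int j = 0;
            while (true) {
                if (++j == 3) break;     // real branch out of the inner loop
            }
            sum += j;
        }
        return sum;
    }
}
```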
The current logic only looks at loops that are direct children of the root of the loop tree. It's not the first bug where PhaseIdealLoop::only_has_infinite_loops() fails to catch an infinite loop (8257574 was the previous one) and it's proving challenging to have PhaseIdealLoop::only_has_infinite_loops() handle corner cases robustly. I reworked PhaseIdealLoop::only_has_infinite_loops() once more. This time it goes over all children of the root of the loop tree and collects all controls for the loop and its inner loops. It then checks whether any control is a branch out of the loop and, if so, whether that branch is something other than a NeverBranch. ------------- Commit messages: - test - fix - reproduce Changes: https://git.openjdk.org/jdk/pull/10904/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10904&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8294217 Stats: 111 lines in 2 files changed: 96 ins; 3 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/10904.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10904/head:pull/10904 PR: https://git.openjdk.org/jdk/pull/10904 From roland at openjdk.org Fri Oct 28 15:35:38 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 28 Oct 2022 15:35:38 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: Message-ID: > This change is mostly the same as the one I sent for review 3 years ago but was > never integrated: > > https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2019-May/033803.html > > The main difference is that, in the meantime, I submitted a couple of > refactoring changes extracted from the 2019 patch: > > 8266550: C2: mirror TypeOopPtr/TypeInstPtr/TypeAryPtr with TypeKlassPtr/TypeInstKlassPtr/TypeAryKlassPtr > 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses > > As a result, the current patch is much smaller (but still not small). > > The implementation is otherwise largely the same as in the 2019 > patch. 
I tried to remove some of the code duplication between the > TypeOopPtr and TypeKlassPtr hierarchies by having some of the logic > shared in template methods. In the 2019 patch, interfaces were trusted > when types were constructed and I had added code to drop interfaces > from a type where they couldn't be trusted. This new patch proceeds > the other way around: interfaces are not trusted when a type is > constructed and code that uses the type must explicitly request that > they are included (this was suggested as an improvement by Vladimir > Ivanov I think). Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: build fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10901/files - new: https://git.openjdk.org/jdk/pull/10901/files/27986018..c8927519 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10901&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10901&range=00-01 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10901.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10901/head:pull/10901 PR: https://git.openjdk.org/jdk/pull/10901 From kvn at openjdk.org Fri Oct 28 17:51:25 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 28 Oct 2022 17:51:25 GMT Subject: RFR: 8285835: SIGSEGV in PhaseIdealLoop::build_loop_late_post_work In-Reply-To: References: Message-ID: <2vJqdKWjE62BrMdIERvrbif_x5Cxykbd6ojUTPXlJuQ=.045a86d5-299e-4a8d-82ee-772930ec3b55@github.com> On Thu, 27 Oct 2022 23:14:11 GMT, Vladimir Kozlov wrote: > EA does not adjust NSR (not_scalar_replaceable) state for referenced allocations. > In the test case, object A is NSR because it merges with a NULL object. But this state is not propagated to allocations it references. As a result, other allocations are marked scalar replaceable and a related Load node is moved above the guarding condition (where object A is checked for NULL). > EA should propagate NSR state. 
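The NULL-merge pattern described in the report can be sketched with a hypothetical shape (invented names; this is not the actual reproducer): object `a` merges with null at a Phi, so it is not scalar replaceable, and the object it references must inherit that state so that the guarded load cannot be scheduled above the null check.

```java
// Illustrative shape only. `a` merges with null after the branch, so it
// is NSR. The `Box` it references must also be treated as NSR: the load
// of a.box.value is only reachable behind the null check, and if `box`
// were wrongly scalar replaced, the Load could float above that guard.
public class NsrPropagation {
    static class Box { int value; Box(int v) { value = v; } }
    static class A  { Box box;   A(int v)   { box = new Box(v); } }

    static int test(boolean flag) {
        A a = flag ? new A(42) : null;   // `a` merges with null -> NSR
        if (a != null) {                 // guard the Load must stay below
            return a.box.value;
        }
        return -1;
    }
}
```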
> > Thanks to @rwestrel who provided a reproducer test case. > > Testing tier1-4, xcomp, stress. I ran the replay and jar files from Tobias's example and looked at the EA outputs. I found an Allocation which was not marked NSR:

3310 Allocate === 4539 2045 2046 386 1 (...) [[ ... ]] rawptr:NotNull ( int:>=0, java/lang/Object:NotNull *, bool, top, bool ) Lucene90DocValuesProducer::getNumericValues @ bci:141 (line 647) Lucene90DocValuesProducer::getSortedNumeric @ bci:53 (line 1300) Lucene90DocValuesProducer::getSortedNumeric @ bci:19 (line 1287) AssertingDocValuesFormat$AssertingDocValuesProducer::getSortedNumeric @ bci:45 (line 270) PerFieldDocValuesFormat$FieldsReader::getSortedNumeric @ bci:27 (line 346) CodecReader::getSortedNumericDocValues @ bci:24 (line 176) DocValues::getSortedNumeric @ bci:2 (line 295) RangeFacetCounts::count @ bci:52 (line 122) !jvms: DirectMonotonicReader::getInstance @ bci:4 (line 101) Lucene90DocValuesProducer::getSortedNumeric @ bci:47 (line 1298) Lucene90DocValuesProducer::getSortedNumeric @ bci:19 (line 1287) AssertingDocValuesFormat$AssertingDocValuesProducer::getSortedNumeric @ bci:45 (line 270) PerFieldDocValuesFormat$FieldsReader::getSortedNumeric @ bci:27 (line 346) CodecReader::getSortedNumericDocValues @ bci:24 (line 176) DocValues::getSortedNumeric @ bci:2 (line 295) RangeFacetCounts::count @ bci:52 (line 122)

LocalVar(491) [ 3310P [ 2056 ]]

3313 Proj === 3310 [[ 4542 2056 ]] #5 !jvms: DirectMonotonicReader::getInstance @ bci:14 (line 102) Lucene90DocValuesProducer::getSortedNumeric @ bci:47 (line 1298) Lucene90DocValuesProducer::getSortedNumeric @ bci:19 (line 1287) AssertingDocValuesFormat$AssertingDocValuesProducer::getSortedNumeric @ bci:45 (line 270) PerFieldDocValuesFormat$FieldsReader::getSortedNumeric @ bci:27 (line 346) CodecReader::getSortedNumericDocValues @ bci:24 (line 176) DocValues::getSortedNumeric @ bci:2 (line 295) RangeFacetCounts::count @ bci:52 (line 122)

2056 CheckCastPP === 3312 3313 [[ ...
]] #org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer$15:NotNull:exact * Oop:org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer$15:NotNull:exact * !jvms: String::compareTo @ bci:5 (line 141) TreeMap::getEntry @ bci:37 (line 350) TreeMap::get @ bci:2 (line 279) PerFieldDocValuesFormat$FieldsReader::getSortedNumeric @ bci:8 (line 345) CodecReader::getSortedNumericDocValues @ bci:24 (line 176) DocValues::getSortedNumeric @ bci:2 (line 295) RangeFacetCounts::count @ bci:52 (line 122)

This allocation is also listed in the bad dominators output. After the fix, the replay passed and the allocation is marked as NSR:

JavaObject(59) NoEscape(NoEscape) is NSR. is stored into field with NSR base JavaObject(59) NoEscape(NoEscape) NSR [ 6054F 6052F 6055F 6556F 6537F 2723F 5743F 2807F 5645F [ 3313 2056 6291 6290 5740 5642 4937 4776 3732 3549 2542 2336 2547 2341 ]]

3310 Allocate === 4539 2045 2046 386 1 (2717 4540 761 1 1 772 773 758 774 775 776 777 1 1 1 1 1 1 1 1 1 1 1 778 773 1 1 1 1 1 1 1 1 779 1 1 1 1 1 780 781 1 833 1 1 780 1 1 2047 3308 1 1 1 ) [[ 5280 4367 3075 6056 2054 3313 ]] rawptr:NotNull ( int:>=0, java/lang/Object:NotNull *, bool, top, bool ) Lucene90DocValuesProducer::getNumericValues @ bci:141 (line 647) Lucene90DocValuesProducer::getSortedNumeric @ bci:53 (line 1300) Lucene90DocValuesProducer::getSortedNumeric @ bci:19 (line 1287) AssertingDocValuesFormat$AssertingDocValuesProducer::getSortedNumeric @ bci:45 (line 270) PerFieldDocValuesFormat$FieldsReader::getSortedNumeric @ bci:27 (line 346) CodecReader::getSortedNumericDocValues @ bci:24 (line 176) DocValues::getSortedNumeric @ bci:2 (line 295) RangeFacetCounts::count @ bci:52 (line 122) !jvms: DirectMonotonicReader::getInstance @ bci:4 (line 101) Lucene90DocValuesProducer::getSortedNumeric @ bci:47 (line 1298) Lucene90DocValuesProducer::getSortedNumeric @ bci:19 (line 1287) AssertingDocValuesFormat$AssertingDocValuesProducer::getSortedNumeric @ bci:45 (line 270)
PerFieldDocValuesFormat$FieldsReader::getSortedNumeric @ bci:27 (line 346) CodecReader::getSortedNumericDocValues @ bci:24 (line 176) DocValues::getSortedNumeric @ bci:2 (line 295) RangeFacetCounts::count @ bci:52 (line 122) ------------- PR: https://git.openjdk.org/jdk/pull/10894 From duke at openjdk.org Fri Oct 28 19:52:03 2022 From: duke at openjdk.org (vpaprotsk) Date: Fri, 28 Oct 2022 19:52:03 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: On Thu, 27 Oct 2022 05:10:59 GMT, Jatin Bhateja wrote: >> vpaprotsk has updated the pull request incrementally with one additional commit since the last revision: >> >> extra whitespace character > src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 175: > >> 173: // Choice of 1024 is arbitrary, need enough data blocks to amortize conversion overhead >> 174: // and not affect platforms without intrinsic support >> 175: int blockMultipleLength = (len/BLOCK_LENGTH) * BLOCK_LENGTH; > Since Poly processes 16-byte chunks, a strength-reduced version of the above expression could be `len & ~(BLOCK_LENGTH - 1)`. I guess I've got no issue with either version.. I was mostly thinking about code clarity? I think your version is 'more reliable' so just gonna switch it, thanks. > test/micro/org/openjdk/bench/javax/crypto/full/Poly1305DigestBench.java line 94: > >> 92: throw new RuntimeException(ex); >> 93: } >> 94: } > On CLX, the patch shows a performance regression of about 10% for block sizes 1024-2048+.
> > CLX (Non-IFMA target)
> >
> > Baseline (JDK-20):-
> >
> > Benchmark (dataSize) (provider) Mode Cnt Score Error Units
> Poly1305DigestBench.digest 64 thrpt 2 3128928.978 ops/s
> Poly1305DigestBench.digest 256 thrpt 2 1526452.083 ops/s
> Poly1305DigestBench.digest 1024 thrpt 2 509267.401 ops/s
> Poly1305DigestBench.digest 2048 thrpt 2 305784.922 ops/s
> Poly1305DigestBench.digest 4096 thrpt 2 142175.885 ops/s
> Poly1305DigestBench.digest 8192 thrpt 2 72142.906 ops/s
> Poly1305DigestBench.digest 16384 thrpt 2 36357.000 ops/s
> Poly1305DigestBench.digest 1048576 thrpt 2 676.142 ops/s
> >
> > Withopt:
> > Benchmark (dataSize) (provider) Mode Cnt Score Error Units
> Poly1305DigestBench.digest 64 thrpt 2 3136204.416 ops/s
> Poly1305DigestBench.digest 256 thrpt 2 1683221.124 ops/s
> Poly1305DigestBench.digest 1024 thrpt 2 457432.172 ops/s
> Poly1305DigestBench.digest 2048 thrpt 2 277563.817 ops/s
> Poly1305DigestBench.digest 4096 thrpt 2 149393.357 ops/s
> Poly1305DigestBench.digest 8192 thrpt 2 79463.734 ops/s
> Poly1305DigestBench.digest 16384 thrpt 2 41083.730 ops/s
> Poly1305DigestBench.digest 1048576 thrpt 2 705.419 ops/s

Odd, I measured it on `11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz`, will go again ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Oct 28 20:23:41 2022 From: duke at openjdk.org (vpaprotsk) Date: Fri, 28 Oct 2022 20:23:41 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: On Thu, 27 Oct 2022 09:33:32 GMT, Jatin Bhateja wrote: >> vpaprotsk has updated the pull request incrementally with one additional commit since the last revision: >> >> extra whitespace character > src/hotspot/cpu/x86/macroAssembler_x86_poly.cpp line 849: > >> 847: jcc(Assembler::less, L_process16Loop); >> 848: >> 849: poly1305_process_blocks_avx512(input, length, > Since the entire
code is based on 512-bit encoding, the misalignment penalty may be costly here. A scalar peel handling (as done in the tail) for the input portion before a 64-byte-aligned address could further improve the performance for large block sizes. Hmm.. interesting. Is this for loading? `evmovdquq` vs `evmovdqaq`? I was actually looking at using evmovdqaq but there is no encoding for it yet (and just looking now on uops.info, they seem to have identical timings? perhaps their measurements are off..). There are quite a few optimizations I tried (and removed) here, but not this one.. Perhaps to have a record, while it's relatively fresh in my mind.. since there is an 8-block vector multiply (I deleted a 16-block vector multiply), one can have a peeled-off version for just 256 as the minimum payload.. In that case we only need R^1..R^8 (not R^1..R^16). I also tried a loop stride of 8 blocks instead of 16, but that gets quite a bit slower (20ish%?).. There was also a version that did a much better interleaving of multiplication and loading of the next message block into limbs.. There is potentially a better way to 'devolve' the vector loop at the tail; i.e. when 15 blocks are left, just do one more 8-block multiply, all the constants are already available.. I removed all of those eventually. Even then, the assembler code currently is already fairly complex. With the extra pre- and post-processing and if cases, I was struggling to keep up myself. Maybe code cleanup would have helped, so it _is_ possible to bring some of that back in for an extra 10+%? (There is a branch on my fork with that code) I guess that's my long way of saying 'I don't want to complicate the assembler loop'? ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Oct 28 20:39:44 2022 From: duke at openjdk.org (vpaprotsk) Date: Fri, 28 Oct 2022 20:39:44 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v6] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305.
Main optimization is to process 16 message blocks at a time. For more details, I left a lot of comments in `macroAssembler_x86_poly.cpp`.
>
> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java.
> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R == 0 || S == 0) so would like advice please.
> - Added a JMH perf test.
> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider.
>
> Perf before:
>
> Benchmark (dataSize) (provider) Mode Cnt Score Error Units
> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ± 110554.162 ops/s
> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ± 86696.037 ops/s
> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ± 14074.655 ops/s
> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ± 390.921 ops/s
> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ± 1.402 ops/s
>
> and after:
>
> Benchmark (dataSize) (provider) Mode Cnt Score Error Units
> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ± 154528.057 ops/s
> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ± 95253.445 ops/s
> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ± 100847.766 ops/s
> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ± 25883.825 ops/s
> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ± 56.147 ops/s

vpaprotsk has updated the pull request incrementally with one additional commit since the last revision: invalidkeyexception and some review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10582/files - new: https://git.openjdk.org/jdk/pull/10582/files/883be106..78fd8fd7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=04-05 Stats: 33 lines in 7 files changed: 5 ins; 1 del; 27 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Oct 28 21:06:18 2022 From: duke at openjdk.org (vpaprotsk) Date: Fri, 28 Oct 2022 21:06:18 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: On Mon, 24 Oct 2022 23:38:16 GMT, Sandhya Viswanathan wrote: >> vpaprotsk has updated the pull request incrementally with one additional commit since the last revision: >> >> extra whitespace character > src/hotspot/cpu/x86/assembler_x86.cpp line 8306: > >> 8304: assert(dst != xnoreg, "sanity"); >> 8305: InstructionMark im(this); >> 8306: InstructionAttr attributes(vector_len, /* vex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true);
(hpp seemed ok) > src/hotspot/cpu/x86/vm_version_x86.cpp line 925: > >> 923: _features &= ~CPU_AVX512_VBMI2; >> 924: _features &= ~CPU_AVX512_BITALG; >> 925: _features &= ~CPU_AVX512_IFMA; > > This should also be done under is_knights_family(). done ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Oct 28 21:06:19 2022 From: duke at openjdk.org (vpaprotsk) Date: Fri, 28 Oct 2022 21:06:19 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: <4FY4SEodgFcdxFXvGWFJWHYCr1GD4nAktLa5SiyPcxM=.384b2818-b6c5-4523-8682-5b730d9ad036@github.com> References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> <4FY4SEodgFcdxFXvGWFJWHYCr1GD4nAktLa5SiyPcxM=.384b2818-b6c5-4523-8682-5b730d9ad036@github.com> Message-ID: On Wed, 26 Oct 2022 15:47:28 GMT, vpaprotsk wrote: >> src/hotspot/cpu/x86/macroAssembler_x86_poly.cpp line 806: >> >>> 804: evmovdquq(A0, Address(rsp, 64*0), Assembler::AVX_512bit); >>> 805: evmovdquq(A0, Address(rsp, 64*1), Assembler::AVX_512bit); >>> 806: evmovdquq(A0, Address(rsp, 64*2), Assembler::AVX_512bit); >> >> This is load from stack into A0. Did you intend to store A0 (cleanup) into stack local area here? I think the source and destination are mixed up here. > > Wow! 
Thank you for spotting this done ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Oct 28 21:06:21 2022 From: duke at openjdk.org (vpaprotsk) Date: Fri, 28 Oct 2022 21:06:21 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: On Thu, 27 Oct 2022 09:29:52 GMT, Jatin Bhateja wrote: >> vpaprotsk has updated the pull request incrementally with one additional commit since the last revision: >> >> extra whitespace character > > src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2040: > >> 2038: >> 2039: address StubGenerator::generate_poly1305_processBlocks() { >> 2040: __ align64(); > > This can be replaced by __ align(CodeEntryAlignment); done ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Oct 28 21:06:21 2022 From: duke at openjdk.org (vpaprotsk) Date: Fri, 28 Oct 2022 21:06:21 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: <4AB7TAZwydDonBwfxasMLmgVIQuaLgMUxck7eCbzYxw=.a9062602-90d4-4bde-baff-629bea466527@github.com> On Thu, 27 Oct 2022 21:19:06 GMT, Jamil Nimeh wrote: >>> 10% is not a negligible impact. I see your point about AVX512 reaping the rewards of this change, but there are plenty of x86_64 systems without AVX512 that will be impacted, not to mention other platforms like aarch64 which (for this change at least) will never see the benefits from the intrinsic. >>> >>> I don't have any suggestions right at this moment for how this could be streamlined at all to help reduce the pain for non-AVX512 systems. Worth looking into though. 
>> >> Do you suggest using white box APIs for CPU feature query during poly static initialization and perform multi block processing only for relevant platforms and keep the original implementation sacrosanct for other targets. VM does offer native white box primitives and currently its being used by tests infrastructure. > > No, going the WhiteBox route was not something I was thinking of. I sought feedback from a couple hotspot-knowledgable people about the use of WhiteBox APIs and both felt that it was not the right way to go. One said that WhiteBox is really for VM testing and not for these kinds of java classes. One idea I was trying to measure was to make the intrinsic (i.e. the while loop remains exactly the same, just moved to different =non-static= function): private void processMultipleBlocks(byte[] input, int offset, int length) { //, MutableIntegerModuloP A, IntegerModuloP R) { while (length >= BLOCK_LENGTH) { n.setValue(input, offset, BLOCK_LENGTH, (byte)0x01); a.setSum(n); // A += (temp | 0x01) a.setProduct(r); // A = (A * R) % p offset += BLOCK_LENGTH; length -= BLOCK_LENGTH; } } In principle, the java version would not get any slower (i.e. there is only one extra function jump). At the expense of the C++ glue getting more complex. In C++ I need to dig out using IR `(sun.security.util.math.intpoly.IntegerPolynomial.MutableElement)(this.a).limbs` then convert 5*26bit limbs into 3*44-bit limbs. The IR is very new to me so will take some time. (I think I found some AES code that does something similar). That said.. I thought this idea would had been perhaps a separate PR, if needed at all.. Digging limbs out is one thing, but also need to add asserts and safety. Mostly would be happy to just measure if its worth it. 
------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Oct 28 21:06:23 2022 From: duke at openjdk.org (vpaprotsk) Date: Fri, 28 Oct 2022 21:06:23 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: On Fri, 28 Oct 2022 19:46:33 GMT, vpaprotsk wrote: >> src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 175: >> >>> 173: // Choice of 1024 is arbitrary, need enough data blocks to amortize conversion overhead >>> 174: // and not affect platforms without intrinsic support >>> 175: int blockMultipleLength = (len/BLOCK_LENGTH) * BLOCK_LENGTH; >> >> Since Poly processes 16-byte chunks, a strength-reduced version of the above expression could be `len & ~(BLOCK_LENGTH - 1)`. > I guess I've got no issue with either version.. I was mostly thinking about code clarity? I think your version is 'more reliable' so just gonna switch it, thanks. done ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Oct 28 21:06:26 2022 From: duke at openjdk.org (vpaprotsk) Date: Fri, 28 Oct 2022 21:06:26 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: On Wed, 26 Oct 2022 15:27:55 GMT, vpaprotsk wrote: >> src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 296: >> >>> 294: keyBytes[12] &= (byte)252; >>> 295: >>> 296: // This should be enabled, but Poly1305KAT would fail >> I'm on the fence about this change. I have no problem with it in basic terms. If we ever decided to make this a general purpose Mac in JCE then this would definitely be good to do. As of right now, the only consumer is ChaCha20 and it would submit a key through the process in the RFC.
Seems really unlikely to run afoul of these checks, but admittedly not impossible. >> >> I would agree with @sviswa7 that we could examine this in a separate change and we could look at other approaches to getting around the KAT issue, perhaps some package-private based way to disable the check. As long as Poly1305 remains with package-private visibility, one could make another form of the constructor with a boolean that would disable this check, and that is the constructor that the KAT would use. This is just an off-the-cuff idea, but one way we might get the best of both worlds. >> >> If we move this down the road then we should remove the commenting. We can refer back to this PR later. > I think I will remove the check for now, don't want to hold up reviews. I wasn't sure how to 'inject a backdoor' to the commented-out check either, or at least how to do it in an acceptable way. Your ideas do sound plausible, and if anyone does want this check, I can implement one of the ideas (package-private boolean flag? turn it on in the test) while waiting for more reviews to come in. > > The comment about ChaCha being the only way in is also relevant, thanks. i.e. this is a private class today. I flip-flopped on this.. I already had the code for the exception.. and already described the potential fix. So rather than remove the code, I pushed the described fix. It's always easier to remove the extra field I added. Let me know what you think about the 'backdoor' field.
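The package-private 'backdoor' constructor idea being discussed might look roughly like the standalone toy below. The class and method names are invented for illustration and this is not the actual Poly1305.java change; the point is only the shape: checks on by default, with a package-private flag that a same-package KAT can use to disable them.

```java
import java.security.InvalidKeyException;

// Toy model of the discussed pattern: weak-key checks enabled by default,
// with a package-private constructor flag the KAT can use to turn them off.
final class Poly1305Sketch {
    private final boolean checkWeakKey;

    Poly1305Sketch() { this(true); }          // normal callers: checks enabled
    Poly1305Sketch(boolean checkWeakKey) {    // KAT: new Poly1305Sketch(false)
        this.checkWeakKey = checkWeakKey;
    }

    void setKey(byte[] r, byte[] s) throws InvalidKeyException {
        if (checkWeakKey) {
            if (allZero(r)) throw new InvalidKeyException("R is zero");
            if (allZero(s)) throw new InvalidKeyException("S is zero");
        }
        // ... clamp R and continue with key setup ...
    }

    private static boolean allZero(byte[] b) {
        for (byte x : b) {
            if (x != 0) return false;
        }
        return true;
    }
}
```

Since the flag is final and set only at construction, it cannot be flipped from outside the object after the fact.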
------------- PR: https://git.openjdk.org/jdk/pull/10582 From jnimeh at openjdk.org Fri Oct 28 21:58:30 2022 From: jnimeh at openjdk.org (Jamil Nimeh) Date: Fri, 28 Oct 2022 21:58:30 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: <0xJMPRdK0h3UJBYxqeLMfp1baL8xoaUpNcAZOtrFLKo=.d5c1020e-9e61-4800-bb52-9adbdd17e19f@github.com> On Fri, 28 Oct 2022 21:03:32 GMT, vpaprotsk wrote: >> I think I will remove the check for now, don't want to hold up reviews. I wasn't sure how to 'inject a backdoor' to the commented-out check either, or at least how to do it in an acceptable way. Your ideas do sound plausible, and if anyone does want this check, I can implement one of the ideas (package-private boolean flag? turn it on in the test) while waiting for more reviews to come in. >> >> The comment about ChaCha being the only way in is also relevant, thanks. i.e. this is a private class today. > I flip-flopped on this.. I already had the code for the exception.. and already described the potential fix. So rather than remove the code, I pushed the described fix. It's always easier to remove the extra field I added. Let me know what you think about the 'backdoor' field. Well, what you're doing achieves what we're looking for, thanks for making that change. I think I'd like to see that value set on construction and not be mutable from outside the object. Something like this:

- place a `private final boolean checkWeakKey` up near where all the other fields are defined.
- the no-args Poly1305 constructor is implemented as `this(true)`
- an additional constructor is created, `Poly1305(boolean checkKey)`, which sets `checkWeakKey` true or false as provided by the parameter.
- in setRSVals you should be able to wrap lines 296-310 inside a single `if (checkWeakKey)` block.
- in the Poly1305KAT the `new Poly1305()` becomes `new Poly1305(false)`.

------------- PR: https://git.openjdk.org/jdk/pull/10582 From kvn at openjdk.org Fri Oct 28 22:53:35 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 28 Oct 2022 22:53:35 GMT Subject: RFR: 8285835: SIGSEGV in PhaseIdealLoop::build_loop_late_post_work In-Reply-To: References: Message-ID: <7nrkbRbG7mrCoJ2dXQKyfXqjwcr8yDPSGqta5Bu07l8=.3cb0bcfc-0402-43c1-856c-62c31326ea0f@github.com> On Fri, 28 Oct 2022 08:30:56 GMT, Roland Westrelin wrote: >>> Looks reasonable to me. Have you verified that the failure with the replay file and jars from the initial crash doesn't reproduce either? >> >> I had the same question. I commented on the issue. It would be good to understand how the test code relates to the code that fails in Lucene and possibly Ben Manes' Caffeine. Maybe a short explanation of what A, B, C are in Lucene's code and which loop is affected. > >> Looks reasonable to me. Have you verified that the failure with the replay file and jars from the initial crash doesn't reproduce either? > > I tried the replay from the bug and the crash is gone with this fix. Thank you, @rwestrel, for review and additional testing. ------------- PR: https://git.openjdk.org/jdk/pull/10894 From kvn at openjdk.org Sat Oct 29 00:04:34 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 29 Oct 2022 00:04:34 GMT Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v3] In-Reply-To: References: Message-ID: On Mon, 24 Oct 2022 08:13:06 GMT, Yi Yang wrote: >> Hi, can I have a review for this fix? LoadBNode::Ideal crashes after performing GVN right after EA.
The bad IR is as follows: >> >> ![image](https://user-images.githubusercontent.com/5010047/183106710-3a518e5e-0b59-4c3c-aba4-8b6fcade3519.png) >> >> The memory input of Load#971 is Phi#1109 and the address input of Load#971 is an AddP whose object base is CheckCastPP#335. >> >> The type of Phi#1109 is `byte[int:>=0]:exact+any *` while the type of CheckCastPP#335 is `byte[int:8]:NotNull:exact+any *,iid=177` due to EA. They have different alias indexes, which is why we hit the assertion at L226: >> >> https://github.com/openjdk/jdk/blob/b17a745d7f55941f02b0bdde83866aa5d32cce07/src/hotspot/share/opto/memnode.cpp#L207-L226 >> (t is `byte[int:>=0]:exact+any *`, t_adr is `byte[int:8]:NotNull:exact+any *,iid=177`).
StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1109 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> ... >> >> In this case, we get alias index 5 from address input AddP#969, and step it through MergeMem#1046, we found Phi#1109 then, that's why LoadB->in(Mem) is changed from MergeMem#1046 to Phi#1109 (Which finally leads to crash). >> >> 1046 MergeMem === _ 1 160 389 389 1109 1 1 389 1 1 1 1 1 1 1 1 1 1 1 1 1 709 709 709 709 882 888 894 190 190 912 191 [[ 1025 1021 1017 1013 1009 1005 1002 1001 998 996 991 986 981 976 971 966 962 961 960 121 122 123 124 1027 ]] >> >> >> After applying this patch, some related nodes are pushed into the GVN worklist, before stepping through MergeMem#1046, the address input is already changed to AddP#473. i.e., we get alias index 32 from address input AddP#473, and step it through MergeMem#1046, we found StoreB#191 then,LoadB->in(Mem) is changed from MergeMem#1046 to StoreB#191. 
>> >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1046 969 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 1046 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) 
StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 1115 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 
Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 468 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> 971 LoadB === 390 191 473 [[ 972 ]] @byte[int:8]:NotNull:exact+any *,iid=177, idx=32; #byte !jvms: String::coder @ bci:0 (line 4540) String::getBytes @ bci:1 (line 4453) StringConcatHelper::prepend @ bci:21 (line 354) StringConcatHelper::simpleConcat @ bci:81 (line 425) DirectMethodHandle$Holder::invokeStatic @ bci:11 DelegatingMethodHandle$Holder::reinvoke_L @ bci:14 Invokers$Holder::linkToTargetMethod @ bci:6 Test::test @ bci:121 (line 22) >> ... >> >> The well-formed IR looks like this: >> ![image](https://user-images.githubusercontent.com/5010047/183239456-7096ea66-6fca-4c84-8f46-8c42d10b686a.png) >> >> Thanks for your patience. > > Yi Yang has updated the pull request incrementally with two additional commits since the last revision: > > - fix > - always clone the Phi with address type So the issue is that the chain of casts in `MemNode::optimize_memory_chain()` can't convert bottom array type `byte[int:>=0]:exact+any*` to instance array type `byte[int:8]:NotNull:exact+any *,iid=177`. Mostly because it does not adjust the array's parameters. This cast chain simply does not work for arrays. You removed the cast chain. But you lose the check for the case when the types are really not compatible.
`PhiNode::split_out_instance()` does not take into account the current Phi's type when it creates a new Phi based on the address type: `PhiNode *nphi = slice_memory(at);` Instead, I think you need additional casts specifically for arrays (the following code is a simplified example) before comparing the types:

    if (t_opp->isa_aryptr() && t->isa_aryptr() && (t_opp->elem() == t->elem())) {
      t->is_aryptr()->cast_to_size(t_opp->isa_aryptr()->size())->cast_to_stable(t_opp->is_stable());
    }

------------- PR: https://git.openjdk.org/jdk/pull/9777 From vlivanov at openjdk.org Sat Oct 29 00:16:14 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Sat, 29 Oct 2022 00:16:14 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: Message-ID: <50d-zariA__s7xBrD-gRjTxc0rDpWU4xw7_VhqbMjoA=.cb87e513-43c2-4564-9617-f4443625750e@github.com> On Fri, 28 Oct 2022 15:35:38 GMT, Roland Westrelin wrote: >> This change is mostly the same I sent for review 3 years ago but was >> never integrated: >> >> https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2019-May/033803.html >> >> The main difference is that, in the meantime, I submitted a couple of >> refactoring changes extracted from the 2019 patch: >> >> 8266550: C2: mirror TypeOopPtr/TypeInstPtr/TypeAryPtr with TypeKlassPtr/TypeInstKlassPtr/TypeAryKlassPtr >> 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses >> >> As a result, the current patch is much smaller (but still not small). >> >> The implementation is otherwise largely the same as in the 2019 >> patch. I tried to remove some of the code duplication between the >> TypeOopPtr and TypeKlassPtr hierarchies by having some of the logic >> shared in template methods. In the 2019 patch, interfaces were trusted >> when types were constructed and I had added code to drop interfaces >> from a type where they couldn't be trusted.
This new patch proceeds >> the other way around: interfaces are not trusted when a type is >> constructed and code that uses the type must explicitly request that >> they are included (this was suggested as an improvement by Vladimir >> Ivanov I think). > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > build fix Thanks, Roland! Overall, looks very good. Submitted hs-tier1 - hs-tier4 testing. (Earlier, it went through hs-tier1 - hs-tier8 without new failures.) Some minor comments/suggestions follow. src/hotspot/share/ci/ciInstanceKlass.cpp line 735: > 733: GrowableArray* result = NULL; > 734: GUARDED_VM_ENTRY( > 735: InstanceKlass* ik = get_instanceKlass(); Does it make sense to cache the result on `ciInstanceKlass` instance? src/hotspot/share/ci/ciObjectFactory.cpp line 160: > 158: InstanceKlass* ik = vmClasses::name(); \ > 159: ciEnv::_##name = get_metadata(ik)->as_instance_klass(); \ > 160: Array* interfaces = ik->transitive_interfaces(); \ What's the purpose of interface-related part of the code? src/hotspot/share/opto/type.cpp line 572: > 570: > 571: TypeAryPtr::_array_interfaces = new TypePtr::InterfaceSet(); > 572: GrowableArray* array_interfaces = ciArrayKlass::interfaces(); Maybe move the code into a constructor or a factory method? After that, the only user of `TypePtr::InterfaceSet::add()` will be `TypePtr::interfaces()`. It would be nice to make `TypePtr::InterfaceSet` immutable and cache query results (`InterfaceSet::is_loaded() ` and `InterfaceSet::exact_klass()`). src/hotspot/share/opto/type.cpp line 4840: > 4838: } > 4839: interfaces = this_interfaces.intersection_with(tp_interfaces); > 4840: return TypeInstPtr::make(ptr, ciEnv::current()->Object_klass(), interfaces, false, NULL,offset, instance_id, speculative, depth); > NULL,offset missing space src/hotspot/share/opto/type.cpp line 5737: > 5735: // below the centerline when the superclass is exact. 
We need to > 5736: // do the same here. > 5737: if (klass()->equals(ciEnv::current()->Object_klass()) && this_interfaces.intersection_with(tp_interfaces).eq(this_interfaces) && !klass_is_exact()) { > this_interfaces.intersection_with(tp_interfaces).eq(this_interfaces) Maybe a case for a helper method `InterfaceSet::contains(InterfaceSet)`? src/hotspot/share/opto/type.cpp line 5861: > 5859: bool klass_is_exact = ik->is_final(); > 5860: if (!klass_is_exact && > 5861: deps != NULL && UseUniqueSubclasses) { Please, put `UseUniqueSubclasses` guard at the top of the method. src/hotspot/share/opto/type.hpp line 1154: > 1152: // Respects UseUniqueSubclasses. > 1153: // If the klass is final, the resulting type will be exact. > 1154: static const TypeOopPtr* make_from_klass(ciKlass* klass, bool trust_interface = false) { I'd suggest to use an enum (`trust_interfaces`/`ignore_interfaces`) instead of a `bool`, so the intention is clear at call sites. ------------- PR: https://git.openjdk.org/jdk/pull/10901 From vlivanov at openjdk.org Sat Oct 29 00:48:11 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Sat, 29 Oct 2022 00:48:11 GMT Subject: RFR: 8285835: SIGSEGV in PhaseIdealLoop::build_loop_late_post_work In-Reply-To: References: Message-ID: <_pAbhYV9SP1GoBHAawYCCXpQ6ioF68k88oXRmd4MdEo=.91c06003-dc58-452f-a997-ec12713b9788@github.com> On Thu, 27 Oct 2022 23:14:11 GMT, Vladimir Kozlov wrote: > EA does not adjust NSR (not_scalar_replaceable) state for referenced allocations. > In the test case object A is NSR because it merges with NULL object. But this state is not propagated to allocations it references. As result other allocations are marked scalar replaceable and related Load node is moved above guarding condition (where A object is checked for NULL). > EA should propagate NSR state. > > Thanks to @rwestrel who provided reproducer test case. > > Testing tier1-4, xcomp, stress. Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10894 From alanb at openjdk.org Sat Oct 29 08:02:20 2022 From: alanb at openjdk.org (Alan Bateman) Date: Sat, 29 Oct 2022 08:02:20 GMT Subject: RFR: 8295970: Add jdk_vector tests in GHA In-Reply-To: References: Message-ID: <-48VMcPym9w9jDnxkf2UTy7qFEDZwCR4PWgwtlroIiI=.bb3f67f1-e912-4e59-aaa5-25b32c062763@github.com> On Fri, 28 Oct 2022 07:21:05 GMT, Jie Fu wrote: > Good suggestion! > And the `jdk_vector_sanity` test group had been added. In general, running a few fast sanity tests from several areas in tier1 seems a good idea. Having test lists in the TEST.group files isn't very appealing as they easily get out of sync with the tests in the tree. I realise there are already some test lists in both the hotspot and jdk TEST.groups but it feels like it needs something better so that RunTests.gmk/jtreg can select the sanity tests to run. This is not an objection to the change proposed here, just a comment that this type of configuration is annoying to maintain. ------------- PR: https://git.openjdk.org/jdk/pull/10879 From kvn at openjdk.org Sat Oct 29 12:34:39 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 29 Oct 2022 12:34:39 GMT Subject: RFR: 8285835: SIGSEGV in PhaseIdealLoop::build_loop_late_post_work In-Reply-To: References: Message-ID: On Thu, 27 Oct 2022 23:14:11 GMT, Vladimir Kozlov wrote: > EA does not adjust NSR (not_scalar_replaceable) state for referenced allocations. > In the test case object A is NSR because it merges with NULL object. But this state is not propagated to allocations it references. As result other allocations are marked scalar replaceable and related Load node is moved above guarding condition (where A object is checked for NULL). > EA should propagate NSR state. > > Thanks to @rwestrel who provided reproducer test case. > > Testing tier1-4, xcomp, stress. Thank you, Vladimir.
------------- PR: https://git.openjdk.org/jdk/pull/10894 From kvn at openjdk.org Sat Oct 29 12:36:02 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 29 Oct 2022 12:36:02 GMT Subject: Integrated: 8285835: SIGSEGV in PhaseIdealLoop::build_loop_late_post_work In-Reply-To: References: Message-ID: On Thu, 27 Oct 2022 23:14:11 GMT, Vladimir Kozlov wrote: > EA does not adjust NSR (not_scalar_replaceable) state for referenced allocations. > In the test case object A is NSR because it merges with NULL object. But this state is not propagated to allocations it references. As result other allocations are marked scalar replaceable and related Load node is moved above guarding condition (where A object is checked for NULL). > EA should propagate NSR state. > > Thanks to @rwestrel who provided reproducer test case. > > Testing tier1-4, xcomp, stress. This pull request has now been integrated. Changeset: 8aa1526b Author: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/8aa1526b443025b8606a3668262f46a9cb6ea6f6 Stats: 145 lines in 3 files changed: 135 ins; 0 del; 10 mod 8285835: SIGSEGV in PhaseIdealLoop::build_loop_late_post_work Reviewed-by: roland, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/10894 From bulasevich at openjdk.org Sat Oct 29 14:11:46 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Sat, 29 Oct 2022 14:11:46 GMT Subject: RFR: 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes [v5] In-Reply-To: References: Message-ID: On Fri, 28 Oct 2022 14:38:56 GMT, Boris Ulasevich wrote: >> 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > comment update thanks! 
------------- PR: https://git.openjdk.org/jdk/pull/10392 From bulasevich at openjdk.org Sat Oct 29 14:11:46 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Sat, 29 Oct 2022 14:11:46 GMT Subject: Integrated: 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes In-Reply-To: References: Message-ID: On Thu, 22 Sep 2022 14:30:10 GMT, Boris Ulasevich wrote: > 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes This pull request has now been integrated. Changeset: f3ca0cab Author: Boris Ulasevich URL: https://git.openjdk.org/jdk/commit/f3ca0cab75f2faf9ec88f7a380490c9589a27102 Stats: 56 lines in 5 files changed: 28 ins; 20 del; 8 mod 8293999: [JVMCI] need support for aligned constants in generated code larger than 8 bytes Reviewed-by: dlong, dnsimon ------------- PR: https://git.openjdk.org/jdk/pull/10392 From serb at openjdk.org Sat Oct 29 23:12:25 2022 From: serb at openjdk.org (Sergey Bylokhov) Date: Sat, 29 Oct 2022 23:12:25 GMT Subject: RFR: 8295970: Add jdk_vector tests in GHA [v2] In-Reply-To: <8uMsheGwjIBbgf1nJeYkCR4Pt_ddbnq4WVKMRPwS7C0=.c59b10eb-67e1-439c-936e-857c8430c520@github.com> References: <8uMsheGwjIBbgf1nJeYkCR4Pt_ddbnq4WVKMRPwS7C0=.c59b10eb-67e1-439c-936e-857c8430c520@github.com> Message-ID: <1INzCtXMkzWk2GmbTkXaoP1CAeCgNLdvsWtfwBdcES0=.3b35c7d1-32c0-4422-82a9-e38790b3db3a@github.com> On Fri, 28 Oct 2022 07:19:31 GMT, Jie Fu wrote: >> Hi all, >> >> As discussed here https://github.com/openjdk/jdk/pull/10807#pullrequestreview-1150314487 , it would be better to add the vector api tests in GHA. >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains four additional commits since the last revision: > > - Add jdk_vector_sanity test group > - Merge branch 'master' into JDK-8295970 > - Revert changes in test.yml > - 8295970: Add jdk_vector tests in GHA What about the possibility of running an additional group of tests by somehow passing the group name to the GA, via a label, via a /test cmd, or via a parameter to a specific task in the GA? ------------- PR: https://git.openjdk.org/jdk/pull/10879 From jiefu at openjdk.org Sun Oct 30 12:07:25 2022 From: jiefu at openjdk.org (Jie Fu) Date: Sun, 30 Oct 2022 12:07:25 GMT Subject: RFR: 8295970: Add jdk_vector tests in GHA In-Reply-To: <-48VMcPym9w9jDnxkf2UTy7qFEDZwCR4PWgwtlroIiI=.bb3f67f1-e912-4e59-aaa5-25b32c062763@github.com> References: <-48VMcPym9w9jDnxkf2UTy7qFEDZwCR4PWgwtlroIiI=.bb3f67f1-e912-4e59-aaa5-25b32c062763@github.com> Message-ID: On Sat, 29 Oct 2022 07:58:49 GMT, Alan Bateman wrote: > I realise there are already some test lists in both the hotspot and jdk TEST.groups but it feels like it needs something better so that RunTests.gmk/jtreg can select the sanity tests to run. Thanks @AlanBateman for this comment. Is there an existing example to follow? Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10879 From jiefu at openjdk.org Sun Oct 30 12:19:34 2022 From: jiefu at openjdk.org (Jie Fu) Date: Sun, 30 Oct 2022 12:19:34 GMT Subject: RFR: 8295970: Add jdk_vector tests in GHA [v2] In-Reply-To: <1INzCtXMkzWk2GmbTkXaoP1CAeCgNLdvsWtfwBdcES0=.3b35c7d1-32c0-4422-82a9-e38790b3db3a@github.com> References: <8uMsheGwjIBbgf1nJeYkCR4Pt_ddbnq4WVKMRPwS7C0=.c59b10eb-67e1-439c-936e-857c8430c520@github.com> <1INzCtXMkzWk2GmbTkXaoP1CAeCgNLdvsWtfwBdcES0=.3b35c7d1-32c0-4422-82a9-e38790b3db3a@github.com> Message-ID: On Sat, 29 Oct 2022 23:08:38 GMT, Sergey Bylokhov wrote: > What about the possibility of running an additional group of tests by somehow passing the group name to the GA, via a label, via a /test cmd, or via a parameter to a specific task in the GA?
Well, it sounds good to run specific tests via a label or /test cmd. However, if the vector api tests are OK to be added in tier1, I think it's fine to run them in GHA directly. Maybe we can implement this feature in the future, not only for the vector api tests but also for other tests too. ------------- PR: https://git.openjdk.org/jdk/pull/10879 From eastigeevich at openjdk.org Sun Oct 30 17:04:30 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Sun, 30 Oct 2022 17:04:30 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section In-Reply-To: <7xYMYWazVI1VHQK0g9GafFqeH9kA-EWZfsOHhy4cIXs=.8ad5ab4f-6e3a-4b25-92a7-c027a390e018@github.com> References: <7xYMYWazVI1VHQK0g9GafFqeH9kA-EWZfsOHhy4cIXs=.8ad5ab4f-6e3a-4b25-92a7-c027a390e018@github.com> Message-ID: On Tue, 4 Oct 2022 12:58:45 GMT, Boris Ulasevich wrote: >>> > What is the performance impact of making several of the methods virtual? >>> >>> Good question! My experiments show that in the worst case, the performance of the debug write thread is reduced by 424->113 MB/s with virtual functions. Compared to compile time, this is miserable: Compilation takes 1000ms per method, while generation of 300 bytes of scopes data with virtual function (worst case) takes 3ms. And I do not see any regression with benchmarks.
>> I was wondering more about read performance. I would expect that the debuginfo could be read many more times than it is written. Also, from 424 to 113 seems like a very large slowdown. > >> > > What is the performance impact of making several of the methods virtual? >> > >> > >> > Good question! My experiments show that in the worst case, the performance of the debug write thread is reduced by 424->113 MB/s with virtual functions. Compared to compile time, this is miserable: Compilation takes 1000ms per method, while generation of 300 bytes of scopes data with virtual function (worst case) takes 3ms. And I do not see any regression with benchmarks. >> >> I was wondering more about read performance. I would expect that the debuginfo could be read many more times than it is written. Also, from 424 to 113 seems like a very large slowdown. > > Right. With counters in virtual methods, I see that reading debug information is less frequent than writing. Anyway. Let me rewrite code without virtual functions. @bulasevich, Could you please add gtest unit tests checking `CompressedSparseDataWriteStream`/`CompressedSparseDataReadStream`? ------------- PR: https://git.openjdk.org/jdk/pull/10025 From eastigeevich at openjdk.org Sun Oct 30 17:04:30 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Sun, 30 Oct 2022 17:04:30 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: <1FjGNLdpMESjMmA9aKawC6A_ilIx_er8LH8hiWt0q4Q=.e4988cf1-7030-4397-8087-d290297015ad@github.com> On Thu, 20 Oct 2022 12:04:32 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > minor renaming. adding encoding examples table src/hotspot/share/code/compressedStream.hpp line 115: > 113: }; > 114: > 115: class CompressedBitStream : public ResourceObj { Maybe it is better `CompressedSparseData`?
------------- PR: https://git.openjdk.org/jdk/pull/10025 From eastigeevich at openjdk.org Sun Oct 30 17:12:11 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Sun, 30 Oct 2022 17:12:11 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 12:04:32 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > minor renaming. adding encoding examples table src/hotspot/share/code/compressedStream.hpp line 135: > 133: CompressedSparseDataReadStream(u_char* buffer, int position) : CompressedBitStream(buffer, position) {} > 134: > 135: void set_position(int pos) { Are there uses of it? If no, let's remove it. ------------- PR: https://git.openjdk.org/jdk/pull/10025 From eastigeevich at openjdk.org Sun Oct 30 17:31:22 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Sun, 30 Oct 2022 17:31:22 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section In-Reply-To: <7xYMYWazVI1VHQK0g9GafFqeH9kA-EWZfsOHhy4cIXs=.8ad5ab4f-6e3a-4b25-92a7-c027a390e018@github.com> References: <7xYMYWazVI1VHQK0g9GafFqeH9kA-EWZfsOHhy4cIXs=.8ad5ab4f-6e3a-4b25-92a7-c027a390e018@github.com> Message-ID: On Tue, 4 Oct 2022 12:58:45 GMT, Boris Ulasevich wrote: >>> > What is the performance impact of making several of the methods virtual? >>> >>> Good question! My experiments show that in the worst case, the performance of the debug write thread is reduced by 424->113 MB/s with virtual functions. 
Compared to compile time, this is miserable: Compilation takes 1000ms per method, while generation of 300 bytes of scopes data with virtual function (worst case) takes 3ms. And I do not see any regression with benchmarks. >> >> I was wondering more about read performance. I would expect that the debuginfo could be read many more times than it is written. Also, from 424 to 113 seems like a very large slowdown. > > Right. With counters in virtual methods, I see that reading debug information is less frequent than writing. Anyway. Let me rewrite code without virtual functions. @bulasevich, BTW, why do we have a lot of 0s? Is it a normal situation? ------------- PR: https://git.openjdk.org/jdk/pull/10025 From eastigeevich at openjdk.org Sun Oct 30 17:31:24 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Sun, 30 Oct 2022 17:31:24 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 12:04:32 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%.
>> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > minor renaming. adding encoding examples table src/hotspot/share/code/compressedStream.cpp line 117: > 115: > 116: > 117: bool CompressedSparseDataReadStream::read_zero() { If the last value written to a stream was 0, a reader would not know whether this is one 0 or eight 0s. Is there a guarantee that the number of reads will be the same as the number of writes? ------------- PR: https://git.openjdk.org/jdk/pull/10025 From eastigeevich at openjdk.org Sun Oct 30 17:56:14 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Sun, 30 Oct 2022 17:56:14 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 12:04:32 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > minor renaming. adding encoding examples table src/hotspot/share/code/debugInfo.hpp line 298: > 296: // debugging information. Used by ScopeDesc. > 297: > 298: class DebugInfoReadStream : public CompressedSparseDataReadStream { I don't think `DebugInfoReadStream`/`DebugInfoWriteStream` need public inheritance. The relation is more like composition. I would have implemented them like:

    class DebugInfoReadStream : private CompressedSparseDataReadStream {
     public:
      // we are using only needed functions from CompressedSparseDataReadStream.
      using CompressedSparseDataReadStream::buffer;
      using CompressedSparseDataReadStream::read_int;
      using ...
    };

Or:

    template <typename DataReadStream>
    class DebugInfoReadStream {
     public:
      // define only the needed functions, which use a minimum number of functions from DataReadStream
    };

I prefer the templates because we can easily switch between different implementations of `DataReadStream`/`DataWriteStream` without this kind of modification. ------------- PR: https://git.openjdk.org/jdk/pull/10025 From eastigeevich at openjdk.org Sun Oct 30 18:09:54 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Sun, 30 Oct 2022 18:09:54 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 12:04:32 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%.
------------- PR: https://git.openjdk.org/jdk/pull/10025 From chagedorn at openjdk.org Mon Oct 31 07:55:31 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 31 Oct 2022 07:55:31 GMT Subject: RFR: 8294217: Assertion failure: parsing found no loops but there are some In-Reply-To: <8wgYaLn82fk_CgKacAQsEygK63k6KDDKYXf8m4cv_OM=.d04ae5d7-b3f8-47b3-bec8-aaaa4234f018@github.com> References: <8wgYaLn82fk_CgKacAQsEygK63k6KDDKYXf8m4cv_OM=.d04ae5d7-b3f8-47b3-bec8-aaaa4234f018@github.com> Message-ID: On Fri, 28 Oct 2022 14:34:42 GMT, Roland Westrelin wrote: > This was reported on 11 and is not reproducible with the current > jdk. The reason is that the PhaseIdealLoop invocation before EA was > changed from LoopOptsNone to LoopOptsMaxUnroll. In the absence of > loops, LoopOptsMaxUnroll exits earlier than LoopOptsNone. That wasn't > intended and this patch makes sure they behave the same. Once that's > changed, the crash reproduces with the current jdk. > > The assert fires because PhaseIdealLoop::only_has_infinite_loops() > returns false even though the IR only has infinite loops. There's a > single loop nest and the inner most loop is an infinite loop. The > current logic only looks at loops that are direct children of the root > of the loop tree. It's not the first bug where > PhaseIdealLoop::only_has_infinite_loops() fails to catch an infinite > loop (8257574 was the previous one) and it's proving challenging to > have PhaseIdealLoop::only_has_infinite_loops() handle corner cases > robustly. I reworked PhaseIdealLoop::only_has_infinite_loops() once > more. This time it goes over all children of the root of the loop > tree, collects all controls for the loop and its inner loop. It then > checks whether any control is a branch out of the loop and if it is > whether it's not a NeverBranch. That looks reasonable to me. 
src/hotspot/share/opto/loopnode.cpp line 4183: > 4181: > 4182: #ifdef ASSERT > 4183: bool PhaseIdealLoop::only_has_infinite_loops() { > This time it goes over all children of the root of the loop tree, collects all controls for the loop and its inner loop. It then checks whether any control is a branch out of the loop and if it is whether it's not a NeverBranch. Maybe you can add this summary as a comment here. src/hotspot/share/opto/loopnode.cpp line 4209: > 4207: for (uint i = 0; i < wq.size(); ++i) { > 4208: Node* c = wq.at(i); > 4209: if (c->isa_MultiBranch()) { Can be changed to `is_MultiBranch()` as you do not need the casted type. test/hotspot/jtreg/compiler/loopopts/TestInfiniteLoopNest.java line 55: > 53: public static void main(String[] p) throws Exception { > 54: Thread thread = new Thread() { > 55: public void run() { Indentation is wrong here. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10904 From chagedorn at openjdk.org Mon Oct 31 08:26:25 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 31 Oct 2022 08:26:25 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases [v6] In-Reply-To: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> Message-ID: <7eV6OcVY0w8MzR-qUTg4glsxSQ3ig5OkJ8ymwhqvmlc=.716d2c2b-fc10-462f-ae37-14335dafff84@github.com> > This patch extends the IR framework with the capability to perform IR matching not only on the `PrintIdeal` and `PrintOptoAssembly` flag outputs but also on the `PrintIdeal` output of the different compile phases defined by the `COMPILER_PHASES` macro in `phasetype.hpp`: > > https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/opto/phasetype.hpp#L28-L29 > > The IR framework uses the same compile phases with the same names 
(it only drops those which are not part of a normal execution, like `PHASE_DEBUG` or `PHASE_FAILURE`). Matching on the `PrintIdeal` and `PrintOptoAssembly` flags is still possible. The IR framework creates its own `CompilePhase` enum entry for them to simplify the implementation. > > ## How does it work? > > ### Basic idea > There is a new `phase` attribute for `@IR` annotations that allows the user to specify a list of compile phases on which the IR rule should be applied: > > > int iFld; > > @Test > @IR(counts = {IRNode.STORE_I, "1"}, > phase = {CompilePhase.AFTER_PARSING, // Fails > CompilePhase.ITER_GVN1}) // Works > public void optimizeStores() { > iFld = 42; > iFld = 42 + iFld; // Removed in first IGVN iteration and replaced by iFld = 84 > } > > In this example, we apply the IR rule on the compile phases `AFTER_PARSING` and `ITER_GVN1`. Since the first store to `iFld` is only removed in the first IGVN iteration, the IR rule fails for `AFTER_PARSING` while it passes for `ITER_GVN1`: > > 1) Method "public void ir_framework.examples.IRExample.optimizeStores()" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING, ITER_GVN1}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#STORE_I#_", "1"}, failOn={}, applyIfAnd={}, applyIfOr={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\d+(\s){2}(StoreI.*)+(\s){2}===.*)" > - Failed comparison: [found] 2 = 1 [given] > - Matched nodes (2): > * 25 StoreI === 5 7 24 21 [[ 31 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:3 (line 123) > * 31 StoreI === 5 25 24 29 [[ 16 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:14 (line 124) > >
More examples are shown in `IRExample.java` and also in the new test `TestPhaseIRMatching.java`. > > ### CompilePhase.DEFAULT - default compile phase > The existing IR tests either match on the `PrintIdeal` and/or `PrintOptoAssembly` flag. Looking closer at the individual `@IR` rules, we can see that a single regex only matches **either** on `PrintIdeal` **or** on `PrintOptoAssembly` but never on both. Most of these regexes are taken from the class `IRNode` which currently provides default regexes for various C2 IR nodes (a default regex either matches on `PrintIdeal` or `PrintOptoAssembly`). > > Not only the existing IR tests but also the majority of future tests do not need this new flexibility - they simply want to match on the `PrintIdeal` flag by default. To avoid having to specify `phase = CompilePhase.PRINT_IDEAL` for each new rule (and also to avoid updating the large number of existing IR tests), we introduce a new compile phase `CompilePhase.DEFAULT` which is used by default if the user does not specify the `phase` attribute. > > Each entry in class `IRNode` now needs to define a default phase such that the IR framework knows which compile phase an `IRNode` should be matched on if `CompilePhase.DEFAULT` is selected. > > ### Different regexes for the same IRNode entry > A regex for an IR node might look different for certain compile phases. For example, we can directly match the node name `Allocate` before macro expansion for an `Allocate` node. But we are only able to match allocations again on the `PrintOptoAssembly` output (the current option) which requires a different regex. Therefore, the `IRNode` class is redesigned in the following way: > > - `IRNode` entries which currently represent direct regexes are replaced by "IR node placeholder strings". 
These strings are just encodings recognized by the IR framework as IR nodes: > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; // Normal IR node > public static final String ALLOC_OF = COMPOSITE_PREFIX + "ALLOC_OF" + POSTFIX; // Composite IR node > > - For each _IR node placeholder string_, we need to define which regex should be used for which compile phase. Additionally, we need to specify the default compile phase for this IR node. This is done in a static block immediately following the IR node placeholder string where we add a mapping to `IRNode.IR_NODE_MAPPINGS` (by using helper methods such as `allocNodes()`): > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; > static { > String idealIndependentRegex = START + "Allocate" + MID + END; > String optoRegex = "(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java" + END; > allocNodes(ALLOC, idealIndependentRegex, optoRegex); > } > > **Thus, adding a new IR node now requires two steps: Defining an IR node placeholder string and an associated regex-compile-phase-mapping in the static block immediately following the IR node placeholder string. This mapping should reflect in which compile phases the new IR node can be found.** > > ### Using the IRNode entries correctly > The IR framework enforces the correct usage of the new IR node placeholder strings. It reports format violations if: > - An `IRNode` entry is used for a compile phase which is not supported (e.g. trying to match a `LoopNode` on the output of `CompilePhase.AFTER_PARSING`). > - An IR node entry is used without specifying a mapping in a static block (e.g. when creating a new entry and forgetting to set up the mapping). > - A user-defined regex is used in `@IR` without specifying a non-default compile phase in `phase`. `CompilePhase.DEFAULT` only works with IR node placeholder strings. 
In all other cases, the user must explicitly specify one or more non-default compile phases. > > ## General Changes > The patch became larger than I intended it to be. I've tried to split it into smaller patches but that was quite difficult and I eventually gave up. I therefore try to summarize the changes to make reviewing this simpler: > > - Added more packages to better group related classes together. > - Removed unused IGV phases in `phasetype.hpp` and sorted them in the order in which they are used throughout a compilation. > - Updated existing, now failing, IR tests that use user-defined regexes. These now require a non-default phase. I've fixed that by replacing the user-defined regexes with `IRNode` entries (required some new `IRNode` entries for missing IR nodes). > - Introduced a new interface `Matchable.java` (for all classes representing a part on which IR matching can be done such as `IRMethod` or `IRRule`) to use it throughout the code instead of concrete class references. This allows better code sharing together with the already added `MatchResult.java` interface. Using interface types also simplifies testing as everything is substitutable. Each `Matchable`/`MatchResult` object now just contains a list of `Matchable`/`MatchResult` objects (e.g. an `IRMethod`/`IRMethodResult` contains a list of `IRRule`/`IRRuleMatchResult` objects to represent all IR rules/IR rule match results of this method etc.) > - Cleaned up and refactored a lot of code to use this new design. > - Using the visitor pattern to visit the `MatchResult` objects. This simplifies the failure message and compilation output printing of the different compile phases. It allows us to perform specific operations at each level (i.e. on a method level, an IR rule level etc.). Visitors are also used in testing. > - New compile phase related classes in package `phase`. 
The idea is that each `IRRule`/`IRRuleMatchResult` class contains a list of `CompilePhaseIRRule`/`CompilePhaseIRRuleMatchResult` objects for each compile phase. When `CompilePhase.DEFAULT` is used, we create or re-use `CompilePhaseIRRule` objects which represent the specified default phases of each `IRNode` entry. > - The `FlagVM` now collects all the needed compile phases and creates a compiler directives file for the `TestVM` accordingly. We now only print the output which is also used for IR matching. Before this change, we always emitted `PrintIdeal` and `PrintOptoAssembly`. Even when not using the new matching on compile phases, we still get a performance benefit when only using `IRNode` entries which match on either `PrintIdeal` or `PrintOptoAssembly`. > - The failure message and compilation output are sorted alphabetically by method names and by the enum definition order of `CompilePhase`. > - Replaced implementation inheritance by interfaces. > - Improved encapsulation of object data. > - Updated README and many comments/class descriptions to reflect this new feature. > - Added new IR framework tests. > > ## Testing > - Normal tier testing. > - Applying the patch to Valhalla to perform tier testing. > - Testing with a ZGC branch by @robcasloz which does IR matching on mach graph compile phases. This requirement was the original motivation for this RFE. Thanks Roberto for helping to test this! > > Thanks, > Christian Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 85 commits: - Merge branch 'master' into JDK-8280378 - Merge branch 'master' into JDK-8280378 - Fix TestVectorConditionalMove - Merge branch 'master' into JDK-8280378 - Hao's patch to address review comments - Roberto's review comments - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/NonIRTestClass.java Co-authored-by: Roberto Castañeda Lozano - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/constraint/raw/RawConstraint.java Co-authored-by: Roberto Castañeda Lozano - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/phase/CompilePhaseIRRuleBuilder.java Co-authored-by: Roberto Castañeda Lozano - Merge branch 'master' into JDK-8280378 - ... and 75 more: https://git.openjdk.org/jdk/compare/9b9be88b...8d330790 ------------- Changes: https://git.openjdk.org/jdk/pull/10695/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10695&range=05 Stats: 9484 lines in 154 files changed: 7149 ins; 1598 del; 737 mod Patch: https://git.openjdk.org/jdk/pull/10695.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10695/head:pull/10695 PR: https://git.openjdk.org/jdk/pull/10695 From bkilambi at openjdk.org Mon Oct 31 12:04:30 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 31 Oct 2022 12:04:30 GMT Subject: RFR: 8295276: AArch64: Add backend support for half float conversion intrinsics In-Reply-To: <5AM4Pj8V60JHjfIHgbvE8FGx7BAyy2LmGnUkr3GWNMQ=.d138a971-fe0d-491a-887b-07c96fc03008@github.com> References: <5AM4Pj8V60JHjfIHgbvE8FGx7BAyy2LmGnUkr3GWNMQ=.d138a971-fe0d-491a-887b-07c96fc03008@github.com> Message-ID: On Thu, 20 Oct 2022 14:33:33 GMT, Bhavana Kilambi wrote: > This patch adds aarch64 backend support for library intrinsics that implement conversions between half-precision and single-precision floats. 
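[The conversions being intrinsified can be exercised from plain Java - a minimal sketch, assuming a JDK that exposes `Float.floatToFloat16`/`Float.float16ToFloat` (JDK 20 or later); the class name is illustrative, not from the patch:]

```java
// Sketch of the half <-> single precision conversions whose AArch64
// backend support the patch adds. Assumes a JDK exposing
// Float.floatToFloat16/float16ToFloat (JDK 20+).
public class Fp16RoundTrip {
    public static void main(String[] args) {
        // 1.0f encodes as 0x3C00 in IEEE 754 binary16
        short half = Float.floatToFloat16(1.0f);
        System.out.println(Integer.toHexString(half & 0xFFFF)); // 3c00

        // The reverse conversion is exact for any binary16 value
        float back = Float.float16ToFloat(half);
        System.out.println(back); // 1.0

        // Values not representable in binary16 are rounded (to nearest even),
        // so a round trip does not in general return the original float
        short third = Float.floatToFloat16(1.0f / 3.0f);
        System.out.println(Float.float16ToFloat(third) == 1.0f / 3.0f); // false
    }
}
```

[Presumably each intrinsified call boils down to a single AArch64 `fcvt`-style conversion instruction; the Java above only illustrates the semantics being accelerated.]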
> > Ran the following benchmarks to assess the performance with this patch - > > org.openjdk.bench.java.math.Fp16ConversionBenchmark.floatToFloat16 > org.openjdk.bench.java.math.Fp16ConversionBenchmark.float16ToFloat > > The performance (ops/ms) gain with the patch on an ARM NEON machine is shown below - > > > Benchmark Gain > Fp16ConversionBenchmark.float16ToFloat 3.42 > Fp16ConversionBenchmark.floatToFloat16 5.85 Could anyone please take a look at this PR and give their feedback? Thank you. ------------- PR: https://git.openjdk.org/jdk/pull/10796 From smonteith at openjdk.org Mon Oct 31 15:08:33 2022 From: smonteith at openjdk.org (Stuart Monteith) Date: Mon, 31 Oct 2022 15:08:33 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: References: Message-ID: > The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. There are no scalar instructions that perform the equivalent functions on aarch64; instead, the intrinsics can be implemented with vector instructions included in SVE2: expand with BDEP, compress with BEXT. > > Only the first lane of each vector will be used, two MOV instructions will move the inputs from GPRs into temporary vector registers, and another to do the reverse for the result. Autovectorization for this functionality is/will be implemented separately. 
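[For reference, the scalar semantics that BEXT/BDEP accelerate are those documented for `Integer.compress`/`expand` and `Long.compress`/`expand` (available since JDK 19). A small illustration - not code from the patch, and the class name is made up:]

```java
// Illustrates the bit-compress/expand semantics that the SVE2 BEXT/BDEP
// instructions implement in a single operation each.
public class BitShuffleDemo {
    public static void main(String[] args) {
        // compress: gather the bits of i selected by mask down to the low end.
        // i = 0b1101, mask = 0b1010 -> mask selects bits 1 and 3 of i (0 and 1) -> 0b10
        System.out.println(Integer.compress(0b1101, 0b1010)); // 2

        // expand: scatter the low bits of i out to the bit positions set in mask.
        // i = 0b11, mask = 0b1100 -> bits land at positions 2 and 3 -> 0b1100
        System.out.println(Integer.expand(0b11, 0b1100)); // 12

        // With an all-ones mask, both operations are the identity
        System.out.println(Long.compress(0xCAFEL, -1L) == 0xCAFEL); // true
        System.out.println(Long.expand(0xCAFEL, -1L) == 0xCAFEL);   // true
    }
}
```

[Without the intrinsic these methods compile to the long branch-free bit-twiddling sequence mentioned above; with it, each call should reduce to a BEXT or BDEP on the first vector lane.]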
> > Running on an SVE2 enabled system, I ran the following benchmarks: > > org.openjdk.bench.java.lang.Integers > org.openjdk.bench.java.lang.Longs > > The time for each operation was reduced to 56% to 72% of the original run time: > > > Benchmark Result error Unit % against non-SVE2 > Integers.expand 2.106 0.011 us/op > Integers.expand-SVE 1.431 0.009 us/op 67.95% > Longs.expand 2.606 0.006 us/op > Longs.expand-SVE 1.46 0.003 us/op 56.02% > Integers.compress 1.982 0.004 us/op > Integers.compress-SVE 1.427 0.003 us/op 72.00% > Longs.compress 2.501 0.002 us/op > Longs.compress-SVE 1.441 0.003 us/op 57.62% > > > These methods can be specifically tested with: > `make test TEST="jtreg:compiler/intrinsics/TestBitShuffleOpers.java"` Stuart Monteith has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/cpu/aarch64/aarch64.ad Correct slight formatting error. Co-authored-by: Eric Liu ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10537/files - new: https://git.openjdk.org/jdk/pull/10537/files/1a0a9427..8b13dabb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10537&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10537&range=00-01 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10537.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10537/head:pull/10537 PR: https://git.openjdk.org/jdk/pull/10537 From smonteith at openjdk.org Mon Oct 31 15:08:33 2022 From: smonteith at openjdk.org (Stuart Monteith) Date: Mon, 31 Oct 2022 15:08:33 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: References: Message-ID: On Tue, 25 Oct 2022 09:40:25 GMT, Andrew Dinn wrote: >> I suppose the predicate-stmt is not needed here, since the check has already been done in `match_rule_supported()` helper. > > That's a good point. 
x86 rules all appear to omit any checks that appear in `match_rule_supported` (in most cases they have no predicate, in others they have a predicate that includes a further sub-constraint). > > For AArch64 the predicate test in `match_rule_supported` is omitted for `OP_OnSpinWait` but retained for `Op_CacheWB`, `CacheWBPreSync` and `CacheWBPostSync`. We should probably make this consistent by removing the repeat predicates for those last three cases as well. The CompressBits and ExpandBits nodes won't even be emitted, so the predicates would be redundant here - this is a consequence of them being intrinsics - otherwise it would be the Java implementations of the methods. I can deal with the CacheWBs in a separate PR. ------------- PR: https://git.openjdk.org/jdk/pull/10537 From smonteith at openjdk.org Mon Oct 31 15:36:24 2022 From: smonteith at openjdk.org (Stuart Monteith) Date: Mon, 31 Oct 2022 15:36:24 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: References: Message-ID: On Mon, 31 Oct 2022 15:02:56 GMT, Stuart Monteith wrote: >> That's a good point. x86 rules all appear to omit any checks that appear in `match_rule_supported` (in most cases they have no predicate, in others they have a predicate that includes a further sub-constraint). >> >> For AArch64 the predicate test in `match_rule_supported` is omitted for `OP_OnSpinWait` but retained for `Op_CacheWB`, `CacheWBPreSync` and `CacheWBPostSync`. We should probably make this consistent by removing the repeat predicates for those last three cases as well. > > The CompressBits and ExpandBits nodes won't even be emitted, so the predicates would be redundant here - this is a consequence of them being intrinsics - otherwise it would be the Java implementations of the methods. > I can deal with the CacheWBs in a separate PR. I've opened https://bugs.openjdk.org/browse/JDK-8296132 , PR to follow. 
------------- PR: https://git.openjdk.org/jdk/pull/10537 From rkennke at openjdk.org Mon Oct 31 17:40:03 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Mon, 31 Oct 2022 17:40:03 GMT Subject: RFR: 8296136: Use correct register in aarch64_enc_fast_unlock() Message-ID: In aarch64_enc_fast_unlock() (aarch64.ad) we have this piece of code: __ ldr(tmp, Address(oop, oopDesc::mark_offset_in_bytes())); __ tbnz(disp_hdr, exact_log2(markWord::monitor_value), object_has_monitor); The tbnz uses the wrong register - it should really use tmp. disp_hdr has been loaded with the displaced header of the stack-lock, which would never have its monitor bits set, thus the branch will always take the slow path. In this common case, it is only a performance nuisance. In the case of !UseHeavyMonitors it is even worse: disp_hdr will be uninitialized, and we are facing a correctness problem. As far as I can tell, the problem dates back to when the aarch64 C2 parts were added to OpenJDK. Testing: - [x] tier1 - [ ] tier2 - [ ] tier3 - [ ] tier4 ------------- Commit messages: - 8296136: Use correct register in aarch64_enc_fast_unlock() Changes: https://git.openjdk.org/jdk/pull/10921/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10921&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296136 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10921.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10921/head:pull/10921 PR: https://git.openjdk.org/jdk/pull/10921 From aph at openjdk.org Mon Oct 31 17:43:51 2022 From: aph at openjdk.org (Andrew Haley) Date: Mon, 31 Oct 2022 17:43:51 GMT Subject: RFR: 8296136: Use correct register in aarch64_enc_fast_unlock() In-Reply-To: References: Message-ID: <4ppRH10hnrOmOxtiC1RfVpu5i25nvTPbmGP0uorUJkA=.c6734770-d6bd-4f76-be67-62269a5c0d45@github.com> On Mon, 31 Oct 2022 17:31:31 GMT, Roman Kennke wrote: > In aarch64_enc_fast_unlock() (aarch64.ad) we have this piece of code: > > > __ ldr(tmp, 
Address(oop, oopDesc::mark_offset_in_bytes())); > __ tbnz(disp_hdr, exact_log2(markWord::monitor_value), object_has_monitor); > > > The tbnz uses the wrong register - it should really use tmp. disp_hdr has been loaded with the displaced header of the stack-lock, which would never have its monitor bits set, thus the branch will always take the slow path. In this common case, it is only a performance nuisance. In the case of !UseHeavyMonitors it is even worse: disp_hdr will be uninitialized, and we are facing a correctness problem. > > As far as I can tell, the problem dates back to when the aarch64 C2 parts were added to OpenJDK. > > Testing: > - [x] tier1 > - [ ] tier2 > - [ ] tier3 > - [ ] tier4 Ouch! Yes, thanks. I just checked the code against x86, which confirms your analysis. ------------- Marked as reviewed by aph (Reviewer). PR: https://git.openjdk.org/jdk/pull/10921 From adinn at openjdk.org Mon Oct 31 21:38:26 2022 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 31 Oct 2022 21:38:26 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: References: Message-ID: On Mon, 31 Oct 2022 15:08:33 GMT, Stuart Monteith wrote: >> The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. There are no scalar instructions that perform the equivalent functions on aarch64; instead, the intrinsics can be implemented with vector instructions included in SVE2: expand with BDEP, compress with BEXT. >> >> Only the first lane of each vector will be used, two MOV instructions will move the inputs from GPRs into temporary vector registers, and another to do the reverse for the result. Autovectorization for this functionality is/will be implemented separately. 
>> >> Running on an SVE2 enabled system, I ran the following benchmarks: >> >> org.openjdk.bench.java.lang.Integers >> >> org.openjdk.bench.java.lang.Longs >> >> The time for each operation was reduced to 56% to 72% of the original run time: >> >> >> Benchmark Result error Unit % against non-SVE2 >> Integers.expand 2.106 0.011 us/op >> Integers.expand-SVE 1.431 0.009 us/op 67.95% >> Longs.expand 2.606 0.006 us/op >> Longs.expand-SVE 1.46 0.003 us/op 56.02% >> Integers.compress 1.982 0.004 us/op >> Integers.compress-SVE 1.427 0.003 us/op 72.00% >> Longs.compress 2.501 0.002 us/op >> Longs.compress-SVE 1.441 0.003 us/op 57.62% >> >> >> These methods can be specifically tested with: >> `make test TEST="jtreg:compiler/intrinsics/TestBitShuffleOpers.java"` > Stuart Monteith has updated the pull request incrementally with one additional commit since the last revision: > Update src/hotspot/cpu/aarch64/aarch64.ad > Correct slight formatting error. > Co-authored-by: Eric Liu Still good ------------- Marked as reviewed by adinn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10537 From adinn at openjdk.org Mon Oct 31 21:38:27 2022 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 31 Oct 2022 21:38:27 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: References: Message-ID: <9Mm--KETvdp6KlBFW-BIWjDiOc09wUkWD58SMcX_agQ=.812962d4-4b04-4e6c-94e8-c7750a45be68@github.com> On Mon, 31 Oct 2022 15:32:34 GMT, Stuart Monteith wrote: >> The CompressBits and ExpandBits nodes won't even be emitted, so the predicates would be redundant here - this is a consequence of them being intrinsics - otherwise it would be the Java implementations of the methods. >> I can deal with the CacheWBs in a separate PR. > I've opened https://bugs.openjdk.org/browse/JDK-8296132 , PR to follow. Thanks Stuart ------------- PR: https://git.openjdk.org/jdk/pull/10537