From duke at openjdk.org Fri Mar 1 00:27:53 2024 From: duke at openjdk.org (Joshua Cao) Date: Fri, 1 Mar 2024 00:27:53 GMT Subject: RFR: 8323220: Reassociate loop invariants involved in Cmps and Add/Subs [v5] In-Reply-To: References: <0mSC33e8Dm1pwOo_xlx48AwfkB1C9ZNIVqD8UdSW07U=.866a7c2a-59cf-4bab-8bda-dcd8a3f337de@github.com> Message-ID: On Thu, 29 Feb 2024 07:26:52 GMT, Emanuel Peter wrote: > One more concern I just had: do we have tests for the pre-existing Add/Sub reassociations? Not that I know of. A bunch of reassociation was added in https://github.com/openjdk/jdk/commit/23ed3a9e91ac57295d274fefdf6c0a322b1e87b7, which does not have any tests. I ran `make CONF=linux-x86_64-server-fastdebug test TEST=all TEST_VM_OPTS=-XX:-TieredCompilation` on my Linux machine. I have 4 failures in `SctpChannel` and 3 failures in `CAInterop.java`, but they also fail on master branch so they should not be caused by this patch. Hopefully this adds a little more confidence. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17375#issuecomment-1972204471 From duke at openjdk.org Fri Mar 1 05:38:53 2024 From: duke at openjdk.org (Joshua Cao) Date: Fri, 1 Mar 2024 05:38:53 GMT Subject: Integrated: 8324790: ifnode::fold_compares_helper cleanup In-Reply-To: References: Message-ID: On Fri, 26 Jan 2024 23:31:00 GMT, Joshua Cao wrote: > I hope my assumptions in `filtered_int_type` are correct here: > > * we assert that `if_proj` is an `IfTrue` or `IfFalse`, so it is safe to assume `if_proj->_in` is an `IfNode` > * the 1'th input of a CmpNode is a BoolNode > * The 1'th input of an IfNode is **not always a BoolNode**, it can be a constant. We need to leave this check in. > > We also remove some of the if-checks in `compare_folds_cleanup` which seem unnecessary. > > Passes tier1 locally. This pull request has now been integrated.
Changeset: 12404a5e Author: Joshua Cao Committer: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/12404a5efb3c45f72f54fda3238c72d5d15a30ee Stats: 64 lines in 1 file changed: 21 ins; 27 del; 16 mod 8324790: ifnode::fold_compares_helper cleanup Reviewed-by: chagedorn, epeter ------------- PR: https://git.openjdk.org/jdk/pull/17601 From epeter at openjdk.org Fri Mar 1 05:43:53 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 05:43:53 GMT Subject: RFR: 8323220: Reassociate loop invariants involved in Cmps and Add/Subs [v5] In-Reply-To: References: <0mSC33e8Dm1pwOo_xlx48AwfkB1C9ZNIVqD8UdSW07U=.866a7c2a-59cf-4bab-8bda-dcd8a3f337de@github.com> Message-ID: On Fri, 1 Mar 2024 00:24:45 GMT, Joshua Cao wrote: >> @caojoshua it looks really good now! >> I'm running our internal testing again, will report back. >> >> One more concern I just had: do we have tests for the pre-existing Add/Sub reassociations? Because you now touched the logic around there we should make sure there are at least correctness tests. IR tests are probably basically impossible because it is the same number of Add/Sub nodes before and after the optimization. >> >> I'd like to have another Reviewer look over this as well, therefore: > >> One more concern I just had: do we have tests for the pre-existing Add/Sub reassociations? > > Not that I know of. A bunch of reassociation was added in https://github.com/openjdk/jdk/commit/23ed3a9e91ac57295d274fefdf6c0a322b1e87b7, which does not have any tests. > > I ran `make CONF=linux-x86_64-server-fastdebug test TEST=all TEST_VM_OPTS=-XX:-TieredCompilation` on my Linux machine. I have 4 failures in `SctpChannel` and 3 failures in `CAInterop.java`, but they also fail on master branch so they should not be caused by this patch. Hopefully this adds a little more confidence. @caojoshua I also ran our internal testing and it looks ok (only unrelated failures). 
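A result-verification test for the pre-existing Add/Sub reassociations could start from a sketch like the following (plain Java, illustrative only; the class name and expression shapes are invented here, not the test later added to the PR, and a real jtreg test would additionally pin flags such as `-XX:-TieredCompilation` and compare against known-good values or an interpreter-mode run):

```java
// Hedged sketch of a result-verification test for loop-invariant Add/Sub
// reassociation. Each helper uses a shape the optimization may rewrite
// (loop-invariant operands inv1/inv2 mixed with the loop-varying x).
public class ReassociateVerify {
    static int addAdd(int x, int inv1, int inv2) { return inv1 + (x + inv2); }
    static int subAdd(int x, int inv1, int inv2) { return inv1 - (x + inv2); }
    static int subSub(int x, int inv1, int inv2) { return (x - inv2) - inv1; }

    // Run the expressions inside a counted loop so C2 sees inv1/inv2 as
    // loop invariants and can reassociate them out of the loop.
    public static long run(int inv1, int inv2) {
        long checksum = 0;
        for (int x = -1000; x < 1000; x++) {
            checksum += addAdd(x, inv1, inv2);
            checksum += subAdd(x, inv1, inv2);
            checksum += subSub(x, inv1, inv2);
        }
        return checksum;
    }

    public static void main(String[] args) {
        // Per-iteration sum is (x+18) + (-x-4) + (x-18) = x - 4, so over
        // x in [-1000, 1000) the checksum is -1000 - 4*2000 = -9000.
        long actual = run(7, 11);
        if (actual != -9000L) {
            throw new AssertionError("reassociation changed the result: " + actual);
        }
        System.out.println("reassociation results match");
    }
}
```

The point of such a test is only correctness, not IR shape: as noted above, an IR test is hard because the node counts do not change.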
But of course that is only on tests that we have, and if the other reassociations are not tested, then that helps little ;) > Not that I know of. A bunch of reassociation was added in https://github.com/openjdk/jdk/commit/23ed3a9e91ac57295d274fefdf6c0a322b1e87b7, which does not have any tests. Could you please add a result verification test per case of pre-existing reassociation? Otherwise I'm afraid it is hard to be sure you did not break those cases. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17375#issuecomment-1972548903 From duke at openjdk.org Fri Mar 1 05:51:02 2024 From: duke at openjdk.org (kuaiwei) Date: Fri, 1 Mar 2024 05:51:02 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 Message-ID: Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. ------------- Commit messages: - 8326983: Unused operands reported after JDK-8326135 Changes: https://git.openjdk.org/jdk/pull/18075/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18075&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8326983 Stats: 553 lines in 2 files changed: 0 ins; 553 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18075/head:pull/18075 PR: https://git.openjdk.org/jdk/pull/18075 From jbhateja at openjdk.org Fri Mar 1 06:01:45 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 1 Mar 2024 06:01:45 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> Message-ID: On Tue, 27 Feb 2024 21:13:07 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. 
>> >> This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) >> >> This PR shows upto 19x speedup on buffer sizes of 1MB. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > Update description of Poly1305 algo src/hotspot/cpu/x86/assembler_x86.cpp line 5146: > 5144: > 5145: void Assembler::vpmadd52luq(XMMRegister dst, XMMRegister src1, Address src2, int vector_len) { > 5146: assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); What if vector length is 128 bit and target does not support AVX_IFMA ? AVX512_IFMA + AVX512_VL should still be sufficient to execute 52 bit MACs. src/hotspot/cpu/x86/assembler_x86.cpp line 5181: > 5179: > 5180: void Assembler::vpmadd52huq(XMMRegister dst, XMMRegister src1, Address src2, int vector_len) { > 5181: assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); What if vector length is 128 bit and target does not support AVX_IFMA ? AVX512_IFMA + AVX512_VL should still be sufficient to execute 52 bit MACs.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508515255 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508514777 From jbhateja at openjdk.org Fri Mar 1 06:07:56 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 1 Mar 2024 06:07:56 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> Message-ID: On Tue, 27 Feb 2024 21:13:07 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. >> >> This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) >> >> This PR shows upto 19x speedup on buffer sizes of 1MB. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > Update description of Poly1305 algo src/hotspot/cpu/x86/assembler_x86.cpp line 5156: > 5154: > 5155: void Assembler::vpmadd52luq(XMMRegister dst, XMMRegister src1, XMMRegister src2, int vector_len) { > 5156: assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); What if vector length is 128 bit and target does not support AVX_IFMA ? AVX512_IFMA + AVX512_VL should still be sufficient to execute 52 bit MACs. Please add appropriate assertions to explicitly check AVX512VL. src/hotspot/cpu/x86/assembler_x86.cpp line 5191: > 5189: > 5190: void Assembler::vpmadd52huq(XMMRegister dst, XMMRegister src1, XMMRegister src2, int vector_len) { > 5191: assert(vector_len == AVX_512bit ?
VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); Same as above. src/hotspot/cpu/x86/assembler_x86.cpp line 9101: > 9099: > 9100: void Assembler::vpunpckhqdq(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) { > 9101: assert(UseAVX > 0, "requires some form of AVX"); Add appropriate AVX512VL assertion. src/hotspot/cpu/x86/assembler_x86.cpp line 9115: > 9113: > 9114: void Assembler::vpunpcklqdq(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) { > 9115: assert(UseAVX > 0, "requires some form of AVX"); Add appropriate AVX512VL assertion ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508516820 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508518721 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508517680 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508517933 From jbhateja at openjdk.org Fri Mar 1 06:11:56 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 1 Mar 2024 06:11:56 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> Message-ID: On Tue, 27 Feb 2024 21:13:07 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. >> >> This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) >> >> This PR shows upto 19x speedup on buffer sizes of 1MB. 
> > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > Update description of Poly1305 algo Changes requested by jbhateja (Reviewer). src/hotspot/cpu/x86/assembler_x86.cpp line 9115: > 9113: > 9114: void Assembler::vpunpcklqdq(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) { > 9115: assert(UseAVX > 0, "requires some form of AVX"); Add appropriate AVX512VL assertion ------------- PR Review: https://git.openjdk.org/jdk/pull/17881#pullrequestreview-1910377861 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508520924 From duke at openjdk.org Fri Mar 1 07:39:16 2024 From: duke at openjdk.org (kuaiwei) Date: Fri, 1 Mar 2024 07:39:16 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: References: Message-ID: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> > Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. 
kuaiwei has updated the pull request incrementally with one additional commit since the last revision: clean for other architecture ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18075/files - new: https://git.openjdk.org/jdk/pull/18075/files/3efe6bb8..faa8f949 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18075&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18075&range=00-01 Stats: 403 lines in 7 files changed: 1 ins; 401 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18075/head:pull/18075 PR: https://git.openjdk.org/jdk/pull/18075 From jbhateja at openjdk.org Fri Mar 1 08:23:54 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 1 Mar 2024 08:23:54 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> Message-ID: On Tue, 27 Feb 2024 21:13:07 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. >> >> This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) >> >> This PR shows upto 19x speedup on buffer sizes of 1MB. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > Update description of Poly1305 algo Hi @vamsi-parasa , apart from above assertion check modifications, patch looks good to me. src/hotspot/cpu/x86/assembler_x86.cpp line 5148: > 5146: assert(vector_len == AVX_512bit ? 
VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); > 5147: InstructionMark im(this); > 5148: InstructionAttr attributes(vector_len, /* rex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true); uses_vl should be false here. src/hotspot/cpu/x86/assembler_x86.cpp line 5157: > 5155: void Assembler::vpmadd52luq(XMMRegister dst, XMMRegister src1, XMMRegister src2, int vector_len) { > 5156: assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); > 5157: InstructionAttr attributes(vector_len, /* rex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true); uses_vl should be false. src/hotspot/cpu/x86/assembler_x86.cpp line 5183: > 5181: assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); > 5182: InstructionMark im(this); > 5183: InstructionAttr attributes(vector_len, /* rex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true); uses_vl should be false. ------------- Changes requested by jbhateja (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17881#pullrequestreview-1910555763 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508637115 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508637945 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508638146 From epeter at openjdk.org Fri Mar 1 12:43:55 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 12:43:55 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> Message-ID: On Tue, 27 Feb 2024 18:23:41 GMT, Jasmine Karthikeyan wrote: >> @jaskarth >>> I've designed this benchmark >> >> Nice. Can you also post the generated assembly for Baseline/Patch? 
>> I'm just worried that there is some method call, or something else that does not get cleanly inlined and could mess with the benchmark. > > @eme64 Sure, here is the assembly for the baseline: https://gist.github.com/jaskarth/1fe6f00a5b37fe3efb0dd6a2d24840e0 > And after: https://gist.github.com/jaskarth/99c56e2f081f996987b96d7e866aca6c > > I must have missed this originally when evaluating the benchmark, but looking at the assembly it seems like the baseline JDK creates a `CMove` for that ternary already. I made a quick patch to disable where `PhaseIdealLoop::conditional_move` is called, and the performance still stays the same on the benchmark. I've also attached that assembly if it's of interest: https://gist.github.com/jaskarth/7b12b688f82a3b8e854785f1827b0c20 @jaskarth Thanks for trying such a benchmark! I have a few ideas and questions now. 1. I would like to see a benchmark where you get a regression with your patch if you removed the `PROB_UNLIKELY_MAG(2);` check, or at least make it much smaller. I would like to see if there is some breaking-point where branch prediction is actually faster. 2. You seem to have discovered that your last example was already converted to CMove. What cases does your code cover that are not already covered by the `PhaseIdealLoop::conditional_move` logic? 3. I think you want some code on the `a` path that does not require inlining, just some arithmetic. The longer the chain the better, as it creates large latency. But then you also want something after the if/min/max which has a high latency, so that branch speculation can actually make progress on something, whereas max/min would have to wait until it is finished computing. I actually have a **regression case for the current CMove logic**, but it **would apply to your logic in some way I think as well**. See my `testCostDifference` below. Clean master: `IfMinMax.testCostDifference avgt 15 889118.284 ?
10638.421 ns/op` When I disable `PhaseIdealLoop::conditional_move`, without your patch: `IfMinMax.testCostDifference avgt 15 710629.583 ? 3232.237 ns/op` Your patch, with `PhaseIdealLoop::conditional_move` disabled: `IfMinMax.testCostDifference avgt 15 886518.663 ? 1801.308 ns/op` I think that the CMove logic kicks in for most loops, though maybe not all cases? Would be interesting to know which of your cases were already done by CMove, and which not. And why. So I suspect you could now take my benchmark, and convert it into non-loop code, and then CMove would not kick in, but your conversion to Max/Min would apply instead. And then you could observe the same regression. Let me know what you think. Not sure if this regression is important enough, but we need to consider what to do about your patch, as well as the CMove logic that already exists. @Benchmark public void testCostDifference(Blackhole blackhole, BenchState state) { //int hits = 0; int x = 0xf0f0f0f0; // maybe instead use a random source that is different with every method call? 
for (int i = 0; i < 10_000; i++) { int a = (x ^ 0xffffffff) & 0x07ffffff; // cheap (note: mask affects probability) int h = x; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; int b = (h & 0x7fffffff); // expensive hashing sequence (node: mask affects probability) int m = (a < b) ? a : b; // The Min/Max //hits += (a > b) ? 1 : 0; //System.out.println("i: " + i + " hits: " + hits + " m: " + m + " a: " + a + " b: " + b); // Note: the hit probability can be adjusted by changing the masks // adding or removing the most significant bit has a change of // about a factor of 2. // The hashing sequences are there to be expensive, and to randomize the values a bit. 
h = m; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; h = (h << 6) + (h << 16) + (h >>> 18) - h; x = h; // another expensive hashing sequence } //System.out.println(10_000 / hits); blackhole.consume(x); } ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1973121951 From jpai at openjdk.org Fri Mar 1 12:56:11 2024 From: jpai at openjdk.org (Jaikiran Pai) Date: Fri, 1 Mar 2024 12:56:11 GMT Subject: RFR: 8327108: compiler.lib.ir_framework.shared.TestFrameworkSocket should listen on loopback address only Message-ID: Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327108? As noted in the JBS issue, before this proposed change, the internal test framework code in `compiler.lib.ir_framework.shared.TestFrameworkSocket` was binding a `java.net.ServerSocket` to "any address". This can lead to interference from other hosts on the network, when the tests are run. The change here proposes to bind this `ServerSocket` to loopback address and reduce the chances of such interference. Originally, the interference issues were noticed in CI when `tier3` was run. 
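The loopback-binding change itself is conceptually small; it is the difference between the two constructions below (a hedged sketch, not the actual patch, which also deals with the framework's port handling):

```java
import java.net.InetAddress;
import java.net.ServerSocket;

public class LoopbackSocketSketch {
    public static void main(String[] args) throws Exception {
        // Before: new ServerSocket(port) binds to the wildcard address,
        // so any host on the network can connect and interfere.
        try (ServerSocket any = new ServerSocket(0)) {
            System.out.println("wildcard bound: " + any.getInetAddress());
        }
        // After: bind explicitly to the loopback address (backlog 0 means
        // the implementation default); only local clients can connect.
        try (ServerSocket loop = new ServerSocket(0, 0, InetAddress.getLoopbackAddress())) {
            System.out.println("loopback bound: " + loop.getInetAddress().isLoopbackAddress()); // true
        }
    }
}
```

Port 0 asks the OS for an ephemeral port, which is what a test framework typically wants anyway.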
With the change proposed in this PR, I've run `tier1`, `tier2` and `tier3` in our CI environment and they all passed. ------------- Commit messages: - 8327108: compiler.lib.ir_framework.shared.TestFrameworkSocket should listen on loopback address only Changes: https://git.openjdk.org/jdk/pull/18078/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18078&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327108 Stats: 7 lines in 1 file changed: 3 ins; 1 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18078.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18078/head:pull/18078 PR: https://git.openjdk.org/jdk/pull/18078 From epeter at openjdk.org Fri Mar 1 13:03:55 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 13:03:55 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> Message-ID: On Tue, 27 Feb 2024 18:23:41 GMT, Jasmine Karthikeyan wrote: >> @jaskarth >>> I've designed this benchmark >> >> Nice. Can you also post the generated assembly for Baseline/Patch? >> I'm just worried that there is some method call, or something else that does not get cleanly inlined and could mess with the benchmark. > > @eme64 Sure, here is the assembly for the baseline: https://gist.github.com/jaskarth/1fe6f00a5b37fe3efb0dd6a2d24840e0 > And after: https://gist.github.com/jaskarth/99c56e2f081f996987b96d7e866aca6c > > I must have missed this originally when evaluating the benchmark, but looking at the assembly it seems like the baseline JDK creates a `CMove` for that ternary already. I made a quick patch to disable where `PhaseIdealLoop::conditional_move` is called, and the performance still stays the same on the benchmark. 
I've also attached that assembly if it's of interest: https://gist.github.com/jaskarth/7b12b688f82a3b8e854785f1827b0c20 @jaskarth The case of Min/Max style if-statements is that both the if and else branch are actually empty, since both values are computed before the if. That is why our `PhaseIdealLoop::conditional_move` will always say that it is profitable: it thinks there is zero cost in the if/else branch, so there is basically no cost. So this kind of cost-modeling based on the if/else blocks is really insufficient. Rather, you would have to know how much cost is behind the two inputs to the cmp. As we see in my example, the cost of `b` can basically be hidden by the branch predictor (at least a part of it). But a CMove/Min/Max has to pay the full cost of `b` before it can continue afterwards. @jaskarth My example is extreme. Feel free to play with my example, and make the `b` part and the "post" part smaller. Maybe there is a regression case that is less extreme. If we could show that only the really extreme examples lead to regressions, then maybe we are willing to bite the bullet on those regressions for the benefit of speedups in other cases. @jaskarth One more general issue: So far you have only shown that your optimization leads to speedups in conjunction with auto-vectorization. Do you have any examples which get speedups without auto-vectorization? The thing is: I do hope to do if-conversion in auto-vectorization. Hence, it would be nice to know that your optimization has benefits in cases where if-conversion does not apply.
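The two forms under discussion can be written side by side as a minimal sketch (illustrative only; it demonstrates that the branchy ternary and the branchless `Math.min` compute the same values, while any timing comparison of branch prediction versus min/cmov would need a proper JMH harness like the `IfMinMax` benchmark above):

```java
import java.util.Random;

public class MinCostSketch {
    // Branchy form: can benefit from branch prediction when one side
    // usually wins, and speculation can proceed past the compare.
    static int branchyMin(int a, int b) { return (a < b) ? a : b; }

    // Branchless form: compiles to min/cmov-style code and must wait
    // for both inputs before anything downstream can retire.
    static int branchlessMin(int a, int b) { return Math.min(a, b); }

    public static void main(String[] args) {
        Random r = new Random(42);
        for (int i = 0; i < 100_000; i++) {
            int a = r.nextInt();
            int b = r.nextInt();
            if (branchyMin(a, b) != branchlessMin(a, b)) {
                throw new AssertionError("mismatch for " + a + ", " + b);
            }
        }
        System.out.println("both forms agree");
    }
}
```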
------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1973149078 PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1973153118 PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1973156599 From epeter at openjdk.org Fri Mar 1 13:06:56 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 13:06:56 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> Message-ID: <-VecOge93qSNd6pqheFmyoBhkI0_Kkf8A2HN0aiZbqU=.4c6f56f8-0136-44ee-a880-000e826a0642@github.com> On Tue, 27 Feb 2024 18:23:41 GMT, Jasmine Karthikeyan wrote: >> @jaskarth >>> I've designed this benchmark >> >> Nice. Can you also post the generated assembly for Baseline/Patch? >> I'm just worried that there is some method call, or something else that does not get cleanly inlined and could mess with the benchmark. > > @eme64 Sure, here is the assembly for the baseline: https://gist.github.com/jaskarth/1fe6f00a5b37fe3efb0dd6a2d24840e0 > And after: https://gist.github.com/jaskarth/99c56e2f081f996987b96d7e866aca6c > > I must have missed this originally when evaluating the benchmark, but looking at the assembly it seems like the baseline JDK creates a `CMove` for that ternary already. I made a quick patch to disable where `PhaseIdealLoop::conditional_move` is called, and the performance still stays the same on the benchmark. I've also attached that assembly if it's of interest: https://gist.github.com/jaskarth/7b12b688f82a3b8e854785f1827b0c20 @jaskarth now there are some platforms that have horrible branch predictors. On those the cost model would probably favor CMove and Min/Max in more cases. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1973162293 From chagedorn at openjdk.org Fri Mar 1 13:08:52 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 1 Mar 2024 13:08:52 GMT Subject: RFR: 8327108: compiler.lib.ir_framework.shared.TestFrameworkSocket should listen on loopback address only In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 12:50:58 GMT, Jaikiran Pai wrote: > Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327108? > > As noted in the JBS issue, before this proposed change, the internal test framework code in `compiler.lib.ir_framework.shared.TestFrameworkSocket` was binding a `java.net.ServerSocket` to "any address". This can lead to interference from other hosts on the network, when the tests are run. The change here proposes to bind this `ServerSocket` to loopback address and reduce the chances of such interference. > > Originally, the interference issues were noticed in CI when `tier3` was run. With the change proposed in this PR, I've run `tier1`, `tier2` and `tier3` in our CI environment and they all passed. That looks reasonable, thanks for fixing this! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18078#pullrequestreview-1911090301 From jpai at openjdk.org Fri Mar 1 13:12:01 2024 From: jpai at openjdk.org (Jaikiran Pai) Date: Fri, 1 Mar 2024 13:12:01 GMT Subject: RFR: 8327105: compiler.compilercontrol.share.scenario.Executor should listen on loopback address only Message-ID: Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327105? The commit here changes the internal test class `compiler.compilercontrol.share.scenario.Executor` to bind to a loopback address to prevent other hosts on the network from unexpectedly communicating on the `ServerSocket`.
The original interference was noticed in some `tier7` tests which use this `Executor` class. With the change proposed in this PR, `tier1`, `tier2`, `tier3` and `tier7`, `tier8` have been run and that issue hasn't been noticed in this class anymore. ------------- Commit messages: - 8327105: compiler.compilercontrol.share.scenario.Executor should listen on loopback address only Changes: https://git.openjdk.org/jdk/pull/18079/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18079&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327105 Stats: 7 lines in 2 files changed: 3 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/18079.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18079/head:pull/18079 PR: https://git.openjdk.org/jdk/pull/18079 From epeter at openjdk.org Fri Mar 1 13:12:45 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 13:12:45 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> Message-ID: On Tue, 27 Feb 2024 18:23:41 GMT, Jasmine Karthikeyan wrote: >> @jaskarth >>> I've designed this benchmark >> >> Nice. Can you also post the generated assembly for Baseline/Patch? >> I'm just worried that there is some method call, or something else that does not get cleanly inlined and could mess with the benchmark. > > @eme64 Sure, here is the assembly for the baseline: https://gist.github.com/jaskarth/1fe6f00a5b37fe3efb0dd6a2d24840e0 > And after: https://gist.github.com/jaskarth/99c56e2f081f996987b96d7e866aca6c > > I must have missed this originally when evaluating the benchmark, but looking at the assembly it seems like the baseline JDK creates a `CMove` for that ternary already. I made a quick patch to disable where `PhaseIdealLoop::conditional_move` is called, and the performance still stays the same on the benchmark. 
I've also attached that assembly if it's of interest: https://gist.github.com/jaskarth/7b12b688f82a3b8e854785f1827b0c20 @jaskarth When I ran `make test TEST="micro:IfMinMax" CONF=linux-x64 MICRO="OPTIONS=-prof perfasm"` and checked the generated assembly, I did not find any vector instructions. Could it be that `SIZE=300` is too small? I generally use vector sizes in the range of `10_000`, just to make sure it vectorizes. Maybe it is because I have a avx512 machine with 64byte registers, compared to 32byte registers for AVX2? Not sure. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1973170971 From chagedorn at openjdk.org Fri Mar 1 13:33:59 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 1 Mar 2024 13:33:59 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class Message-ID: In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. #### Redo refactoring of `create_bool_from_template_assertion_predicate()` On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). #### Share data graph cloning code - start from existing code This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. 
`clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: 1. Collect data nodes to clone by using a node filter 2. Clone the collected nodes (their data and control inputs still point to the old nodes) 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. #### Shared data graph cloning class Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] `DataNodeGraph` can then later be reused in JDK-8327110 and JDK-8327111 to refactor `create_bool_from_template_assertion_predicate()`. Thanks to @eme64 for the comments in https://github.com/openjdk/jdk/pull/16877 and the joint effort to find a reproducer of the existing bug which was the main motivation to redo the refactoring. 
Thanks, Christian ------------- Commit messages: - 8327109: Refactor data graph cloning used for in create_new_if_for_predicate() into separate class Changes: https://git.openjdk.org/jdk/pull/18080/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327109 Stats: 135 lines in 3 files changed: 71 ins; 32 del; 32 mod Patch: https://git.openjdk.org/jdk/pull/18080.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18080/head:pull/18080 PR: https://git.openjdk.org/jdk/pull/18080 From chagedorn at openjdk.org Fri Mar 1 13:34:00 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 1 Mar 2024 13:34:00 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 13:27:38 GMT, Christian Hagedorn wrote: > In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. > > #### Redo refactoring of `create_bool_from_template_assertion_predicate()` > On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). > > #### Share data graph cloning code - start from existing code > This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. 
`clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: > > 1. Collect data nodes to clone by using a node filter > 2. Clone the collected nodes (their data and control inputs still point to the old nodes) > 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. > > #### Shared data graph cloning class > Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: > > 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] > 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] > > `... src/hotspot/share/opto/loopPredicate.cpp line 220: > 218: void PhaseIdealLoop::set_ctrl_of_nodes_with_same_ctrl(Node* start_node, ProjNode* old_uncommon_proj, > 219: Node* new_uncommon_proj) { > 220: ResourceMark rm; Added `ResourceMark`s which I think is now safe after [JDK-8325672](https://bugs.openjdk.org/browse/JDK-8325672). 
src/hotspot/share/opto/loopPredicate.cpp line 250: > 248: DEBUG_ONLY(uint last_idx = C->unique();) > 249: Unique_Node_List nodes_with_same_ctrl = find_nodes_with_same_ctrl(node, old_ctrl); > 250: Dict old_new_mapping = clone_nodes(nodes_with_same_ctrl); // Cloned but not rewired, yet Replaced `Dict` with `ResizeableResourceHashtable` which I think is preferable to use. src/hotspot/share/opto/loopnode.hpp line 1353: > 1351: void fix_cloned_data_node_controls( > 1352: const ProjNode* old_uncommon_proj, Node* new_uncommon_proj, > 1353: const ResizeableResourceHashtable& orig_to_new); Mostly some renaming and adding `const`. src/hotspot/share/opto/loopnode.hpp line 1899: > 1897: _data_nodes(data_nodes), > 1898: // Use 107 as best guess which is the first resize value in ResizeableResourceHashtable::large_table_sizes. > 1899: _orig_to_new(107, MaxNodeLimit) I'm not sure if this is the right default value - was just a best guess. We usually only have a small number of data nodes to copy. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509020430 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509011640 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509016113 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509014377 From chagedorn at openjdk.org Fri Mar 1 13:35:52 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 1 Mar 2024 13:35:52 GMT Subject: RFR: 8327105: compiler.compilercontrol.share.scenario.Executor should listen on loopback address only In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 13:06:18 GMT, Jaikiran Pai wrote: > Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327105? 
> > The commit here changes the internal test class `compiler.compilercontrol.share.scenario.Executor` to bind to a loopback address to prevent other hosts on the network from unexpectedly communicating on the `ServerSocket`. > > The original interference was noticed in some `tier7` tests which use this `Executor` class. With the change proposed in this PR, `tier1`, `tier2`, `tier3` and `tier7`, `tier8` have been run and that issue hasn't been noticed in this class anymore. That looks reasonable. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18079#pullrequestreview-1911194233 From epeter at openjdk.org Fri Mar 1 14:14:56 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 14:14:56 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 13:47:40 GMT, Emanuel Peter wrote: >> In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. >> >> #### Redo refactoring of `create_bool_from_template_assertion_predicate()` >> On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111).
>> >> #### Share data graph cloning code - start from existing code >> This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: >> >> 1. Collect data nodes to clone by using a node filter >> 2. Clone the collected nodes (their data and control inputs still point to the old nodes) >> 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. >> >> #### Shared data graph cloning class >> Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: >> >> 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] >> 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to ... > > src/hotspot/share/opto/loopPredicate.cpp line 254: > >> 252: const Unique_Node_List nodes_with_same_ctrl = find_nodes_with_same_ctrl(start_node, old_uncommon_proj); >> 253: DataNodeGraph data_node_graph(nodes_with_same_ctrl, this); >> 254: auto& orig_to_new = data_node_graph.clone(new_uncommon_proj); > > This was a bit confusing. 
At first I thought you are cloning the `data_node_graph`, since the `auto` did not tell me that here we are getting a hash-table back. > I wonder if this cloning should be done in the constructor of `DataNodeGraph`. The beauty of packing it into the constructor is that you have fewer lines here. And that is probably beneficial if you are going to use the class elsewhere -> less code duplication. > src/hotspot/share/opto/loopnode.hpp line 1889: > >> 1887: // 1. Clone the data nodes >> 1888: // 2. Fix the cloned data inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. >> 1889: class DataNodeGraph : public StackObj { > > You could have a typedef for `ResizeableResourceHashtable`. Then you don't need to use `auto` for it elsewhere, and it is clear what it is. > Suggestion: `OrigToNewHashtable`. The name could mention that we are cloning. And maybe you could do the work in the constructor, and just have accessors for the finished products, such as `_orig_to_new`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509063868 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509048691 From epeter at openjdk.org Fri Mar 1 14:14:56 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 14:14:56 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 13:56:52 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/loopnode.hpp line 1889: >> >>> 1887: // 1. Clone the data nodes >>> 1888: // 2. Fix the cloned data inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. >>> 1889: class DataNodeGraph : public StackObj { >> >> You could have a typedef for `ResizeableResourceHashtable`. Then you don't need to use `auto` for it elsewhere, and it is clear what it is. >> Suggestion: `OrigToNewHashtable`. 
> > The name could mention that we are cloning. And maybe you could do the work in the constructor, and just have accessors for the finished products, such as `_orig_to_new`. Suggestion for better name `CloneDataNodeGraph`. Do you assert that only data nodes are cloned, and no CFG nodes? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509050141 From epeter at openjdk.org Fri Mar 1 14:14:55 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 14:14:55 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 13:27:38 GMT, Christian Hagedorn wrote: > In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. > > #### Redo refactoring of `create_bool_from_template_assertion_predicate()` > On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). > > #### Share data graph cloning code - start from existing code > This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: > > 1. Collect data nodes to clone by using a node filter > 2. 
Clone the collected nodes (their data and control inputs still point to the old nodes) > 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. > > #### Shared data graph cloning class > Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: > > 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] > 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] > > `... Looks like a nice refactoring! I left a few comments and questions :) src/hotspot/share/opto/loopPredicate.cpp line 254: > 252: const Unique_Node_List nodes_with_same_ctrl = find_nodes_with_same_ctrl(start_node, old_uncommon_proj); > 253: DataNodeGraph data_node_graph(nodes_with_same_ctrl, this); > 254: auto& orig_to_new = data_node_graph.clone(new_uncommon_proj); This was a bit confusing. At first I thought you are cloning the `data_node_graph`, since the `auto` did not tell me that here we are getting a hash-table back. I wonder if this cloning should be done in the constructor of `DataNodeGraph`. 
src/hotspot/share/opto/loopPredicate.cpp line 255: > 253: DataNodeGraph data_node_graph(nodes_with_same_ctrl, this); > 254: auto& orig_to_new = data_node_graph.clone(new_uncommon_proj); > 255: fix_cloned_data_node_controls(old_uncommon_proj, new_uncommon_proj, orig_to_new); And is there a reason why `fix_cloned_data_node_controls` is not part of the `DataNodeGraph` class? Is there any use of the class where we don't have to call `fix_cloned_data_node_controls`? src/hotspot/share/opto/loopPredicate.cpp line 256: > 254: auto& orig_to_new = data_node_graph.clone(new_uncommon_proj); > 255: fix_cloned_data_node_controls(old_uncommon_proj, new_uncommon_proj, orig_to_new); > 256: Node** cloned_node_ptr = orig_to_new.get(start_node); Boah, this `**` is a bit nasty. Would have been nicer if there was a reference pass instead, which checks already that the element exists. src/hotspot/share/opto/loopPredicate.cpp line 265: > 263: void PhaseIdealLoop::fix_cloned_data_node_controls( > 264: const ProjNode* old_uncommon_proj, Node* new_uncommon_proj, > 265: const ResizeableResourceHashtable& orig_to_new) { Suggestion: const ResizeableResourceHashtable& orig_to_new) { This might also help with understanding the indentation. But this is a taste question for sure. src/hotspot/share/opto/loopPredicate.cpp line 271: > 269: set_ctrl(clone, new_uncommon_proj); > 270: } > 271: }); Indentation is suboptimal here. I found it difficult to read. Style guide: Indentation for multi-line lambda: c.do_entries([&] (const X& x) { do_something(x, a); do_something1(x, b); do_something2(x, c); }); src/hotspot/share/opto/loopPredicate.cpp line 291: > 289: for (uint i = 1; i < next->req(); i++) { > 290: Node* in = next->in(i); > 291: if (!in->is_Phi()) { What happened with the `is_Phi`? Is it not needed anymore? src/hotspot/share/opto/loopnode.hpp line 1889: > 1887: // 1. Clone the data nodes > 1888: // 2. 
Fix the cloned data inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. > 1889: class DataNodeGraph : public StackObj { You could have a typedef for `ResizeableResourceHashtable`. Then you don't need to use `auto` for it elsewhere, and it is clear what it is. Suggestion: `OrigToNewHashtable`. src/hotspot/share/opto/loopnode.hpp line 1921: > 1919: rewire_clones_to_cloned_inputs(); > 1920: return _orig_to_new; > 1921: } Currently, it looks like one could call `clone` multiple times. But I think that would be wrong, right? That is why I'd put all the active logic in the constructor, and only the passive stuff is publicly accessible, with `const` to indicate that these don't have any effect. src/hotspot/share/opto/loopopts.cpp line 4519: > 4517: _orig_to_new.iterate_all([&](Node* node, Node* clone) { > 4518: for (uint i = 1; i < node->req(); i++) { > 4519: Node** cloned_input = _orig_to_new.get(node->in(i)); You don't need to check for `is_Phi` on `node->in(i)` anymore? ------------- Changes requested by epeter (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/18080#pullrequestreview-1911220168 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509038222 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509065385 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509040263 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509045154 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509044654 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509060128 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509047305 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509057459 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509060906 From epeter at openjdk.org Fri Mar 1 14:14:56 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 14:14:56 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: <7kG-lqyCbXktVszrDRPmc2SkQYXxnqcG9HmCmR5YSCQ=.1bd20dcb-5285-4746-b579-502029c0d9bd@github.com> On Fri, 1 Mar 2024 13:58:09 GMT, Emanuel Peter wrote: >> The name could mention that we are cloning. And maybe you could do the work in the constructor, and just have accessors for the finished products, such as `_orig_to_new`. > > Suggestion for better name `CloneDataNodeGraph`. Do you assert that only data nodes are cloned, and no CFG nodes? Yes, you do verify it, great! 
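To make the two-step scheme discussed in this thread concrete (first clone the collected data nodes, then rewire the cloned inputs through an orig->new mapping, with all the active logic in the constructor as suggested), here is a standalone sketch. This is illustrative Java with hypothetical names, not the HotSpot `DataNodeGraph` code:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class Node {
    final String label;
    final List<Node> inputs = new ArrayList<>();
    Node(String label) { this.label = label; }
}

// All cloning work happens in the constructor; callers only read the
// finished orig->new mapping through an accessor.
final class DataNodeGraphSketch {
    private final Map<Node, Node> origToNew = new IdentityHashMap<>();

    DataNodeGraphSketch(Collection<Node> nodesToClone) {
        // Step 1: clone every collected node; the clones' inputs still
        // point at the old nodes.
        for (Node n : nodesToClone) {
            Node clone = new Node(n.label);
            clone.inputs.addAll(n.inputs);
            origToNew.put(n, clone);
        }
        // Step 2: rewire each cloned input that points at an old node in the
        // collected set to the corresponding clone; inputs outside the set
        // stay shared.
        for (Node clone : origToNew.values()) {
            for (int i = 0; i < clone.inputs.size(); i++) {
                Node mapped = origToNew.get(clone.inputs.get(i));
                if (mapped != null) {
                    clone.inputs.set(i, mapped);
                }
            }
        }
    }

    Map<Node, Node> origToNew() { return Collections.unmodifiableMap(origToNew); }
}
```

A node outside the collected set (for example a shared constant) is left untouched by step 2, which mirrors how only nodes passing the collection filter get cloned and rewired.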
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509050744 From sviswanathan at openjdk.org Fri Mar 1 17:04:55 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 1 Mar 2024 17:04:55 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> Message-ID: On Fri, 1 Mar 2024 08:15:50 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> Update description of Poly1305 algo > > src/hotspot/cpu/x86/assembler_x86.cpp line 5148: > >> 5146: assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); >> 5147: InstructionMark im(this); >> 5148: InstructionAttr attributes(vector_len, /* rex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true); > > uses_vl should be false here. > > BTW, this assertion looks very fuzzy: you are checking for two target features in one instruction. Apparently, the instruction is meant to use AVX512_IFMA only for 512-bit vector length, and for narrower vectors it needs AVX_IFMA. > > Let's either keep this strictly for AVX_IFMA (for AVX512_IFMA we already have evpmadd52[l/h]uq), or if you truly want to make this a generic one, then split the assertion > > `assert((avx_ifma && vector_len <= 256) || (avx512_ifma && (vector_len == 512 || VM_Version::support_vl())));` > > And then you may pass uses_vl at true. It would be good to make this instruction generic.
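For what it's worth, the split assertion suggested above can be written out as a standalone predicate to check its intended truth table (an illustrative sketch; the flag names are hypothetical stand-ins, not the real `VM_Version` API):

```java
final class IfmaEncodingCheck {
    // AVX_IFMA covers 128/256-bit vectors; AVX512_IFMA covers 512-bit vectors,
    // or narrower ones only when VL is also supported.
    static boolean encodable(boolean avxIfma, boolean avx512Ifma,
                             boolean supportsVl, int vectorLenBits) {
        return (avxIfma && vectorLenBits <= 256)
            || (avx512Ifma && (vectorLenBits == 512 || supportsVl));
    }
}
```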
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1509271081 From kvn at openjdk.org Fri Mar 1 17:25:42 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 1 Mar 2024 17:25:42 GMT Subject: RFR: 8327105: compiler.compilercontrol.share.scenario.Executor should listen on loopback address only In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 13:06:18 GMT, Jaikiran Pai wrote: > Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327105? > > The commit here changes the internal test class `compiler.compilercontrol.share.scenario.Executor` to bind to a loopback address to prevent other hosts on the network from unexpectedly communicating on the `ServerSocket`. > > The original interference was noticed in some `tier7` tests which use this `Executor` class. With the change proposed in this PR, `tier1`, `tier2`, `tier3` and `tier7`, `tier8` have been run and that issue hasn't been noticed in this class anymore. Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18079#pullrequestreview-1911666478 From kvn at openjdk.org Fri Mar 1 17:26:52 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 1 Mar 2024 17:26:52 GMT Subject: RFR: 8327108: compiler.lib.ir_framework.shared.TestFrameworkSocket should listen on loopback address only In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 12:50:58 GMT, Jaikiran Pai wrote: > Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327108? > > As noted in the JBS issue, before this proposed change, the internal test framework code in `compiler.lib.ir_framework.shared.TestFrameworkSocket` was binding a `java.net.ServerSocket` to "any address". This can lead to interference from other hosts on the network when the tests are run.
The change here proposes to bind this `ServerSocket` to the loopback address and reduce the chances of such interference. > > Originally, the interference issues were noticed in CI when `tier3` was run. With the change proposed in this PR, I've run `tier1`, `tier2` and `tier3` in our CI environment and they all passed. Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18078#pullrequestreview-1911667546 From gdub at openjdk.org Fri Mar 1 17:54:01 2024 From: gdub at openjdk.org (Gilles Duboscq) Date: Fri, 1 Mar 2024 17:54:01 GMT Subject: RFR: 8326692: JVMCI Local.endBci is off-by-one Message-ID: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> In class files, in the local variable table, local variables have a start BCI and a length. The local variable has a value from BCI (inclusive) until BCI + length (exclusive). On the other end, JVMCI stores that information in `Local` objects with a start BCI and an end BCI (inclusive). Currently the parser just uses BCI+length to compute the end BCI, leading to an off-by-one error. A simple test checking that the start and end BCIs are within the method's bytecode is added. It fails without the fix.
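The conversion at the heart of the fix described above fits in one line; a standalone sketch (hypothetical helper name, not the actual JVMCI parser code):

```java
final class LocalRangeSketch {
    // Class file encoding: a local is live over [startBci, startBci + length),
    // i.e. the end is exclusive. JVMCI's Local stores [startBci, endBci] with
    // an inclusive end, so the inclusive end is startBci + length - 1; using
    // startBci + length directly is the off-by-one being fixed here.
    static int inclusiveEndBci(int startBci, int length) {
        return startBci + length - 1;
    }
}
```

For example, a local live over bytecodes 4, 5 and 6 is encoded as start 4, length 3 in the class file, and should become start 4, end 6 inclusive on the JVMCI side.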
------------- Commit messages: - Fix JVMCI Local endBCI off-by-one error - Add javadoc and minimal test for Local.getStart/EndBCI Changes: https://git.openjdk.org/jdk/pull/18087/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18087&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8326692 Stats: 31 lines in 3 files changed: 27 ins; 1 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18087.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18087/head:pull/18087 PR: https://git.openjdk.org/jdk/pull/18087 From kvn at openjdk.org Fri Mar 1 18:41:52 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 1 Mar 2024 18:41:52 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: On Fri, 1 Mar 2024 07:39:16 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > clean for other architecture For aarch64, do we also need to change the *.m4 files? Anything in the aarch64_vector* files? I will run our testing with the current patch.
------------- PR Review: https://git.openjdk.org/jdk/pull/18075#pullrequestreview-1911800568 From dnsimon at openjdk.org Fri Mar 1 18:54:53 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 1 Mar 2024 18:54:53 GMT Subject: RFR: 8326692: JVMCI Local.endBci is off-by-one In-Reply-To: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> References: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> Message-ID: On Fri, 1 Mar 2024 17:48:51 GMT, Gilles Duboscq wrote: > In class files, in the local variable table, local variables have a start BCI and a length. The local variable has a value from BCI (inclusive) until BCI + length (exclusive). > On the other end, JVMCI stores that information in `Local` objects with a start BCI and an end BCI (inclusive). > Currently the parser just uses BCI+length to compute the end BCI, leading to an off-by-one error. > > A simple test checking that the start and end BCIs are within the method's bytecode is added. It fails without the fix. Thanks for fixing this Gilles. ------------- Marked as reviewed by dnsimon (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18087#pullrequestreview-1911818421 From never at openjdk.org Fri Mar 1 19:02:42 2024 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 1 Mar 2024 19:02:42 GMT Subject: RFR: 8326692: JVMCI Local.endBci is off-by-one In-Reply-To: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> References: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> Message-ID: On Fri, 1 Mar 2024 17:48:51 GMT, Gilles Duboscq wrote: > In class files, in the local variable table, local variables have a start BCI and a length. The local variable has a value from BCI (inclusive) until BCI + length (exclusive). > On the other end, JVMCI stores that information in `Local` objects with a start BCI and an end BCI (inclusive). 
> Currently the parser just uses BCI+length to compute the end BCI, leading to an off-by-one error. > > A simple test checking that the start and end BCIs are within the method's bytecode is added. It fails without the fix. Marked as reviewed by never (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/18087#pullrequestreview-1911829737 From duke at openjdk.org Fri Mar 1 19:17:12 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Fri, 1 Mar 2024 19:17:12 GMT Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs Message-ID: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. Below is the performance data on an Intel Tiger Lake machine.

Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
-- | -- | -- | --
MathBench.ceilDouble | 547979 | 2170198 | 3.96
MathBench.floorDouble | 547979 | 2167459 | 3.96
MathBench.rintDouble | 547962 | 2130499 | 3.89

------------- Commit messages: - 8327147: optimized implementation of round operation for x86_64 CPUs Changes: https://git.openjdk.org/jdk/pull/18089/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18089&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327147 Stats: 14 lines in 1 file changed: 14 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18089/head:pull/18089 PR: https://git.openjdk.org/jdk/pull/18089 From kvn at openjdk.org Fri Mar 1 21:07:52 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 1 Mar 2024 21:07:52 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: <02i_iMQdWewT8LDPxjnCcJ4EcojEMX763TVw8xGCo5I=.95697bd7-162d-4605-a4f0-b7689ddcbfa4@github.com>
On Fri, 1 Mar 2024 07:39:16 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > clean for other architecture My testing shows that when we do **cross compilation** on linux-x64 I got: Warning: unused operand (no_rax_RegP) Normal linux-x64 build passed. The operand is used only in one place in ZGC barriers code: `src/hotspot/cpu/x86/gc/z/z_x86_64.ad` Maybe it is not included during cross compilation. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1973917187 From kvn at openjdk.org Fri Mar 1 21:27:43 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 1 Mar 2024 21:27:43 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: On Fri, 1 Mar 2024 07:39:16 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > clean for other architecture I am testing the move `operand no_rax_RegP` into `z_x86_64.ad`.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1973940841 From kvn at openjdk.org Fri Mar 1 21:32:56 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 1 Mar 2024 21:32:56 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: On Fri, 1 Mar 2024 07:39:16 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > clean for other architecture FTR, I also bailout from adlc parser when we have unused operand to force build failure: +++ b/src/hotspot/share/adlc/archDesc.cpp @@ -773,8 +774,11 @@ bool ArchDesc::check_usage() { cnt++; } } - if (cnt) fprintf(stderr, "\n-------Warning: total %d unused operands\n", cnt); - + if (cnt) { + fprintf(stderr, "\n-------Warning: total %d unused operands\n", cnt); + _semantic_errs++; + return false; + } return true; } I don't think we need it in these changes but it helped me to catch the missing case. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1973947309 From sviswanathan at openjdk.org Fri Mar 1 21:51:00 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 1 Mar 2024 21:51:00 GMT Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs In-Reply-To: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: <-wYl_0Etz21KtNOTz-Q9-hyxIvKJC2ufHC42IOpYcLM=.e5150619-c877-4527-9772-ea552ce4871c@github.com> On Fri, 1 Mar 2024 19:11:58 GMT, Srinivas Vamsi Parasa wrote: > The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. > > Below is the performance data on an Intel Tiger Lake machine. > > Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup > -- | -- | -- | -- > MathBench.ceilDouble | 547979 | 2170198 | 3.96 > MathBench.floorDouble | 547979 | 2167459 | 3.96 > MathBench.rintDouble | 547962 | 2130499 | 3.89 src/hotspot/cpu/x86/x86.ad line 3895: > 3893: > 3894: /* > 3895: instruct roundD_mem(legRegD dst, memory src, immU8 rmode) %{ The roundD_mem instruct could be removed now that it is not used. Also the PR could be titled as "Improve performance of Math ceil, floor, and rint for x86". 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18089#discussion_r1509583662 From dlong at openjdk.org Fri Mar 1 21:51:00 2024 From: dlong at openjdk.org (Dean Long) Date: Fri, 1 Mar 2024 21:51:00 GMT Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs In-Reply-To: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: On Fri, 1 Mar 2024 19:11:58 GMT, Srinivas Vamsi Parasa wrote: > The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. > > Below is the performance data on an Intel Tiger Lake machine. > > Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup > -- | -- | -- | -- > MathBench.ceilDouble | 547979 | 2170198 | 3.96 > MathBench.floorDouble | 547979 | 2167459 | 3.96 > MathBench.rintDouble | 547962 | 2130499 | 3.89 src/hotspot/cpu/x86/x86.ad line 3895: > 3893: > 3894: /* > 3895: instruct roundD_mem(legRegD dst, memory src, immU8 rmode) %{ Don't we want roundD_mem enabled, for both roundsd (UseAVX == 0) and vroundsd (UseAVX > 0)? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18089#discussion_r1509586262 From sviswanathan at openjdk.org Fri Mar 1 22:27:52 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 1 Mar 2024 22:27:52 GMT Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs In-Reply-To: References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: <9y7CJbRaeJdn4lR3mwrYbDmvldllWAeUPmmlXHh2Jg0=.c3167586-3cbb-4243-a7e5-d56320a598af@github.com> On Fri, 1 Mar 2024 21:46:37 GMT, Dean Long wrote: >> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. >> >> Below is the performance data on an Intel Tiger Lake machine. >> >> Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup >> -- | -- | -- | -- >> MathBench.ceilDouble | 547979 | 2170198 | 3.96 >> MathBench.floorDouble | 547979 | 2167459 | 3.96 >> MathBench.rintDouble | 547962 | 2130499 | 3.89 > > src/hotspot/cpu/x86/x86.ad line 3895: > >> 3893: >> 3894: /* >> 3895: instruct roundD_mem(legRegD dst, memory src, immU8 rmode) %{ > > Don't we want roundD_mem enabled, for both roundsd (UseAVX == 0) and vroundsd (UseAVX > 0)? @dean-long the roundD_mem instruct is the cause of slow performance due to a false dependency. 
It generates an instruction of the following form, which has a 128-bit result:

    roundsd xmm0, memory_src, mode
    vroundsd xmm0, xmm0, memory_src, mode

xmm0 bits 0:63 are the result of the round operation on memory_src; xmm0 bits 64:127 depend on the old value of xmm0 (false dependency). Forcing the load of memory_src into a register before the operation, as below, removes the false dependency:

    vmovsd xmm0, memory_src    ; bits 64 and above are cleared by vmovsd
    vroundsd xmm0, xmm0, xmm0, mode

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18089#discussion_r1509635469 From dlong at openjdk.org Fri Mar 1 23:18:52 2024 From: dlong at openjdk.org (Dean Long) Date: Fri, 1 Mar 2024 23:18:52 GMT Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs In-Reply-To: <-wYl_0Etz21KtNOTz-Q9-hyxIvKJC2ufHC42IOpYcLM=.e5150619-c877-4527-9772-ea552ce4871c@github.com> References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> <-wYl_0Etz21KtNOTz-Q9-hyxIvKJC2ufHC42IOpYcLM=.e5150619-c877-4527-9772-ea552ce4871c@github.com> Message-ID: On Fri, 1 Mar 2024 21:44:32 GMT, Sandhya Viswanathan wrote: >> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. >> >> Below is the performance data on an Intel Tiger Lake machine. 
>> Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup >> -- | -- | -- | -- >> MathBench.ceilDouble | 547979 | 2170198 | 3.96 >> MathBench.floorDouble | 547979 | 2167459 | 3.96 >> MathBench.rintDouble | 547962 | 2130499 | 3.89 > > src/hotspot/cpu/x86/x86.ad line 3895: > >> 3893: >> 3894: /* >> 3895: instruct roundD_mem(legRegD dst, memory src, immU8 rmode) %{ > > The roundD_mem instruct could be removed now that it is not used. Also the PR could be titled as "Improve performance of Math ceil, floor, and rint for x86". OK, let's remove roundD_mem to avoid confusion. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18089#discussion_r1509660183 From kvn at openjdk.org Fri Mar 1 23:34:52 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 1 Mar 2024 23:34:52 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: On Fri, 1 Mar 2024 07:39:16 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > clean for other architecture Testing build with moved `operand no_rax_RegP` passed. 
Please update changes with this: diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad index d43929efd3e..aef3453b0b1 100644 --- a/src/hotspot/cpu/x86/x86_64.ad +++ b/src/hotspot/cpu/x86/x86_64.ad @@ -2663,18 +2561,6 @@ operand rRegN() %{ // the RBP is used as a proper frame pointer and is not included in ptr_reg. As a // result, RBP is not included in the output of the instruction either. -operand no_rax_RegP() -%{ - constraint(ALLOC_IN_RC(ptr_no_rax_reg)); - match(RegP); - match(rbx_RegP); - match(rsi_RegP); - match(rdi_RegP); - - format %{ %} - interface(REG_INTER); -%} - // This operand is not allowed to use RBP even if // RBP is not used to hold the frame pointer. operand no_rbp_RegP() diff --git a/src/hotspot/cpu/x86/gc/z/z_x86_64.ad b/src/hotspot/cpu/x86/gc/z/z_x86_64.ad index d178805dfc7..0cc2ea03b35 100644 --- a/src/hotspot/cpu/x86/gc/z/z_x86_64.ad +++ b/src/hotspot/cpu/x86/gc/z/z_x86_64.ad @@ -99,6 +99,18 @@ static void z_store_barrier(MacroAssembler& _masm, const MachNode* node, Address %} +operand no_rax_RegP() +%{ + constraint(ALLOC_IN_RC(ptr_no_rax_reg)); + match(RegP); + match(rbx_RegP); + match(rsi_RegP); + match(rdi_RegP); + + format %{ %} + interface(REG_INTER); +%} + // Load Pointer instruct zLoadP(rRegP dst, memory mem, rFlagsReg cr) %{ ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1974071060 From jpai at openjdk.org Sat Mar 2 01:47:53 2024 From: jpai at openjdk.org (Jaikiran Pai) Date: Sat, 2 Mar 2024 01:47:53 GMT Subject: RFR: 8327108: compiler.lib.ir_framework.shared.TestFrameworkSocket should listen on loopback address only In-Reply-To: References: Message-ID: <0eNqq-XwZUEVM6vEtKi7WloB0hzw70lcwdp08jyOjbM=.d31c9e5f-2c1f-4d3e-8f75-f85c6b8fb54f@github.com> On Fri, 1 Mar 2024 12:50:58 GMT, Jaikiran Pai wrote: > Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327108? 
> > As noted in the JBS issue, before this proposed change, the internal test framework code in `compiler.lib.ir_framework.shared.TestFrameworkSocket` was binding a `java.net.ServerSocket` to "any address". This can lead to interference from other hosts on the network when the tests are run. The change here proposes to bind this `ServerSocket` to the loopback address and reduce the chances of such interference. > > Originally, the interference issues were noticed in CI when `tier3` was run. With the change proposed in this PR, I've run `tier1`, `tier2` and `tier3` in our CI environment and they all passed. Thank you Christian and Vladimir for the reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18078#issuecomment-1974177678 From jpai at openjdk.org Sat Mar 2 01:47:58 2024 From: jpai at openjdk.org (Jaikiran Pai) Date: Sat, 2 Mar 2024 01:47:58 GMT Subject: RFR: 8327105: compiler.compilercontrol.share.scenario.Executor should listen on loopback address only In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 13:06:18 GMT, Jaikiran Pai wrote: > Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327105? > > The commit here changes the internal test class `compiler.compilercontrol.share.scenario.Executor` to bind to a loopback address to prevent other hosts on the network from unexpectedly communicating on the `ServerSocket`. > > The original interference was noticed in some `tier7` tests which use this `Executor` class. With the change proposed in this PR, `tier1`, `tier2`, `tier3` and `tier7`, `tier8` have been run and that issue hasn't been noticed in this class anymore. Thank you for the reviews, Christian and Vladimir. 
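The loopback-only binding that these two test fixes describe can be sketched as follows. This is a minimal illustration; the class and method names here are invented for the sketch and are not the actual `TestFrameworkSocket` or `Executor` code:

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class LoopbackSocketSketch {
    // Create an unbound ServerSocket, then bind it explicitly to the loopback
    // address with an ephemeral port (port 0), so only processes on the same
    // host can connect.
    public static ServerSocket createLoopbackServerSocket() throws IOException {
        ServerSocket serverSocket = new ServerSocket();
        serverSocket.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
        return serverSocket;
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket ss = createLoopbackServerSocket()) {
            System.out.println(ss.getInetAddress().isLoopbackAddress()); // prints "true"
        }
    }
}
```

By contrast, constructors such as `new ServerSocket(port)` bind to the wildcard ("any") address, which is what let unrelated hosts on the network interfere with the tests.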
------------- PR Comment: https://git.openjdk.org/jdk/pull/18079#issuecomment-1974179133 From jpai at openjdk.org Sat Mar 2 01:47:53 2024 From: jpai at openjdk.org (Jaikiran Pai) Date: Sat, 2 Mar 2024 01:47:53 GMT Subject: Integrated: 8327108: compiler.lib.ir_framework.shared.TestFrameworkSocket should listen on loopback address only In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 12:50:58 GMT, Jaikiran Pai wrote: > Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327108? > > As noted in the JBS issue, before this proposed change, the internal test framework code in `compiler.lib.ir_framework.shared.TestFrameworkSocket` was binding a `java.net.ServerSocket` to "any address". This can lead to interference from other hosts on the network, when the tests are run. The change here proposes to bind this `ServerSocket` to loopback address and reduce the chances of such interference. > > Originally, the interference issues were noticed in CI when `tier3` was run. With the change proposed in this PR, I've run `tier1`, `tier2` and `tier3` in our CI environment and they all passed. This pull request has now been integrated. 
Changeset: a9c17a22 Author: Jaikiran Pai URL: https://git.openjdk.org/jdk/commit/a9c17a22ca8e64d12e28e272e3f4845297290854 Stats: 7 lines in 1 file changed: 3 ins; 1 del; 3 mod 8327108: compiler.lib.ir_framework.shared.TestFrameworkSocket should listen on loopback address only Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/18078 From jpai at openjdk.org Sat Mar 2 01:47:58 2024 From: jpai at openjdk.org (Jaikiran Pai) Date: Sat, 2 Mar 2024 01:47:58 GMT Subject: Integrated: 8327105: compiler.compilercontrol.share.scenario.Executor should listen on loopback address only In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 13:06:18 GMT, Jaikiran Pai wrote: > Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327105? > > The commit here changes the internal test class `compiler.compilercontrol.share.scenario.Executor` to bind to a loopback address to prevent other hosts on the network from unexpectedly communicating on the `ServerSocket`. > > The original interference was noticed in some `tier7` tests which use this `Executor` class. With the change proposed in this PR, `tier1`, `tier2`, `tier3` and `tier7`, `tier8` have been run and that issue hasn't been noticed in this class anymore. This pull request has now been integrated. 
Changeset: f68a4b9f Author: Jaikiran Pai URL: https://git.openjdk.org/jdk/commit/f68a4b9fc4b0add186754465bbeb908b8362be8d Stats: 7 lines in 2 files changed: 3 ins; 0 del; 4 mod 8327105: compiler.compilercontrol.share.scenario.Executor should listen on loopback address only Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/18079 From fyang at openjdk.org Sat Mar 2 03:08:46 2024 From: fyang at openjdk.org (Fei Yang) Date: Sat, 2 Mar 2024 03:08:46 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: On Fri, 1 Mar 2024 07:39:16 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > clean for other architecture The RISC-V part of the change looks fine. Note that GHA failure is infrastructual. Debian sid is broken for now: https://bugs.openjdk.org/browse/JDK-8326960 ------------- PR Review: https://git.openjdk.org/jdk/pull/18075#pullrequestreview-1912564165 From epeter at openjdk.org Sat Mar 2 10:58:44 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Sat, 2 Mar 2024 10:58:44 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> Message-ID: On Tue, 27 Feb 2024 18:23:41 GMT, Jasmine Karthikeyan wrote: >> @jaskarth >>> I've designed this benchmark >> >> Nice. 
Can you also post the generated assembly for Baseline/Patch? >> I'm just worried that there is some method call, or something else that does not get cleanly inlined and could mess with the benchmark. > > @eme64 Sure, here is the assembly for the baseline: https://gist.github.com/jaskarth/1fe6f00a5b37fe3efb0dd6a2d24840e0 > And after: https://gist.github.com/jaskarth/99c56e2f081f996987b96d7e866aca6c > > I must have missed this originally when evaluating the benchmark, but looking at the assembly it seems like the baseline JDK creates a `CMove` for that ternary already. I made a quick patch to disable where `PhaseIdealLoop::conditional_move` is called, and the performance still stays the same on the benchmark. I've also attached that assembly if it's of interest: https://gist.github.com/jaskarth/7b12b688f82a3b8e854785f1827b0c20 @jaskarth It seems we were aware of such issues a long time ago: https://bugs.openjdk.org/browse/JDK-8039104: Don't use Math.min/max intrinsic on x86 So we may actually have to use `if` for min/max instead of CMove, at least on some platforms. But some platforms may have worse branch predictors, and then we should use CMove more often. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1974762928 From gli at openjdk.org Sat Mar 2 11:31:51 2024 From: gli at openjdk.org (Guoxiong Li) Date: Sat, 2 Mar 2024 11:31:51 GMT Subject: RFR: 8326692: JVMCI Local.endBci is off-by-one In-Reply-To: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> References: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> Message-ID: On Fri, 1 Mar 2024 17:48:51 GMT, Gilles Duboscq wrote: > In class files, in the local variable table, local variables have a start BCI and a length. The local variable has a value from BCI (inclusive) until BCI + length (exclusive). 
> On the other end, JVMCI stores that information in `Local` objects with a start BCI and an end BCI (inclusive). > Currently the parser just uses BCI+length to compute the end BCI, leading to an off-by-one error. > > A simple test checking that the start and end BCIs are within the method's bytecode is added. It fails without the fix. src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotResolvedJavaMethodImpl.java line 635: > 633: for (int i = 0; i < localVariableTableLength; i++) { > 634: final int startBci = UNSAFE.getChar(localVariableTableElement + config.localVariableTableElementStartBciOffset); > 635: final int endBci = startBci + UNSAFE.getChar(localVariableTableElement + config.localVariableTableElementLengthOffset) - 1; Just a question: Can the length of a local variable be 0? **If the code length is 0, the `endBci` here may be less than `startBci`.** ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18087#discussion_r1509950353 From gli at openjdk.org Sat Mar 2 11:44:51 2024 From: gli at openjdk.org (Guoxiong Li) Date: Sat, 2 Mar 2024 11:44:51 GMT Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs In-Reply-To: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: On Fri, 1 Mar 2024 19:11:58 GMT, Srinivas Vamsi Parasa wrote: > The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. > > Below is the performance data on an Intel Tiger Lake machine. 
> Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup > -- | -- | -- | -- > MathBench.ceilDouble | 547979 | 2170198 | 3.96 > MathBench.floorDouble | 547979 | 2167459 | 3.96 > MathBench.rintDouble | 547962 | 2130499 | 3.89 src/hotspot/cpu/x86/x86.ad line 3894: > 3892: %} > 3893: > 3894: /* Just a note: if we don't need some code, we should remove it instead of commenting it out. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18089#discussion_r1509951974 From dnsimon at openjdk.org Sat Mar 2 12:12:51 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Sat, 2 Mar 2024 12:12:51 GMT Subject: RFR: 8326692: JVMCI Local.endBci is off-by-one In-Reply-To: References: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> Message-ID: On Sat, 2 Mar 2024 11:28:43 GMT, Guoxiong Li wrote: >> In class files, in the local variable table, local variables have a start BCI and a length. The local variable has a value from BCI (inclusive) until BCI + length (exclusive). >> On the other end, JVMCI stores that information in `Local` objects with a start BCI and an end BCI (inclusive). >> Currently the parser just uses BCI+length to compute the end BCI, leading to an off-by-one error. >> >> A simple test checking that the start and end BCIs are within the method's bytecode is added. It fails without the fix. 
> > src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotResolvedJavaMethodImpl.java line 635: > >> 633: for (int i = 0; i < localVariableTableLength; i++) { >> 634: final int startBci = UNSAFE.getChar(localVariableTableElement + config.localVariableTableElementStartBciOffset); >> 635: final int endBci = startBci + UNSAFE.getChar(localVariableTableElement + config.localVariableTableElementLengthOffset) - 1; > > Just a question: Can the length of a local variable be 0? > > **If the code length is 0, the `endBci` here may be less than `startBci`.** I don't see anything in [JVMS 4.7.13](https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html#jvms-4.7.13) that says it cannot be 0. It basically means the LVT entry is useless (denotes a local that is never alive) but is otherwise harmless. Maybe add this to the javadoc for `getEndBci()` to make the API user aware of this corner case: If the value returned is less than {@link #getStartBCI}, this object denotes a local that is never live. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18087#discussion_r1509955642 From gli at openjdk.org Sat Mar 2 12:24:52 2024 From: gli at openjdk.org (Guoxiong Li) Date: Sat, 2 Mar 2024 12:24:52 GMT Subject: RFR: 8326692: JVMCI Local.endBci is off-by-one In-Reply-To: References: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> Message-ID: On Sat, 2 Mar 2024 12:10:35 GMT, Doug Simon wrote: >> src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotResolvedJavaMethodImpl.java line 635: >> >>> 633: for (int i = 0; i < localVariableTableLength; i++) { >>> 634: final int startBci = UNSAFE.getChar(localVariableTableElement + config.localVariableTableElementStartBciOffset); >>> 635: final int endBci = startBci + UNSAFE.getChar(localVariableTableElement + config.localVariableTableElementLengthOffset) - 1; >> >> Just a question: Can the length of a local variable be 0? 
>> >> **If the code length is 0, the `endBci` here may be less than `startBci`.** > > I don't see anything in [JVMS 4.7.13](https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html#jvms-4.7.13) that says it cannot be 0. It basically means the LVT entry is useless (denotes a local that is never alive) but is otherwise harmless. > Maybe add this to the javadoc for `getEndBci()` to make the API user aware of this corner case: > > If the value returned is less than {@link #getStartBCI}, this object denotes a local that is never live. The reason, which causes this problem, is that the `Local::endBci` includes itself instead of excluding it. But now, we can only fix the javadoc just as you suggested. > If the value returned is less than {@link #getStartBCI}, this object denotes a local that is never live. `a local variable` may be better to `a local` above. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18087#discussion_r1509957764 From jbhateja at openjdk.org Sat Mar 2 16:22:22 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 2 Mar 2024 16:22:22 GMT Subject: RFR: 8318650: Optimized subword gather for x86 targets. [v17] In-Reply-To: References: Message-ID: > Hi All, > > This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features. > > Following is the summary of changes:- > > 1) Intrinsify sub-word gather using hybrid algorithm which initially partially unrolls scalar loop to accumulates values from gather indices into a quadword(64bit) slice followed by vector permutation to place the slice into appropriate vector lanes, it prevents code bloating and generates compact JIT sequence. This coupled with savings from expansive array allocation in existing java implementation translates into significant performance of 1.5-10x gains with included micro. 
> > ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d) > > > 2) Patch was also compared against modified java fallback implementation by replacing temporary array allocation with zero initialized vector and a scalar loops which inserts gathered values into vector. But, vector insert operation in higher vector lanes is a three step process which first extracts the upper vector 128 bit lane, updates it with gather subword value and then inserts the lane back to its original position. This makes inserts into higher order lanes costly w.r.t to proposed solution. In addition generated JIT code for modified fallback implementation was very bulky. This may impact in-lining decisions into caller contexts. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review resolutions. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16354/files - new: https://git.openjdk.org/jdk/pull/16354/files/b971fbb7..0b270d2e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16354&range=16 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16354&range=15-16 Stats: 25 lines in 4 files changed: 10 ins; 9 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/16354.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16354/head:pull/16354 PR: https://git.openjdk.org/jdk/pull/16354 From jbhateja at openjdk.org Sat Mar 2 16:36:51 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 2 Mar 2024 16:36:51 GMT Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs In-Reply-To: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: On Fri, 1 Mar 2024 19:11:58 GMT, Srinivas Vamsi Parasa wrote: > The goal of this PR is to 
provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. > > Below is the performance data on an Intel Tiger Lake machine. > > Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup > -- | -- | -- | -- > MathBench.ceilDouble | 547979 | 2170198 | 3.96 > MathBench.floorDouble | 547979 | 2167459 | 3.96 > MathBench.rintDouble | 547962 | 2130499 | 3.89 Changes requested by jbhateja (Reviewer). src/hotspot/cpu/x86/x86.ad line 3884: > 3882: > 3883: instruct roundD_reg_avx(legRegD dst, legRegD src, immU8 rmode) %{ > 3884: predicate(UseAVX > 0); Can you push the predicate into the instruction encoding block and fold this pattern with roundD_reg? ------------- PR Review: https://git.openjdk.org/jdk/pull/18089#pullrequestreview-1912689349 PR Review Comment: https://git.openjdk.org/jdk/pull/18089#discussion_r1510001900 From gdub at openjdk.org Sat Mar 2 17:39:52 2024 From: gdub at openjdk.org (Gilles Duboscq) Date: Sat, 2 Mar 2024 17:39:52 GMT Subject: RFR: 8326692: JVMCI Local.endBci is off-by-one In-Reply-To: References: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> Message-ID: <_JNCAXiYyN2WCHzga4-m0hq4Fy-Na4ssbkegh0e_etE=.b9507756-eb7a-4b81-97d3-78f9824cfa17@github.com> On Sat, 2 Mar 2024 12:21:51 GMT, Guoxiong Li wrote: >> I don't see anything in [JVMS 4.7.13](https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html#jvms-4.7.13) that says it cannot be 0. It basically means the LVT entry is useless (denotes a local that is never alive) but is otherwise harmless. 
>> Maybe add this to the javadoc for `getEndBci()` to make the API user aware of this corner case: >> >> If the value returned is less than {@link #getStartBCI}, this object denotes a local that is never live. > > The reason, which causes this problem, is that the `Local::endBci` includes itself instead of excluding it. But now, we can only fix the javadoc just as you suggested. > >> If the value returned is less than {@link #getStartBCI}, this object denotes a local that is never live. > > `a local variable` may be better to `a local` above. I had checked the specs on that and came to the same conclusion. I also think the current state is fine in that regard in terms of code, since it just means that there is no bci where this local would be valid when checking both start and end bci. Adding a note about that to the javadoc is a good idea. I'll do that. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18087#discussion_r1510010422 From gdub at openjdk.org Sat Mar 2 17:58:01 2024 From: gdub at openjdk.org (Gilles Duboscq) Date: Sat, 2 Mar 2024 17:58:01 GMT Subject: RFR: 8326692: JVMCI Local.endBci is off-by-one [v2] In-Reply-To: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> References: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> Message-ID: <9uTHB3xtVIXw_dhZxFZBx6krgmipxeaa3DGIM52ueLs=.f63bd368-80d2-4fe6-b18c-f0896246957e@github.com> > In class files, in the local variable table, local variables have a start BCI and a length. The local variable has a value from BCI (inclusive) until BCI + length (exclusive). 
Gilles Duboscq has updated the pull request incrementally with one additional commit since the last revision: Add note about zero-length locals ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18087/files - new: https://git.openjdk.org/jdk/pull/18087/files/90e96b4e..fe1ee476 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18087&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18087&range=00-01 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18087.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18087/head:pull/18087 PR: https://git.openjdk.org/jdk/pull/18087 From gli at openjdk.org Sat Mar 2 23:44:51 2024 From: gli at openjdk.org (Guoxiong Li) Date: Sat, 2 Mar 2024 23:44:51 GMT Subject: RFR: 8326692: JVMCI Local.endBci is off-by-one [v2] In-Reply-To: <9uTHB3xtVIXw_dhZxFZBx6krgmipxeaa3DGIM52ueLs=.f63bd368-80d2-4fe6-b18c-f0896246957e@github.com> References: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> <9uTHB3xtVIXw_dhZxFZBx6krgmipxeaa3DGIM52ueLs=.f63bd368-80d2-4fe6-b18c-f0896246957e@github.com> Message-ID: On Sat, 2 Mar 2024 17:58:01 GMT, Gilles Duboscq wrote: >> In class files, in the local variable table, local variables have a start BCI and a length. The local variable has a value from BCI (inclusive) until BCI + length (exclusive). >> On the other end, JVMCI stores that information in `Local` objects with a start BCI and an end BCI (inclusive). >> Currently the parser just uses BCI+length to compute the end BCI, leading to an off-by-one error. >> >> A simple test checking that the start and end BCIs are within the method's bytecode is added. It fails without the fix. > > Gilles Duboscq has updated the pull request incrementally with one additional commit since the last revision: > > Add note about zero-length locals Looks good. ------------- Marked as reviewed by gli (Committer). 
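The inclusive/exclusive mismatch at the heart of this fix can be sketched with a few lines of plain Java (the helper names here are illustrative only, not the actual JVMCI parser code):

```java
public class EndBciDemo {
    // An LVT entry makes a local live in [startBci, startBci + length),
    // i.e. length is exclusive. JVMCI's Local stores an inclusive end BCI,
    // so the conversion has to subtract one.
    static int inclusiveEndBci(int startBci, int length) {
        return startBci + length - 1; // the fixed conversion
    }

    static int buggyEndBci(int startBci, int length) {
        return startBci + length; // off-by-one: one past the last live bci
    }

    public static void main(String[] args) {
        // A local live for 3 bytecodes starting at bci 10 covers bcis 10, 11, 12.
        System.out.println(inclusiveEndBci(10, 3)); // 12
        System.out.println(buggyEndBci(10, 3));     // 13, outside the live range
        // A zero-length entry yields end < start: a local that is never live.
        System.out.println(inclusiveEndBci(10, 0)); // 9
    }
}
```

The zero-length case is exactly the corner case the added javadoc note covers: end BCI less than start BCI denotes a local that is never live.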
PR Review: https://git.openjdk.org/jdk/pull/18087#pullrequestreview-1912780304 From gdub at openjdk.org Sun Mar 3 11:08:55 2024 From: gdub at openjdk.org (Gilles Duboscq) Date: Sun, 3 Mar 2024 11:08:55 GMT Subject: Integrated: 8326692: JVMCI Local.endBci is off-by-one In-Reply-To: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> References: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> Message-ID: On Fri, 1 Mar 2024 17:48:51 GMT, Gilles Duboscq wrote: > In class files, in the local variable table, local variables have a start BCI and a length. The local variable has a value from BCI (inclusive) until BCI + length (exclusive). > On the other end, JVMCI stores that information in `Local` objects with a start BCI and an end BCI (inclusive). > Currently the parser just uses BCI+length to compute the end BCI, leading to an off-by-one error. > > A simple test checking that the start and end BCIs are within the method's bytecode is added. It fails without the fix. This pull request has now been integrated. Changeset: 31ac8714 Author: Gilles Duboscq URL: https://git.openjdk.org/jdk/commit/31ac8714e0593f2feaa8e9ebaf32bab904ba6d11 Stats: 33 lines in 3 files changed: 29 ins; 1 del; 3 mod 8326692: JVMCI Local.endBci is off-by-one Reviewed-by: dnsimon, never, gli ------------- PR: https://git.openjdk.org/jdk/pull/18087 From epeter at openjdk.org Mon Mar 4 08:13:00 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 08:13:00 GMT Subject: RFR: 8318650: Optimized subword gather for x86 targets. [v17] In-Reply-To: References: Message-ID: On Sat, 2 Mar 2024 16:22:22 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features. 
>> >> Following is the summary of changes: >> >> 1) Intrinsify sub-word gather using a hybrid algorithm which first partially unrolls the scalar loop to accumulate values from gather indices into a quadword (64-bit) slice, followed by a vector permutation to place the slice into the appropriate vector lanes; this prevents code bloat and generates a compact JIT sequence. This, coupled with savings from the expensive array allocation in the existing Java implementation, translates into significant performance gains of 1.5-10x with the included micro-benchmark. >> >> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d) >> >> >> 2) The patch was also compared against a modified Java fallback implementation which replaces the temporary array allocation with a zero-initialized vector and a scalar loop that inserts gathered values into the vector. However, a vector insert into the higher vector lanes is a three-step process which first extracts the upper 128-bit vector lane, updates it with the gathered subword value and then inserts the lane back into its original position. This makes inserts into higher-order lanes costly compared to the proposed solution. In addition, the generated JIT code for the modified fallback implementation was very bulky, which may impact inlining decisions in caller contexts. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review resolutions. @jatin-bhateja thanks for all the work, this is a really nice feature! And thanks for bearing with all the comments! Testing up to commit 14 looks good. @PaulSandoz thanks for looking at the Vector API java code! ------------- Marked as reviewed by epeter (Reviewer).
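The "accumulate into a quadword slice" idea from the summary above can be made concrete with a hedged scalar sketch in plain Java (illustrative only; the real intrinsic emits this in assembly and then places each 64-bit slice into the destination vector with a permute):

```java
public class SubwordGatherSketch {
    // Accumulate four gathered 16-bit values into one 64-bit "slice"
    // (lane i of the slice holds arr[idx[base + i]]).
    static long gatherShortsIntoSlice(short[] arr, int[] idx, int base) {
        long slice = 0L;
        for (int i = 0; i < 4; i++) {
            long v = arr[idx[base + i]] & 0xFFFFL; // zero-extend the subword
            slice |= v << (16 * i);
        }
        return slice;
    }

    public static void main(String[] args) {
        short[] a = {10, 20, 30, 40, 50};
        int[] idx = {4, 2, 0, 1};
        long s = gatherShortsIntoSlice(a, idx, 0);
        // lane 0 = a[4] = 50, lane 1 = a[2] = 30, lane 2 = a[0] = 10, lane 3 = a[1] = 20
        System.out.println(s & 0xFFFFL);          // 50
        System.out.println((s >>> 16) & 0xFFFFL); // 30
        System.out.println((s >>> 32) & 0xFFFFL); // 10
        System.out.println((s >>> 48) & 0xFFFFL); // 20
    }
}
```

Packing into a 64-bit slice avoids the costly per-lane vector inserts described for the modified fallback above.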
PR Review: https://git.openjdk.org/jdk/pull/16354#pullrequestreview-1913610983 From chagedorn at openjdk.org Mon Mar 4 08:13:55 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 08:13:55 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: <85eDvI-w8-zdB4MVfI7I7sZ1M63kw4QqDND_2BqMv5w=.415e84ac-4afb-4c69-b4b6-f045dc67449b@github.com> On Fri, 1 Mar 2024 13:27:38 GMT, Christian Hagedorn wrote: > In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. > > #### Redo refactoring of `create_bool_from_template_assertion_predicate()` > On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). > > #### Share data graph cloning code - start from existing code > This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: > > 1. Collect data nodes to clone by using a node filter > 2. Clone the collected nodes (their data and control inputs still point to the old nodes) > 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. 
In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. > > #### Shared data graph cloning class > Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse them in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: > > 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] > 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] > > `... Thanks for the careful review! I could have added some more comments explaining what the follow-up refactoring will do, to help better understand some of the decisions in this patch. I've added replies and will update the PR shortly with the mentioned changes.
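The missing-visited-set problem called out in this RFR can be illustrated with a generic DFS sketch (plain Java, not the C2 code): without a visited set, shared inputs in a DAG are walked once per path, which grows exponentially in the worst case (and, with cycles through phis, never terminates).

```java
import java.util.*;

public class DfsVisited {
    // DFS over a node's inputs without a visited set: shared inputs
    // are re-processed once per path reaching them.
    static int visitsWithout(Map<Integer, List<Integer>> inputs, int node) {
        int count = 1;
        for (int in : inputs.getOrDefault(node, List.of())) {
            count += visitsWithout(inputs, in);
        }
        return count;
    }

    // Same walk with a visited set: every node is processed exactly once.
    static int visitsWith(Map<Integer, List<Integer>> inputs, int node, Set<Integer> visited) {
        if (!visited.add(node)) {
            return 0; // already processed
        }
        int count = 1;
        for (int in : inputs.getOrDefault(node, List.of())) {
            count += visitsWith(inputs, in, visited);
        }
        return count;
    }

    public static void main(String[] args) {
        // Diamond: 3 -> {1, 2}, 1 -> {0}, 2 -> {0}; node 0 is shared.
        Map<Integer, List<Integer>> inputs =
                Map.of(3, List.of(1, 2), 1, List.of(0), 2, List.of(0));
        System.out.println(visitsWithout(inputs, 3));              // 5: node 0 visited twice
        System.out.println(visitsWith(inputs, 3, new HashSet<>())); // 4: each node once
    }
}
```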
------------- PR Review: https://git.openjdk.org/jdk/pull/18080#pullrequestreview-1913506965 From chagedorn at openjdk.org Mon Mar 4 08:13:56 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 08:13:56 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 14:09:58 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/loopPredicate.cpp line 254: >> >>> 252: const Unique_Node_List nodes_with_same_ctrl = find_nodes_with_same_ctrl(start_node, old_uncommon_proj); >>> 253: DataNodeGraph data_node_graph(nodes_with_same_ctrl, this); >>> 254: auto& orig_to_new = data_node_graph.clone(new_uncommon_proj); >> >> This was a bit confusing. At first I thought you are cloning the `data_node_graph`, since the `auto` did not tell me that here we are getting a hash-table back. >> I wonder if this cloning should be done in the constructor of `DataNodeGraph`. > > The beauty of packing it into the constructor is that you have fewer lines here. And that is probably beneficial if you are going to use the class elsewhere -> less code duplication. Generally, I think there is this debate about how much work one should do in the constructor (minimal vs. maximal) and I guess there is no clear consensus. In the compiler code, we seem to tend more towards doing the work in the constructor. I agree that packing it all together to hide it from the user is quite nice. However, in this case here, `DataNodeGraph` is later extended (when refactoring `create_bool_from_template_assertion_predicate()` in JDK-8327110/8327111) to not only clone but also clone+transform opaque loop nodes (offering an additional method). This was the main reason I went with a separation here. 
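The minimal-vs-maximal constructor trade-off debated in this comment can be sketched in isolation (illustrative Java with made-up names, not the HotSpot C++ class):

```java
import java.util.*;

public class ConstructorStyles {
    // Maximal constructor: all work happens at construction; the result is
    // exposed through an accessor and the object cannot be misused twice.
    static final class EagerCloner {
        private final Map<String, String> origToNew = new HashMap<>();
        EagerCloner(List<String> nodes) {
            for (String n : nodes) { origToNew.put(n, n + "'"); }
        }
        Map<String, String> origToNew() { return Collections.unmodifiableMap(origToNew); }
    }

    // Minimal constructor: the work lives in a method, which then has to
    // guard (or at least document) that it is only invoked once.
    static final class LazyCloner {
        private final List<String> nodes;
        private boolean done = false;
        LazyCloner(List<String> nodes) { this.nodes = nodes; }
        Map<String, String> cloneNodes() {
            assert !done : "cloneNodes() must only be called once";
            done = true;
            Map<String, String> m = new HashMap<>();
            for (String n : nodes) { m.put(n, n + "'"); }
            return m;
        }
    }

    public static void main(String[] args) {
        System.out.println(new EagerCloner(List.of("a")).origToNew());  // {a=a'}
        System.out.println(new LazyCloner(List.of("a")).cloneNodes());  // {a=a'}
    }
}
```

The `LazyCloner` shape is the price of keeping the constructor minimal when extra methods are planned; the `EagerCloner` shape avoids the single-call question entirely.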
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510673858 From chagedorn at openjdk.org Mon Mar 4 08:13:59 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 08:13:59 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 14:11:17 GMT, Emanuel Peter wrote: >> In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. >> >> #### Redo refactoring of `create_bool_from_template_assertion_predicate()` >> On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). >> >> #### Share data graph cloning code - start from existing code >> This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: >> >> 1. Collect data nodes to clone by using a node filter >> 2. Clone the collected nodes (their data and control inputs still point to the old nodes) >> 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. 
In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. >> >> #### Shared data graph cloning class >> Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse them in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: >> >> 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] >> 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to ... > > src/hotspot/share/opto/loopPredicate.cpp line 255: > >> 253: DataNodeGraph data_node_graph(nodes_with_same_ctrl, this); >> 254: auto& orig_to_new = data_node_graph.clone(new_uncommon_proj); >> 255: fix_cloned_data_node_controls(old_uncommon_proj, new_uncommon_proj, orig_to_new); > > And is there a reason why `fix_cloned_data_node_controls` is not part of the `DataNodeGraph` class? Is there any use of the class where we don't have to call `fix_cloned_data_node_controls`? The way we fix the control inputs here is very specific to this code (I don't think we'll do something similar elsewhere). `create_bool_from_template_assertion_predicate()` only does cloning of non-pinned nodes and does not need to rewire controls - I think this code could be reused at other places as well but that could be cleaned up separately.
If we later refactor other cases which need to rewire the control nodes in a specific way, we could still try to move the code of `fix_cloned_data_node_controls()` inside `DataNodeGraph` and try to share it. > src/hotspot/share/opto/loopPredicate.cpp line 256: > >> 254: auto& orig_to_new = data_node_graph.clone(new_uncommon_proj); >> 255: fix_cloned_data_node_controls(old_uncommon_proj, new_uncommon_proj, orig_to_new); >> 256: Node** cloned_node_ptr = orig_to_new.get(start_node); > > Boah, this `**` is a bit nasty. It would have been nicer if there was a reference pass instead, which already checks that the element exists. That's indeed not that great. Maybe the hash table class should provide an extra function to get references/pointers back (I see why returning a pointer is useful when you directly store objects instead of pointers into the hash table) - not sure though if we should squeeze that into this PR. Maybe in a separate RFE? > src/hotspot/share/opto/loopPredicate.cpp line 265: > >> 263: void PhaseIdealLoop::fix_cloned_data_node_controls( >> 264: const ProjNode* old_uncommon_proj, Node* new_uncommon_proj, >> 265: const ResizeableResourceHashtable& orig_to_new) { > > Suggestion: > > const ResizeableResourceHashtable& orig_to_new) > { > > This might also help with understanding the indentation. But this is a taste question for sure. Will change with indentation fix.
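The clone-then-rewire scheme behind `orig_to_new` can be sketched generically (a plain Java toy graph; `Node` here is a made-up stand-in, not HotSpot's C2 node class):

```java
import java.util.*;

public class CloneRewire {
    static class Node {
        final String name;
        final List<Node> inputs = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // Step 1: clone every collected node; clones still point at the old inputs.
    // Step 2: rewire each clone's inputs to the cloned counterpart where one
    //         exists; inputs outside the cloned set (e.g. phis) are kept as-is.
    static Map<Node, Node> cloneAndRewire(List<Node> toClone) {
        Map<Node, Node> origToNew = new IdentityHashMap<>();
        for (Node n : toClone) {
            Node clone = new Node(n.name + "'");
            clone.inputs.addAll(n.inputs); // still the old inputs
            origToNew.put(n, clone);
        }
        for (Node clone : origToNew.values()) {
            clone.inputs.replaceAll(in -> origToNew.getOrDefault(in, in));
        }
        return origToNew;
    }

    public static void main(String[] args) {
        Node a = new Node("a");
        Node b = new Node("b");
        b.inputs.add(a);
        Node outside = new Node("outside");
        b.inputs.add(outside);
        Map<Node, Node> map = cloneAndRewire(List.of(a, b));
        System.out.println(map.get(b).inputs.get(0).name); // a' (rewired to the clone)
        System.out.println(map.get(b).inputs.get(1).name); // outside (not cloned, kept)
    }
}
```

The second loop is the analogue of the `iterate_all` rewiring pass quoted above: any input with no entry in the map is left pointing at the original node.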
> src/hotspot/share/opto/loopPredicate.cpp line 291: > >> 289: for (uint i = 1; i < next->req(); i++) { >> 290: Node* in = next->in(i); >> 291: if (!in->is_Phi()) { > > What happened with the `is_Phi`? Is it not needed anymore? See later comment in `DataNodeGraph`. > src/hotspot/share/opto/loopnode.hpp line 1921: > >> 1919: rewire_clones_to_cloned_inputs(); >> 1920: return _orig_to_new; >> 1921: } > > Currently, it looks like one could call `clone` multiple times. But I think that would be wrong, right? > That is why I'd put all the active logic in the constructor, and only the passive stuff is publicly accessible, with `const` to indicate that these don't have any effect. Yes, that would be unexpected, so I agree with you here. But as mentioned earlier, we need to add another method to this class later which does the cloning slightly differently, so we cannot do all the work in the constructor. We probably have multiple options here: - Do nothing (could be reasonable as this class is only used rarely and if it's used it's most likely uncommon to clone twice in a row on the same object - and if one does, one probably has a look at the class anyway to notice what to do). - Add asserts to ensure `clone()` is only called once (adds more code but could be a low overhead option - however, we should think about whether we really want to save the user from itself). - Return a copy of the hash table and clear it afterward (seems too much overhead for having no such use-case). I think option 1 and 2 are both fine. > src/hotspot/share/opto/loopopts.cpp line 4519: > >> 4517: _orig_to_new.iterate_all([&](Node* node, Node* clone) { >> 4518: for (uint i = 1; i < node->req(); i++) { >> 4519: Node** cloned_input = _orig_to_new.get(node->in(i)); > > You don't need to check for `is_Phi` on `node->in(i)` anymore? Could have added a comment here about the `is_Phi()` drop. The `DataNodeGraph` class already takes a node collection to clone. 
We therefore do not need to additionally check for `is_Phi()` here. If an input is a phi, it would not have been cloned in the first place because the node collection does not contain phis (L239): https://github.com/openjdk/jdk/blob/c00cc8ffaee9bf9b3278d84afba0af2ac00134de/src/hotspot/share/opto/loopPredicate.cpp#L231-L245 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510681911 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510686211 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510708206 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510696808 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510699583 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510721511 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510703408 From chagedorn at openjdk.org Mon Mar 4 08:13:59 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 08:13:59 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: <7kG-lqyCbXktVszrDRPmc2SkQYXxnqcG9HmCmR5YSCQ=.1bd20dcb-5285-4746-b579-502029c0d9bd@github.com> References: <7kG-lqyCbXktVszrDRPmc2SkQYXxnqcG9HmCmR5YSCQ=.1bd20dcb-5285-4746-b579-502029c0d9bd@github.com> Message-ID: On Fri, 1 Mar 2024 13:58:42 GMT, Emanuel Peter wrote: >> Suggestion for better name `CloneDataNodeGraph`. Do you assert that only data nodes are cloned, and no CFG nodes? > > Yes, you do verify it, great! > You could have a typedef for `ResizeableResourceHashtable`. Then you don't need to use `auto` for it elsewhere, and it is clear what it is. Suggestion: `OrigToNewHashtable`. Good idea. I'll add one. > The name could mention that we are cloning. And maybe you could do the work in the constructor, and just have accessors for the finished products, such as `_orig_to_new`. 
Suggestion for better name CloneDataNodeGraph. As mentioned earlier, we are later gonna reuse this class when refactoring `create_bool_from_template_assertion_predicate()`. For template assertion predicates we not only need to clone nodes but also need to transform the `OpaqueLoop*Nodes`. Therefore, I went with keeping the name of this class as `DataNodeGraph` and use `_orig_to_new` and not use `_orig_to_clone` since we could be transforming `OpaqueLoop*Nodes` in such a way that we replace it with existing nodes and not clones. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510704146 From epeter at openjdk.org Mon Mar 4 08:19:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 08:19:58 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: <85eDvI-w8-zdB4MVfI7I7sZ1M63kw4QqDND_2BqMv5w=.415e84ac-4afb-4c69-b4b6-f045dc67449b@github.com> References: <85eDvI-w8-zdB4MVfI7I7sZ1M63kw4QqDND_2BqMv5w=.415e84ac-4afb-4c69-b4b6-f045dc67449b@github.com> Message-ID: On Mon, 4 Mar 2024 08:10:17 GMT, Christian Hagedorn wrote: >> In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. >> >> #### Redo refactoring of `create_bool_from_template_assertion_predicate()` >> On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. 
This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). >> >> #### Share data graph cloning code - start from existing code >> This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: >> >> 1. Collect data nodes to clone by using a node filter >> 2. Clone the collected nodes (their data and control inputs still point to the old nodes) >> 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. >> >> #### Shared data graph cloning class >> Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: >> >> 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] >> 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to ... > > Thanks for the careful review! I could have added some more comments explaining, what the follow-up refactoring will do to help to better understand some of the decisions in this patch. I've added replies and will update the PR shortly with the mentionied changes. 
@chhagedorn ah ok, I see. I didn't quite realize how you were going to extend the code later before your comments. In that case you can of course leave the computations outside the constructor. We can still discuss the final shape of the code once you do the next RFE's on the same code :) I'll wait for your code updates to re-review, just ping me ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/18080#issuecomment-1975967965 From jkarthikeyan at openjdk.org Mon Mar 4 08:21:55 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 4 Mar 2024 08:21:55 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> Message-ID: On Sat, 2 Mar 2024 10:55:41 GMT, Emanuel Peter wrote: >> @eme64 Sure, here is the assembly for the baseline: https://gist.github.com/jaskarth/1fe6f00a5b37fe3efb0dd6a2d24840e0 >> And after: https://gist.github.com/jaskarth/99c56e2f081f996987b96d7e866aca6c >> >> I must have missed this originally when evaluating the benchmark, but looking at the assembly it seems like the baseline JDK creates a `CMove` for that ternary already. I made a quick patch to disable where `PhaseIdealLoop::conditional_move` is called, and the performance still stays the same on the benchmark. I've also attached that assembly if it's of interest: https://gist.github.com/jaskarth/7b12b688f82a3b8e854785f1827b0c20 > > @jaskarth It seems we were aware of such issues a long time ago: > https://bugs.openjdk.org/browse/JDK-8039104: Don't use Math.min/max intrinsic on x86 > So we may actually have use `if` for min/max instead of CMove, at least on some platforms. > But some platforms may have worse branch predictors, and then we should use CMove more often. Hey @eme64, first of all, I want to thank you for your detailed analysis, and the added benchmark! I hope to answer your questions below. 
> I would like to see a benchmark where you get a regression with your patch if you removed the `PROB_UNLIKELY_MAG(2);` check, or at least make it much smaller. I would like to see if there is some breaking-point where branch prediction is actually faster. I think this is a good point as well, I'll try to design a benchmark for this. > You seem to have discovered that your last example was already converted to CMove. What cases does your code cover that are not already covered by the `PhaseIdealLoop::conditional_move` logic? In my benchmark, I found that `testSingleInt` wasn't turned into a `CMove`, but after some more investigation I think this is because of a mistake in my benchmark. I mistakenly select the 0th element of the arrays every time, when I should be randomly selecting indices to prevent a side of the branch from being optimized out. When that change is made, it produces a CMove. I also recall finding cases earlier where CMoves in loops weren't created, but I think this must have been before [JDK-8319451](https://bugs.openjdk.org/browse/JDK-8319451) was integrated. I'll keep searching for more cases, but I tried a few examples and couldn't really find any where a minmax was made but the CMove wasn't. > I think that the CMove logic kicks in for most loops, though maybe not all cases? Would be interesting to know which of your cases were already done by CMove, and which not. And why. I think looking at the code in `PhaseIdealLoop::conditional_move`, the primary difference is that CMove code has an additional cost metric for loops, whereas is_minmax only has the `PROB_UNLIKELY_MAG(2)` check that the CMove logic uses when not in a loop. I think this might potentially lead to minmax transforming cases in loops that `CMove` might not have - but that may not necessarily be desirable. > One more general issue: So far you have only shown that your optimization leads to speedups in conjunction with auto-vectorization.
Do you have any examples which get speedups without auto-vectorization? The thing is: I do hope to do if-conversion in auto-vectorization. Hence, it would be nice to know that your optimization has benefits in cases where if-conversion does not apply. I think the primary benefit of this optimization in straight-line code is the tightened bounds of the Min/Max node as compared to the equivalent Phi or `CMove`. If we have `CMove(0, int_bottom)` then its type would be `int_bottom`, as it does a meet over the operands. But if it were a Max instead, its type would be `[0, int_max]`, which is a sharper type. As an example:

    int b = ...;           // int_bottom
    int c = b < 0 ? 0 : b;
    if (c < 0) {
        ...;               // dead code
    }

This example is a bit contrived, but previously that branch would not have been pruned. I found this kind of optimization hard to look for, so I added a temporary field to MaxNode that would only be set to true when MaxNodes were created by is_minmax, and dumped the results of `MaxINode::add_ring` when it was called with the field as true. When running the test suite, I saw there were many cases where this transform was able to create a better type for its operands than an equivalent cmove or phi, and in some cases it was even able to statically determine the operation to be a constant value. > When I ran `make test TEST="micro:IfMinMax" CONF=linux-x64 MICRO="OPTIONS=-prof perfasm"` and checked the generated assembly, I did not find any vector instructions. Could it be that `SIZE=300` is too small? I generally use vector sizes in the range of `10_000`, just to make sure it vectorizes. Maybe it is because I have an avx512 machine with 64-byte registers, compared to 32-byte registers for AVX2? Not sure. That is interesting, as when running that command I see vectorization on my machine, at least with `testVector*`. `testReduction*` still needs that patch you linked earlier to work.
I will increase the iteration count as suggested though, in case that is the cause of the discrepancy. > Let me know what you think. Not sure if this regression is important enough, but we need to consider what to do about your patch, as well as the CMove logic that already exists. I think it is definitely worth considering this regression, as I ideally want to minimize regressions altogether. With a bit of further reflection on all this, I think it might be best if this patch was changed so that it acts on `CMove` directly, as @merykitty suggested earlier. This would mean we wouldn't need to approximate the `CMove` heuristic in `is_minmax`, and that we would see benefits in tandem with improvements to our `CMove` heuristic. That way if the `CMove` heuristic was changed later to take into account the cost behind the cmp, it would also fix this case. Do you have any thoughts on this @eme64 (and @merykitty)? ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1975971570 From chagedorn at openjdk.org Mon Mar 4 08:41:06 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 08:41:06 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v2] In-Reply-To: References: Message-ID: > In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. > > #### Redo refactoring of `create_bool_from_template_assertion_predicate()` > On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously.
We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). > > #### Share data graph cloning code - start from existing code > This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: > > 1. Collect data nodes to clone by using a node filter > 2. Clone the collected nodes (their data and control inputs still point to the old nodes) > 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. > > #### Shared data graph cloning class > Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: > > 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] > 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] > > `... 
Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Review Emanuel: Add typedef and replace usages, format lambda, some renaming ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18080/files - new: https://git.openjdk.org/jdk/pull/18080/files/c00cc8ff..a569132e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=00-01 Stats: 18 lines in 4 files changed: 3 ins; 2 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/18080.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18080/head:pull/18080 PR: https://git.openjdk.org/jdk/pull/18080 From epeter at openjdk.org Mon Mar 4 09:02:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 09:02:58 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> Message-ID: <_yIQLmJFXOolbLAS8Wcxgl1juRlQwB0OWkKd8ZMcfmg=.9ed4a52d-9ffb-45eb-a0dc-7b3201974882@github.com> On Mon, 4 Mar 2024 08:18:41 GMT, Jasmine Karthikeyan wrote: >> @jaskarth It seems we were aware of such issues a long time ago: >> https://bugs.openjdk.org/browse/JDK-8039104: Don't use Math.min/max intrinsic on x86 >> So we may actually have to use `if` for min/max instead of CMove, at least on some platforms. >> But some platforms may have worse branch predictors, and then we should use CMove more often. > > Hey @eme64, first of all, I want to thank you for your detailed analysis, and the added benchmark! I hope to answer your questions below. > >> I would like to see a benchmark where you get a regression with your patch if you removed the `PROB_UNLIKELY_MAG(2);` check, or at least make it much smaller. I would like to see if there is some breaking-point where branch prediction is actually faster.
> > I think this is a good point as well, I'll try to design a benchmark for this. > >> You seem to have discovered that your last example was already converted to CMove. What cases does your code cover that are not already covered by the `PhaseIdealLoop::conditional_move` logic? > > In my benchmark, I found that `testSingleInt` wasn't turned into a `CMove`, but after some more investigation I think this is because of a mistake in my benchmark. I mistakenly select the 0th element of the arrays every time, when I should be randomly selecting indices to prevent a side of the branch from being optimized out. When that change is made, it produces a CMove. I also recall finding cases earlier where CMoves in loops weren't created, but I think this must have been before [JDK-8319451](https://bugs.openjdk.org/browse/JDK-8319451) was integrated. I'll keep searching for more cases, but I tried a few examples and couldn't really find any where a minmax was made but the CMove wasn't. > >> I think that the CMove logic kicks in for most loops, though maybe not all cases? Would be interesting to know which of your cases were already done by CMove, and which not. And why. > > I think looking at the code in `PhaseIdealLoop::conditional_move`, the primary difference is that the CMove code has an additional cost metric for loops, whereas is_minmax only has the `PROB_UNLIKELY_MAG(2)` check that the CMove logic uses when not in a loop. I think this might potentially lead to minmax transforming cases in loops that `CMove` might not have, but that may not necessarily be desirable. > >> One more general issue: So far you have only shown that your optimization leads to speedups in conjunction with auto-vectorization. Do you have any examples which get speedups without auto-vectorization? > The thing is: I do hope to do if-conversion in auto-vectorization. Hence, it would be nice to know that your optimization has benefits in cases where if-conversion does not apply.
> > I think the primary benefit of this optimization in straight-line code is the tightened bounds of the Min/Max node as compared to the equivalent Ph... @jaskarth > With a bit of further reflection on all this, I think it might be best if this patch was changed so that it acts on CMove directly You mean you would be matching for a `Cmp -> CMove` node pattern that is equivalent to `Min/Max`, rather than matching a `Cmp -> If -> Phi` pattern? I guess that would allow you to get better types, without having to deal with all the CMove-vs-branch-prediction heuristics. BTW, I watched a fascinating talk about branch-predictors / branchless code yesterday: `Branchless Programming in C++ - Fedor Pikus - CppCon 2021` https://www.youtube.com/watch?v=g-WPhYREFjk My conclusion from that: it is really hard to say ahead of time if the branch-predictor is successful. It depends on how predictable a condition is. The branch-predictor can see patterns (like alternating true-false). So even if a probability is 50% on a branch, it may be fully predictable, and branching code is much more efficient than branchless code. But in totally random cases, branchless code may be faster because you will have a large percentage of mispredictions, and mispredictions are expensive. But in both cases you would see `iff->_prob = 0.5`. Really what we would need is profiling that checks how often a branch was `mispredicted`, and not how often it was `taken`. But I'm not sure if we can even get that profiling data. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1976054500 From galder at openjdk.org Mon Mar 4 09:12:12 2024 From: galder at openjdk.org (Galder Zamarreño) Date: Mon, 4 Mar 2024 09:12:12 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v6] In-Reply-To: References: Message-ID: > Adding C1 intrinsic for primitive array clone invocations for aarch64 and x86 architectures.
> > The intrinsic includes a change to avoid zeroing the newly allocated array because its contents are copied over within the same intrinsic with arraycopy. This means that the performance of primitive array clone exceeds that of primitive array copy. As an example, here are the microbenchmark results on darwin/aarch64: > > > $ make test TEST="micro:java.lang.ArrayClone" MICRO="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 3.476 ± 0.018 ns/op > ArrayClone.byteArraycopy 10 avgt 15 3.740 ± 0.017 ns/op > ArrayClone.byteArraycopy 100 avgt 15 7.124 ± 0.010 ns/op > ArrayClone.byteArraycopy 1000 avgt 15 39.301 ± 0.106 ns/op > ArrayClone.byteClone 0 avgt 15 3.478 ± 0.008 ns/op > ArrayClone.byteClone 10 avgt 15 3.562 ± 0.007 ns/op > ArrayClone.byteClone 100 avgt 15 5.888 ± 0.206 ns/op > ArrayClone.byteClone 1000 avgt 15 25.762 ± 0.203 ns/op > ArrayClone.intArraycopy 0 avgt 15 3.199 ± 0.016 ns/op > ArrayClone.intArraycopy 10 avgt 15 4.521 ± 0.008 ns/op > ArrayClone.intArraycopy 100 avgt 15 17.429 ± 0.039 ns/op > ArrayClone.intArraycopy 1000 avgt 15 178.432 ± 0.777 ns/op > ArrayClone.intClone 0 avgt 15 3.406 ± 0.016 ns/op > ArrayClone.intClone 10 avgt 15 4.272 ± 0.006 ns/op > ArrayClone.intClone 100 avgt 15 13.110 ± 0.122 ns/op > ArrayClone.intClone 1000 avgt 15 113.196 ± 13.400 ns/op > > > It also includes an optimization to avoid instantiating the array copy stub in scenarios like this. > > I ran the hotspot compiler tests successfully, limiting them to C1 compilation on darwin/aarch64, linux/x86_64 and linux/686. E.g. > > > $ make test TEST="hotspot_compiler" JTREG="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" > ... 
> TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg:hotspot_compiler 1234 1234 0 0 > > > One question I had is what to do about non-primitive object arrays, see my [question](https://bugs.openjdk.org/browse/JDK-8302850?focusedId=14634879&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14634879) on the issue. @cl4es any thoughts? > > Thanks @rwestrel for his help shaping this up :) Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits: - Merge branch 'master' into topic.0131.c1-array-clone - Reserve necessary frame map space for clone use cases - 8302850: C1 primitive array clone intrinsic in graph * Combine array length, new type array and arraycopy for clone in c1 graph. * Add OmitCheckFlags to skip arraycopy checks. * Instantiate ArrayCopyStub only if necessary. * Avoid zeroing newly created arrays for clone. * Add array null after c1 clone compilation test. * Pass force reexecute to intrinsic via value stack. This is needed to be able to deoptimize correctly this intrinsic. * When new type array or array copy are used for the clone intrinsic, their state needs to be based on the state before for deoptimization to work as expected. - Revert "8302850: Primitive array copy C1 intrinsic for aarch64 and x86" This reverts commit fe5d916724614391a685bbef58ea939c84197d07. - 8302850: Link code emit infos for null check and alloc array - 8302850: Null check array before getting its length * Added a jtreg test to verify the null check works. Without the fix this test fails with a SEGV crash. - 8302850: Force reexecuting clone in case of a deoptimization * Copy state including locals for clone so that reexecution works as expected. - 8302850: Avoid instantiating array copy stub for clone use cases - 8302850: Primitive array copy C1 intrinsic for aarch64 and x86 * Clone calls that involve Phi nodes are not supported. 
* Add unimplemented stubs for other platforms. ------------- Changes: https://git.openjdk.org/jdk/pull/17667/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17667&range=05 Stats: 218 lines in 16 files changed: 184 ins; 4 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/17667.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17667/head:pull/17667 PR: https://git.openjdk.org/jdk/pull/17667 From chagedorn at openjdk.org Mon Mar 4 09:24:06 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 09:24:06 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v3] In-Reply-To: References: Message-ID: > In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. > > #### Redo refactoring of `create_bool_from_template_assertion_predicate()` > On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). > > #### Share data graph cloning code - start from existing code > This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: > > 1. Collect data nodes to clone by using a node filter > 2. 
Clone the collected nodes (their data and control inputs still point to the old nodes) > 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. > > #### Shared data graph cloning class > Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: > > 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] > 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] > > `... 
Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Remove dead declaration ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18080/files - new: https://git.openjdk.org/jdk/pull/18080/files/a569132e..79b8b270 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18080.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18080/head:pull/18080 PR: https://git.openjdk.org/jdk/pull/18080 From epeter at openjdk.org Mon Mar 4 09:36:55 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 09:36:55 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v3] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:24:06 GMT, Christian Hagedorn wrote: >> In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. >> >> #### Redo refactoring of `create_bool_from_template_assertion_predicate()` >> On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). 
>> >> #### Share data graph cloning code - start from existing code >> This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: >> >> 1. Collect data nodes to clone by using a node filter >> 2. Clone the collected nodes (their data and control inputs still point to the old nodes) >> 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. >> >> #### Shared data graph cloning class >> Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: >> >> 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] >> 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to ... 
> > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Remove dead declaration Nice, looks better already :) src/hotspot/share/opto/loopnode.hpp line 38: > 36: class BaseCountedLoopEndNode; > 37: class CountedLoopNode; > 38: class DataInputGraph; Suggestion: src/hotspot/share/opto/loopopts.cpp line 4500: > 4498: > 4499: // Clone all nodes in _data_nodes. > 4500: void DataNodeGraph::clone_nodes(Node* new_ctrl) { Suggestion: void DataNodeGraph::clone_data_nodes(Node* new_ctrl) { Then the comment would be obsolete src/hotspot/share/opto/replacednodes.cpp line 211: > 209: } > 210: // Map from current node to cloned/replaced node > 211: OrigToNewHashtable clones(hash_table_size, hash_table_size); Nice. Not your problem here. But should there not be a ResourceMark before this hashtable? There is one at the beginning of the function, but we create many of these hashtables in a loop, without any ResourceMarks in between reclaiming the memory... Ah, but then the hashmaps and stack/to_fix etc. would allocate from the ResourceArea, but start at different ResourceMarks... bad idea. Hmm. ------------- Changes requested by epeter (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/18080#pullrequestreview-1913808738 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510842256 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510845281 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510859503 From epeter at openjdk.org Mon Mar 4 09:36:55 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 09:36:55 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v3] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:21:31 GMT, Emanuel Peter wrote: >> Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove dead declaration > > src/hotspot/share/opto/loopnode.hpp line 38: > >> 36: class BaseCountedLoopEndNode; >> 37: class CountedLoopNode; >> 38: class DataInputGraph; > > Suggestion: Is also dead ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510852262 From epeter at openjdk.org Mon Mar 4 09:36:56 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 09:36:56 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v3] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 08:00:04 GMT, Christian Hagedorn wrote: >> src/hotspot/share/opto/loopnode.hpp line 1921: >> >>> 1919: rewire_clones_to_cloned_inputs(); >>> 1920: return _orig_to_new; >>> 1921: } >> >> Currently, it looks like one could call `clone` multiple times. But I think that would be wrong, right? >> That is why I'd put all the active logic in the constructor, and only the passive stuff is publicly accessible, with `const` to indicate that these don't have any effect. > > Yes, that would be unexpected, so I agree with you here. 
But as mentioned earlier, we need to add another method to this class later which does the cloning slightly differently, so we cannot do all the work in the constructor. > > We probably have multiple options here: > - Do nothing (could be reasonable as this class is only used rarely and if it's used it's most likely uncommon to clone twice in a row on the same object - and if one does, one probably has a look at the class anyway to notice what to do). > - Add asserts to ensure `clone()` is only called once (adds more code but could be a low overhead option - however, we should think about whether we really want to save the user from themselves). > - Return a copy of the hash table and clear it afterward (seems too much overhead for having no such use-case). > > I think options 1 and 2 are both fine. Could you add asserts that `_orig_to_new` is empty before we clone? That would be a check that nothing was cloned yet, and that we do not accidentally mix up two clone operations. >> src/hotspot/share/opto/loopopts.cpp line 4519: >> >>> 4517: _orig_to_new.iterate_all([&](Node* node, Node* clone) { >>> 4518: for (uint i = 1; i < node->req(); i++) { >>> 4519: Node** cloned_input = _orig_to_new.get(node->in(i)); >> >> You don't need to check for `is_Phi` on `node->in(i)` anymore? > > Could have added a comment here about the `is_Phi()` drop. The `DataNodeGraph` class already takes a node collection to clone. We therefore do not need to additionally check for `is_Phi()` here. If an input is a phi, it would not have been cloned in the first place because the node collection does not contain phis (L239): > > https://github.com/openjdk/jdk/blob/c00cc8ffaee9bf9b3278d84afba0af2ac00134de/src/hotspot/share/opto/loopPredicate.cpp#L231-L245 Got it, great. I just looked for a matching `is_Phi` in your diff and did not find it. But it is already covered in existing code, great! 
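To make the assertion idea above concrete, here is a hypothetical sketch of option 2 in plain C++ (illustrative only; plain STL stands in for `OrigToNewHashtable`, and this is not HotSpot code):

```cpp
// Sketch of a one-shot clone(): assert that the old->new map is still
// empty so clone() cannot be called twice on the same object.
#include <cassert>
#include <unordered_map>
#include <vector>

class CloneOnce {
    std::unordered_map<int, int> _orig_to_new; // node ids, for brevity
public:
    const std::unordered_map<int, int>& clone(const std::vector<int>& nodes) {
        assert(_orig_to_new.empty() && "clone() must only be called once");
        for (int n : nodes) {
            _orig_to_new.emplace(n, n + 1000); // placeholder "cloned id"
        }
        return _orig_to_new;
    }
};
```

With assertions enabled, a second `clone()` call on the same object trips the guard immediately instead of silently mixing up two clone operations.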
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510862784 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510847099 From epeter at openjdk.org Mon Mar 4 09:40:56 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 09:40:56 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> <2kIktGLLNDfXbXLEdk1nIKAhMK4_aoGTJAmjgoZj_2k=.ee1a00e8-95e2-4a42-b7d3-4bdb82d981a9@github.com> Message-ID: On Thu, 29 Feb 2024 16:00:15 GMT, Andrew Haley wrote: >> Ah, one more thing: what about a JMH benchmark where you can show off how much this optimization improves runtime? ;) > >> Ah, one more thing: what about a JMH benchmark where you can show off how much this optimization improves runtime? ;) > > We already have benchmarks, but the biggest win due to this change is the opportunity to reduce the load on the scoped value cache. > > At present, high performance depends on the per-thread cache, which is a 16-element OOP array. This is a fairly heavyweight structure for virtual threads, which otherwise have a very small heap footprint. With this optimization I think I can shrink the cache without significant loss of performance in most cases. I might also be able to move this cache to the carrier thread. > > So, this patch significantly moves the balance point in the space/speed tradeoff. @theRealAph that is exciting! It's a bit scary to have over 2000 lines for this optimization, makes it quite hard to review. But let's keep working on it ? 
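For context, the per-thread cache referred to above is a 16-element array; a toy model of such a cache (the probing scheme, names, and types here are assumptions for illustration, not the real JDK data structure) looks like:

```cpp
// Toy model of a small per-thread scoped-value cache: a fixed array of
// key/value slots, indexed by a hash of the key; a miss falls back to
// a (not modeled) slow path.
#include <array>
#include <cstddef>
#include <cstdint>

struct CacheEntry {
    const void* key   = nullptr;
    const void* value = nullptr;
};

struct ScopedValueCacheModel {
    std::array<CacheEntry, 16> slots{};

    static std::size_t slot_of(const void* key) {
        // toy hash: fold the pointer bits down to an index in [0, 16)
        auto bits = reinterpret_cast<std::uintptr_t>(key);
        return (bits ^ (bits >> 4)) & 15;
    }
    const void* get(const void* key) const {
        const CacheEntry& e = slots[slot_of(key)];
        return e.key == key ? e.value : nullptr; // nullptr => slow path
    }
    void put(const void* key, const void* value) {
        slots[slot_of(key)] = CacheEntry{key, value};
    }
};
```

Shrinking `slots` below 16 entries trades a higher miss rate (more slow-path lookups) for a smaller per-thread footprint, which is the balance point discussed above.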
------------- PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-1976131672 From chagedorn at openjdk.org Mon Mar 4 10:07:24 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 10:07:24 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v4] In-Reply-To: References: Message-ID: > In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. > > #### Redo refactoring of `create_bool_from_template_assertion_predicate()` > On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). > > #### Share data graph cloning code - start from existing code > This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: > > 1. Collect data nodes to clone by using a node filter > 2. Clone the collected nodes (their data and control inputs still point to the old nodes) > 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. 
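The collect/clone/rewire steps listed above can be sketched outside of HotSpot as follows (names and data structures are illustrative only; they do not match HotSpot's actual `Node` or `DataNodeGraph` classes, and the collection step 1 is assumed to have already happened):

```cpp
// Minimal sketch: step 2 clones each collected node (inputs still point
// at the originals), step 3 rewires each cloned input through the
// old->new map; inputs outside the collected set stay shared.
#include <string>
#include <unordered_map>
#include <vector>

struct Node {
    std::string label;
    std::vector<Node*> inputs;
};

std::unordered_map<Node*, Node*> clone_subgraph(const std::vector<Node*>& collected) {
    std::unordered_map<Node*, Node*> orig_to_new;
    for (Node* n : collected) {
        orig_to_new[n] = new Node{n->label, n->inputs}; // old inputs for now
    }
    for (auto& entry : orig_to_new) {
        for (Node*& in : entry.second->inputs) {
            auto it = orig_to_new.find(in);
            if (it != orig_to_new.end()) {
                in = it->second; // redirect to the cloned input
            }
        }
    }
    return orig_to_new;
}
```

A cloned node whose input was also collected ends up pointing at the clone of that input; any other input is left pointing at the original graph, which mirrors why the collection filter decides exactly what gets duplicated.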
> > #### Shared data graph cloning class > Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: > > 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] > 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] > > `... Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: remove useless declaration, clone_nodes -> clone_data_nodes, add assertion to prevent double-usage of clone() ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18080/files - new: https://git.openjdk.org/jdk/pull/18080/files/79b8b270..14b46ba6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=02-03 Stats: 6 lines in 2 files changed: 1 ins; 2 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18080.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18080/head:pull/18080 PR: https://git.openjdk.org/jdk/pull/18080 From epeter at openjdk.org Mon Mar 4 10:10:46 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 10:10:46 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v4] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 
10:07:24 GMT, Christian Hagedorn wrote: >> In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. >> >> #### Redo refactoring of `create_bool_from_template_assertion_predicate()` >> On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). >> >> #### Share data graph cloning code - start from existing code >> This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: >> >> 1. Collect data nodes to clone by using a node filter >> 2. Clone the collected nodes (their data and control inputs still point to the old nodes) >> 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. >> >> #### Shared data graph cloning class >> Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. 
We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: >> >> 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] >> 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to ... > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > remove useless declaration, clone_nodes -> clone_data_nodes, add assertion to prevent double-usage of clone() Nice refactoring, looking forward to your next PRs on this ;) ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18080#pullrequestreview-1913911663 From chagedorn at openjdk.org Mon Mar 4 10:13:54 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 10:13:54 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v4] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 10:07:24 GMT, Christian Hagedorn wrote: >> In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. >> >> #### Redo refactoring of `create_bool_from_template_assertion_predicate()` >> On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. 
We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). >> >> #### Share data graph cloning code - start from existing code >> This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: >> >> 1. Collect data nodes to clone by using a node filter >> 2. Clone the collected nodes (their data and control inputs still point to the old nodes) >> 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. >> >> #### Shared data graph cloning class >> Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: >> >> 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] >> 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to ... 
> > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > remove useless declaration, clone_nodes -> clone_data_nodes, add assertion to prevent double-usage of clone() Thanks Emanuel for your review and comments! I'll send it out after this one goes in :-) ------------- PR Comment: https://git.openjdk.org/jdk/pull/18080#issuecomment-1976226338 From duke at openjdk.org Mon Mar 4 10:57:51 2024 From: duke at openjdk.org (kuaiwei) Date: Mon, 4 Mar 2024 10:57:51 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: On Fri, 1 Mar 2024 07:39:16 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platforms. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > clean for other architecture > For aarch64 do we also need to change *.m4 files? Anything in aarch64_vector_* files? > > I will run our testing with the current patch. I'm not clear about the m4 files. How do we use them? The build script of the JDK combines all the ad files into one single file and adlc compiles it. So aarch64_vector.ad will be checked as well.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1976309600 From duke at openjdk.org Mon Mar 4 11:07:52 2024 From: duke at openjdk.org (kuaiwei) Date: Mon, 4 Mar 2024 11:07:52 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: On Fri, 1 Mar 2024 21:05:11 GMT, Vladimir Kozlov wrote: > My testing shows that when we do **cross compilation** on linux-x64 I got: > > ``` > Warning: unused operand (no_rax_RegP) > ``` > > Normal linux-x64 build passed. > > The operand is used only in one place in ZGC barriers code: `src/hotspot/cpu/x86/gc/z/z_x86_64.ad` Maybe it is not included during cross compilation. Does cross compilation disable the ZGC feature? In my test, it's used by ZGC and there is no warning about it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1976329450 From duke at openjdk.org Mon Mar 4 11:12:04 2024 From: duke at openjdk.org (kuaiwei) Date: Mon, 4 Mar 2024 11:12:04 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: References: Message-ID: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> > Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. > I tried to clean unused operands for all platforms. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it.
kuaiwei has updated the pull request incrementally with one additional commit since the last revision: move no_rax_RegP from x86_64.ad to z_x86_64.ad and comment out immLRot2 in arm_32.ad ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18075/files - new: https://git.openjdk.org/jdk/pull/18075/files/faa8f949..29514638 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18075&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18075&range=01-02 Stats: 33 lines in 3 files changed: 12 ins; 12 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/18075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18075/head:pull/18075 PR: https://git.openjdk.org/jdk/pull/18075 From duke at openjdk.org Mon Mar 4 11:12:05 2024 From: duke at openjdk.org (kuaiwei) Date: Mon, 4 Mar 2024 11:12:05 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: On Fri, 1 Mar 2024 23:29:53 GMT, Vladimir Kozlov wrote: > Testing build with moved `operand no_rax_RegP` passed. Please update changes with this: > > ``` > diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad > index d43929efd3e..aef3453b0b1 100644 > --- a/src/hotspot/cpu/x86/x86_64.ad > +++ b/src/hotspot/cpu/x86/x86_64.ad > @@ -2663,18 +2561,6 @@ operand rRegN() %{ > // the RBP is used as a proper frame pointer and is not included in ptr_reg. As a > // result, RBP is not included in the output of the instruction either. > > -operand no_rax_RegP() > -%{ > - constraint(ALLOC_IN_RC(ptr_no_rax_reg)); > - match(RegP); > - match(rbx_RegP); > - match(rsi_RegP); > - match(rdi_RegP); > - > - format %{ %} > - interface(REG_INTER); > -%} > - > // This operand is not allowed to use RBP even if > // RBP is not used to hold the frame pointer. 
> operand no_rbp_RegP() > diff --git a/src/hotspot/cpu/x86/gc/z/z_x86_64.ad b/src/hotspot/cpu/x86/gc/z/z_x86_64.ad > index d178805dfc7..0cc2ea03b35 100644 > --- a/src/hotspot/cpu/x86/gc/z/z_x86_64.ad > +++ b/src/hotspot/cpu/x86/gc/z/z_x86_64.ad > @@ -99,6 +99,18 @@ static void z_store_barrier(MacroAssembler& _masm, const MachNode* node, Address > > %} > > +operand no_rax_RegP() > +%{ > + constraint(ALLOC_IN_RC(ptr_no_rax_reg)); > + match(RegP); > + match(rbx_RegP); > + match(rsi_RegP); > + match(rdi_RegP); > + > + format %{ %} > + interface(REG_INTER); > +%} > + > // Load Pointer > instruct zLoadP(rRegP dst, memory mem, rFlagsReg cr) > %{ > ``` updated. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1976336819 From epeter at openjdk.org Mon Mar 4 11:53:07 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 11:53:07 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" Message-ID: Taking this over from @fg1417, she seems to be "on leave" according to her GitHub account. I'm taking her regression test (she improved the reproducer that I had originally reported the bug with). It is ok to just remove the assert, because the address is "sanitized" after the assert, i.e. we use a `lea` instruction to compute the address. But I'm simply removing the assert, as suggested in the comments of the previous [PR](https://github.com/openjdk/jdk/pull/16991#issuecomment-1962307596). 
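For context on why the removed assert was only informational: AArch64 loads and stores can directly encode an offset only if it is either a scaled unsigned 12-bit immediate (a non-negative multiple of the access size) or an unscaled signed 9-bit immediate; for anything else the backend materializes the address separately (the `lea` mentioned above), so an out-of-range offset is legal, just slightly slower. A simplified sketch of that check (an approximation for illustration, not the actual `offset_ok_for_immed` code in HotSpot):

```java
public class OffsetCheckSketch {
    // Approximates AArch64 LDR/STR addressing: an access of size
    // (1 << shift) bytes can encode its offset directly if it is either
    // a non-negative multiple of the access size that fits in an
    // unsigned 12-bit field after scaling, or any value in the signed
    // 9-bit unscaled range [-256, 255]. Otherwise the compiler must
    // compute the address into a register first.
    static boolean offsetOkForImmed(long offset, int shift) {
        long size = 1L << shift;
        boolean scaled = offset >= 0
                && offset % size == 0
                && (offset / size) < (1L << 12);
        boolean unscaled = offset >= -256 && offset <= 255;
        return scaled || unscaled;
    }

    public static void main(String[] args) {
        // 8-byte access (shift = 3): 32760 = 4095 * 8 is the largest scaled offset.
        System.out.println(offsetOkForImmed(32760, 3)); // true
        System.out.println(offsetOkForImmed(32768, 3)); // false: 4096 does not fit in imm12
        System.out.println(offsetOkForImmed(-8, 3));    // true: signed 9-bit unscaled range
        System.out.println(offsetOkForImmed(-512, 3));  // false either way
    }
}
```

So hitting the assert only ever meant C2 produced an offset the instruction cannot encode directly, which the backend already handles by computing the address into a register.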
------------- Commit messages: - remove the assert - 8319690 Changes: https://git.openjdk.org/jdk/pull/18103/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18103&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8319690 Stats: 176 lines in 2 files changed: 172 ins; 4 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18103.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18103/head:pull/18103 PR: https://git.openjdk.org/jdk/pull/18103 From epeter at openjdk.org Mon Mar 4 11:53:07 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 11:53:07 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:46:56 GMT, Emanuel Peter wrote: > Taking this over from @fg1417, she seems to be "on leave" according to her GitHub account. > > I'm taking her regression test (she improved the reproducer that I had originally reported the bug with). > > It is ok to just remove the assert, because the address is "sanitized" after the assert, i.e. we use a `lea` instruction to compute the address. > > But I'm simply removing the assert, as suggested in the comments of the previous [PR](https://github.com/openjdk/jdk/pull/16991#issuecomment-1962307596). @fg1417 wrote a first [PR](https://github.com/openjdk/jdk/pull/16991), but gave it up. I'm taking over the test, but not her fix. 
(I had found the original reproducer, but she improved the test further, so I want to give her credit for that) ------------- PR Comment: https://git.openjdk.org/jdk/pull/18103#issuecomment-1976168583 From galder at openjdk.org Mon Mar 4 12:12:43 2024 From: galder at openjdk.org (Galder Zamarreño) Date: Mon, 4 Mar 2024 12:12:43 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays In-Reply-To: References: Message-ID: On Thu, 8 Feb 2024 02:17:25 GMT, Dean Long wrote: >> Adding C1 intrinsic for primitive array clone invocations for aarch64 and x86 architectures. >> >> The intrinsic includes a change to avoid zeroing the newly allocated array because its contents are copied over within the same intrinsic with arraycopy. This means that the performance of primitive array clone exceeds that of primitive array copy. As an example, here are the microbenchmark results on darwin/aarch64: >> >> >> $ make test TEST="micro:java.lang.ArrayClone" MICRO="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 3.476 ± 0.018 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 3.740 ± 0.017 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 7.124 ± 0.010 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 39.301 ± 0.106 ns/op >> ArrayClone.byteClone 0 avgt 15 3.478 ± 0.008 ns/op >> ArrayClone.byteClone 10 avgt 15 3.562 ± 0.007 ns/op >> ArrayClone.byteClone 100 avgt 15 5.888 ± 0.206 ns/op >> ArrayClone.byteClone 1000 avgt 15 25.762 ± 0.203 ns/op >> ArrayClone.intArraycopy 0 avgt 15 3.199 ± 0.016 ns/op >> ArrayClone.intArraycopy 10 avgt 15 4.521 ± 0.008 ns/op >> ArrayClone.intArraycopy 100 avgt 15 17.429 ± 0.039 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 178.432 ± 0.777 ns/op >> ArrayClone.intClone 0 avgt 15 3.406 ± 0.016 ns/op >> ArrayClone.intClone 10 avgt 15 4.272 ± 0.006 ns/op >> ArrayClone.intClone 100 avgt 15 13.110 ± 0.122 ns/op >> ArrayClone.intClone 1000 avgt 15 113.196 ± 13.400 ns/op >> >> >> It also includes an optimization to avoid instantiating the array copy stub in scenarios like this. >> >> I ran the hotspot compiler tests successfully, limiting them to C1 compilation on darwin/aarch64, linux/x86_64 and linux/686. E.g. >> >> >> $ make test TEST="hotspot_compiler" JTREG="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> ... >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:hotspot_compiler 1234 1234 0 0 >> >> >> One question I had is what to do about non-primitive object arrays, see my [question](https://bugs.openjdk.org/browse/JDK-8302850?focusedId=14634879&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14634879) on the issue. @cl4es any thoughts? >>... > I think the right solution would be to add a line in `GraphBuilder::build_graph_for_intrinsic` for _clone, to append the IR for NewTypeArray and ArrayCopy as if we parsed newarray and arraycopy() from bytecodes. I'll see if I can get that working tomorrow. @dean-long any chance you could have another look at this? ------------- PR Comment: https://git.openjdk.org/jdk/pull/17667#issuecomment-1976440912 From roland at openjdk.org Mon Mar 4 12:35:56 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 4 Mar 2024 12:35:56 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> Message-ID: On Mon, 26 Feb 2024 14:04:06 GMT, Emanuel Peter wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> 32 bit build fix > > src/hotspot/share/opto/callGenerator.cpp line 854: > >> 852: >> 853: // Pattern matches: >> 854: // if ((objects = scopedValueCache()) != null) { > > Suggestion: > > // if (scopedValueCache() != null) { > > You don't use `objects` here, so it just confused me.
I use snippets from the Java code for `ScopedValue.get()` in the comments so it's easier to see what's being pattern matched. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1511095016 From chagedorn at openjdk.org Mon Mar 4 12:48:55 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 12:48:55 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v3] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:31:28 GMT, Emanuel Peter wrote: >> Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove dead declaration > > src/hotspot/share/opto/replacednodes.cpp line 211: > >> 209: } >> 210: // Map from current node to cloned/replaced node >> 211: OrigToNewHashtable clones(hash_table_size, hash_table_size); > > Nice. > Not your problem here. But should there not be a ResourceMark before this hashtable? There is one at the beginning of the function, but we create many of these hashtables in a loop, without any ResourceMarks in between reclaiming the memory... > Ah, but then the hashmaps and stack/to_fix etc would allocate from the ResourceArea, but start at different ResourceMarks... bad idea. Hmm. As discussed offline, we should probably go over all uses of resource allocated things like `Node_Lists`, `Node_Stack` etc. at some point and check if there are missing resource marks. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1511109952 From epeter at openjdk.org Mon Mar 4 13:32:16 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 13:32:16 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v13] In-Reply-To: References: Message-ID: > This is a feature requested by @RogerRiggs and @cl4es . > > **Idea** > > Merging multiple consecutive small stores (e.g.
8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. > > This patch here supports a few simple use-cases, like these: > > Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 > > Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 > > The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 > > **Details** > > This draft currently implements the optimization in an additional special IGVN phase: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 > > We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. 
We essentially try to establish a chain of mergable stores: > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 > > Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either both store constants, or adjacent segments of a larger value ... Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: - allow only array stores of same type as container - mismatched access test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16245/files - new: https://git.openjdk.org/jdk/pull/16245/files/8b3a2769..9e642aac Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=11-12 Stats: 77 lines in 3 files changed: 58 ins; 0 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/16245.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16245/head:pull/16245 PR: https://git.openjdk.org/jdk/pull/16245 From epeter at openjdk.org Mon Mar 4 13:45:49 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 13:45:49 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v4] In-Reply-To: References: Message-ID: On Mon, 29 Jan 2024 12:00:32 GMT, Tobias Hartmann wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Add diagnostic flag MergeStores > > Great work, Emanuel. > > I think this is a well encapsulated optimization for a supposedly common code pattern requested by core libraries folks. I agree with Vladimir, that it would be nice to support this as part of the autovectorizer but that is probably not going to happen anytime soon. 
Until then, going with this separate phase would allow us to add support (and tests) for additional code patterns if requests come in and potentially move this to the autovectorizer later. @TobiHartmann @vnkozlov I now check for `AryPtr`. And I think that just marking with "mismatched" must be sufficient. Because if you do an unsafe store with a different memory size, then it is just marked as "mismatched" too. So if I now trigger bugs with this patch, then the bugs were pre-existing and could have been created using unsafe. For example a `StoreB` on an int array: `UNSAFE.putByte(a, UNSAFE.ARRAY_INT_BASE_OFFSET + 3, (byte)0xf4);` `74 StoreB === 42 64 73 70 [[ 16 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=4; mismatched unsafe Memory: @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact[0] *, idx=4; !jvms: Test::test8 @ bci:73 (line 78)` There are no barriers around these stores. Of course that would be very different on fields. Fields end up on different slices, and hence you would have to be more careful there. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1976611894 From rcastanedalo at openjdk.org Mon Mar 4 15:21:02 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 4 Mar 2024 15:21:02 GMT Subject: RFR: 8327224: G1: comment in G1BarrierSetC2::post_barrier() refers to nonexistent new_deferred_store_barrier() Message-ID: This changeset updates a comment in `G1BarrierSetC2::post_barrier()` to point to the relevant code that must be kept in sync.
------------- Commit messages: - Update comment Changes: https://git.openjdk.org/jdk/pull/18108/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18108&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327224 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18108.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18108/head:pull/18108 PR: https://git.openjdk.org/jdk/pull/18108 From duke at openjdk.org Mon Mar 4 15:26:52 2024 From: duke at openjdk.org (ExE Boss) Date: Mon, 4 Mar 2024 15:26:52 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v13] In-Reply-To: References: Message-ID: <2bsO7BlvpkwQBZ8P19gqOVQQEXua1p7Glnl4WdUjn6g=.87e3fc82-6d37-45e2-ac0e-e69382732dc0@github.com> On Mon, 4 Mar 2024 13:32:16 GMT, Emanuel Peter wrote: >> This is a feature requested by @RogerRiggs and @cl4es . >> >> **Idea** >> >> Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. >> >> This patch here supports a few simple use-cases, like these: >> >> Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 >> >> Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e.
shifting and truncation), and directly store the variable: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 >> >> The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 >> >> **Details** >> >> This draft currently implements the optimization in an additional special IGVN phase: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 >> >> We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. We essentially try to establish a chain of mergable stores: >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 >> >> Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either bot... 
> > Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: > > - allow only array stores of same type as container > - mismatched access test Do we also have tests that a compiled method with merged stores like: static void storeLongLE(byte[] bytes, int offset, long value) { bytes[offset + 0] = (byte) (value >> 0); bytes[offset + 1] = (byte) (value >> 8); bytes[offset + 2] = (byte) (value >> 16); bytes[offset + 3] = (byte) (value >> 24); bytes[offset + 4] = (byte) (value >> 32); bytes[offset + 5] = (byte) (value >> 40); bytes[offset + 6] = (byte) (value >> 48); bytes[offset + 7] = (byte) (value >> 56); } still produces the correct result even when only a part of the stores fits into the array, e.g.: var arr = new byte[4]; try { // storeLongLE is already C2 compiled with merged stores: storeLongLE(arr, 0, -1L); throw new AssertionError("Expected ArrayIndexOutOfBoundsException"); } catch (ArrayIndexOutOfBoundsException _) { // ignore } assertTrue( Byte.toUnsignedInt(arr[0]) == 0xFF && Byte.toUnsignedInt(arr[1]) == 0xFF && Byte.toUnsignedInt(arr[2]) == 0xFF && Byte.toUnsignedInt(arr[3]) == 0xFF ); ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1976835427 From psandoz at openjdk.org Mon Mar 4 16:24:59 2024 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 4 Mar 2024 16:24:59 GMT Subject: RFR: 8318650: Optimized subword gather for x86 targets. [v17] In-Reply-To: References: Message-ID: On Sat, 2 Mar 2024 16:22:22 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features.
>> >> Following is the summary of changes: >> >> 1) Intrinsify sub-word gather using a hybrid algorithm which initially partially unrolls the scalar loop to accumulate values from gather indices into a quadword (64-bit) slice, followed by a vector permutation to place the slice into the appropriate vector lanes. This prevents code bloat and generates a compact JIT sequence. This, coupled with savings from the expensive array allocation in the existing Java implementation, translates into significant performance gains of 1.5-10x with the included micro. >> >> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d) >> >> >> 2) The patch was also compared against a modified Java fallback implementation that replaces the temporary array allocation with a zero-initialized vector and a scalar loop which inserts gathered values into the vector. But vector insertion into the higher vector lanes is a three-step process which first extracts the upper 128-bit vector lane, updates it with the gathered subword value, and then inserts the lane back into its original position. This makes inserts into higher-order lanes costly compared to the proposed solution. In addition, the generated JIT code for the modified fallback implementation was very bulky. This may impact inlining decisions in caller contexts. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review resolutions. Marked as reviewed by psandoz (Reviewer).
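As an aside for readers, the "accumulate values from gather indices into a quadword slice" step quoted above can be illustrated with plain scalar Java. This is only a sketch of the packing idea (the helper name is made up, not from the patch); the intrinsic builds such 64-bit slices in registers and then places them into vector lanes with a permutation:

```java
public class SubwordGatherSketch {
    // Packs eight gathered byte values into one little-endian 64-bit slice,
    // mirroring the partial scalar unrolling the quoted description mentions.
    static long gatherBytesToQuad(byte[] src, int[] indices, int start) {
        long slice = 0L;
        for (int i = 0; i < 8; i++) {
            long b = Byte.toUnsignedLong(src[indices[start + i]]);
            slice |= b << (8 * i); // byte lane i of the slice
        }
        return slice;
    }

    public static void main(String[] args) {
        byte[] src = {0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, (byte) 0x80};
        int[] idx = {7, 6, 5, 4, 3, 2, 1, 0}; // gather in reverse order
        System.out.printf("%016x%n", gatherBytesToQuad(src, idx, 0)); // prints 1020304050607080
    }
}
```

A full-width gather would then repeat this per 64-bit lane and combine the slices, which is what avoids both the temporary array of the Java fallback and the costly per-element inserts into upper lanes.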
------------- PR Review: https://git.openjdk.org/jdk/pull/16354#pullrequestreview-1914752388 From duke at openjdk.org Mon Mar 4 16:47:48 2024 From: duke at openjdk.org (Yuri Gaevsky) Date: Mon, 4 Mar 2024 16:47:48 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v2] In-Reply-To: References: Message-ID: <2RMDuzc9kW5ibUPCew-gaU_70OH_9c8hQYrrldWYrhQ=.6390d207-e989-48a0-a309-ccf9e6898018@github.com> On Thu, 25 Jan 2024 14:47:47 GMT, Yuri Gaevsky wrote: >> The patch adds the possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. >> >> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. > Yuri Gaevsky has updated the pull request incrementally with two additional commits since the last revision: > - num_8b_elems_in_vec --> nof_vec_elems > - Removed checks for (MaxVectorSize >= 16) per @RealFYang suggestion. "Please keep me active" comment. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-1977022513 From aph at openjdk.org Mon Mar 4 16:55:52 2024 From: aph at openjdk.org (Andrew Haley) Date: Mon, 4 Mar 2024 16:55:52 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:46:56 GMT, Emanuel Peter wrote: > Taking this over from @fg1417, she seems to be "on leave" according to her GitHub account. > > I'm taking her regression test (she improved the reproducer that I had originally reported the bug with). > > It is ok to just remove the assert, because the address is "sanitized" after the assert, i.e. we use a `lea` instruction to compute the address. > > But I'm simply removing the assert, as suggested in the comments of the previous [PR](https://github.com/openjdk/jdk/pull/16991#issuecomment-1962307596). Yes, that's right. I wrote that assertion for my own information; this isn't really a bug.
I might rework this whole area of the compiler in the future, but there's no urgency. Thanks. ------------- Marked as reviewed by aph (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18103#pullrequestreview-1914822754 From dchuyko at openjdk.org Mon Mar 4 17:36:19 2024 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Mon, 4 Mar 2024 17:36:19 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v28] In-Reply-To: References: Message-ID: > Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack. > > A matching directive will be applied at method compilation time when such a compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long-running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug and issues such a directive, but this does not affect the application behavior. In such a case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. > > It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypassing inlined methods). > > The natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization.
Prior to that, we can try to re-compile the method, letting the compile broker perform it while taking the new directives stack into account. Re-compilation helps to prevent hot methods from executing in the interpreter. > > A new flag `-r` has been introduced for some directive-related compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and submits methods that have any active non-default matching compiler directives for re-compilation if possible, otherwise it marks them for deoptimization. There is currently no distinction as to which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. > > In addition, a new diagnostic command `Compiler.replace_directives` has been added for ...
and 36 more: https://git.openjdk.org/jdk/compare/59529a92...a4578277 ------------- Changes: https://git.openjdk.org/jdk/pull/14111/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14111&range=27 Stats: 381 lines in 15 files changed: 348 ins; 3 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/14111.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14111/head:pull/14111 PR: https://git.openjdk.org/jdk/pull/14111 From kvn at openjdk.org Mon Mar 4 17:44:43 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 4 Mar 2024 17:44:43 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> References: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> Message-ID: On Mon, 4 Mar 2024 11:12:04 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > move no_rax_RegP from x86_64.ad to z_x86_64.ad and comment out immLRot2 in arm_32.ad What are latest changes (commented `operand immLRot2()`) in `arm_32.ad` for? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1977126409 From kvn at openjdk.org Mon Mar 4 17:44:43 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 4 Mar 2024 17:44:43 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: <7QZA7VSO1iYpL2tFdMDWlz7qmehaYtNDdsWDoWWbbEo=.5f7e8dd4-0d18-4bf7-9244-650c970e0166@github.com> On Mon, 4 Mar 2024 11:05:27 GMT, kuaiwei wrote: >> The operand is used only in one place in ZGC barriers code: src/hotspot/cpu/x86/gc/z/z_x86_64.ad May be it is not include during cross compilation. > Does the cross compilation disable zgc feature? In my test, it's used by zgc and no warning about it. I am not sure what happened there in our testing and did not have time to investigate. The patch worked and it was enough for me. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1977130438 From kvn at openjdk.org Mon Mar 4 17:58:53 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 4 Mar 2024 17:58:53 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> References: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> Message-ID: <-otSCbPTibs2l97RfYxBNnTGUjRnUft6UrIVk33woSQ=.c1604e01-0f8b-46ea-9cd8-809775f669b9@github.com> On Mon, 4 Mar 2024 11:12:04 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. 
> > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > move no_rax_RegP from x86_64.ad to z_x86_64.ad and comment out immLRot2 in arm_32.ad @dholmes-ora asked to disable warning by default and I agree. We can use `AD._disable_warnings` flag to guard these warnings and add corresponding `-w` flag to `adlc` command in `GensrcAdlc.gmk` ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1977152941 From kvn at openjdk.org Mon Mar 4 18:20:53 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 4 Mar 2024 18:20:53 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: On Mon, 4 Mar 2024 10:55:22 GMT, kuaiwei wrote: > I'm not clear about m4 file. How do we use it? > The build script of jdk will combine all ad file into one single file and adlc will compile it. So aarch64_vector.ad will be checked as well. Aarch64 m4 files are used to manually update .ad files. My concern was that could be overlapped code in m4 files which may overwrite your changes in .ad when someone do such manual update in a future. Fortunately `aarch64_ad.m4` does not have operand definitions so your changes are fine. But `aarch64_vector_ad.m4` has them so if we need to change `aarch64_vector.ad` we need to modify m4 file too. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1977189798 From kvn at openjdk.org Mon Mar 4 18:23:44 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 4 Mar 2024 18:23:44 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:46:56 GMT, Emanuel Peter wrote: > Taking this over from @fg1417, she seems to be "on leave" according to her GitHub account. 
> > I'm taking her regression test (she improved the reproducer that I had originally reported the bug with). > > It is ok to just remove the assert, because the address is "sanitized" after the assert, i.e. we use a `lea` instruction to compute the address. > > But I'm simply removing the assert, as suggested in the comments of the previous [PR](https://github.com/openjdk/jdk/pull/16991#issuecomment-1962307596). Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18103#pullrequestreview-1914996677 From epeter at openjdk.org Mon Mar 4 18:25:56 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 18:25:56 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v13] In-Reply-To: <2bsO7BlvpkwQBZ8P19gqOVQQEXua1p7Glnl4WdUjn6g=.87e3fc82-6d37-45e2-ac0e-e69382732dc0@github.com> References: <2bsO7BlvpkwQBZ8P19gqOVQQEXua1p7Glnl4WdUjn6g=.87e3fc82-6d37-45e2-ac0e-e69382732dc0@github.com> Message-ID: On Mon, 4 Mar 2024 15:24:23 GMT, ExE Boss wrote: >> Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: >> >> - allow only array stores of same type as container >> - mismatched access test > > Do we also have tests that a compiled method with merged stores like: > > static void storeLongLE(byte[] bytes, int offset, long value) { > bytes[offset + 0] = (byte) (value >> 0); > bytes[offset + 1] = (byte) (value >> 8); > bytes[offset + 2] = (byte) (value >> 16); > bytes[offset + 3] = (byte) (value >> 24); > bytes[offset + 4] = (byte) (value >> 32); > bytes[offset + 5] = (byte) (value >> 40); > bytes[offset + 6] = (byte) (value >> 48); > bytes[offset + 7] = (byte) (value >> 56); > } > > > still produce the correct result even when only a part of the stores fit into the array, e.g.: > > var arr = new byte[4]; > try { > // storeLongLE is already C2 compiled with merged stores: > storeLongLE(arr, 0, -1L); > > throw
new AssertionError("Expected ArrayIndexOutOfBoundsException"); > } catch (ArrayIndexOutOfBoundsException _) { > // ignore > } > > assertTrue( > Byte.toUnsignedInt(arr[0]) == 0xFF > && Byte.toUnsignedInt(arr[1]) == 0xFF > && Byte.toUnsignedInt(arr[2]) == 0xFF > && Byte.toUnsignedInt(arr[3]) == 0xFF > ); @ExE-Boss I am working on such a test, thanks for the suggestion! ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1977197535 From epeter at openjdk.org Mon Mar 4 18:33:09 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 18:33:09 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v13] In-Reply-To: <2bsO7BlvpkwQBZ8P19gqOVQQEXua1p7Glnl4WdUjn6g=.87e3fc82-6d37-45e2-ac0e-e69382732dc0@github.com> References: <2bsO7BlvpkwQBZ8P19gqOVQQEXua1p7Glnl4WdUjn6g=.87e3fc82-6d37-45e2-ac0e-e69382732dc0@github.com> Message-ID: On Mon, 4 Mar 2024 15:24:23 GMT, ExE Boss wrote: >> Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: >> >> - allow only array stores of same type as container >> - mismatched access test > > Do we also have tests that a compiled method with merged stores like: > > static void storeLongLE(byte[] bytes, int offset, long value) { > bytes[offset + 0] = (byte) (value >> 0); > bytes[offset + 1] = (byte) (value >> 8); > bytes[offset + 2] = (byte) (value >> 16); > bytes[offset + 3] = (byte) (value >> 24); > bytes[offset + 4] = (byte) (value >> 32); > bytes[offset + 5] = (byte) (value >> 40); > bytes[offset + 6] = (byte) (value >> 48); > bytes[offset + 7] = (byte) (value >> 56); > } > > > still produce the correct result even when only a part of the stores fit into the array, e.g.: > > var arr = new byte[4]; > try { > // storeLongLE is already C2 compiled with merged stores: > storeLongLE(arr, 0, -1L); > > throw new AssertionError("Expected ArrayIndexOutOfBoundsException"); > } catch
(ArrayIndexOutOfBoundsException _) { > // ignore > } > > assertTrue( > Byte.toUnsignedInt(arr[0]) == 0xFF > && Byte.toUnsignedInt(arr[1]) == 0xFF > && Byte.toUnsignedInt(arr[2]) == 0xFF > && Byte.toUnsignedInt(arr[3]) == 0xFF > ); @ExE-Boss I have an example, but the IR rules are not yet passing. Need to investigate tomorrow. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1977206515 From epeter at openjdk.org Mon Mar 4 18:33:09 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 18:33:09 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v14] In-Reply-To: References: Message-ID: > This is a feature requested by @RogerRiggs and @cl4es. > > **Idea** > > Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. > > This patch here supports a few simple use-cases, like these: > > Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 > > Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e.
shifting and truncation), and directly store the variable: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 > > The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 > > **Details** > > This draft currently implements the optimization in an additional special IGVN phase: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 > > We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. We essentially try to establish a chain of mergable stores: > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 > > Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either both store constants, or adjacent segments of a larger value ... 
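To make the merged-store pattern described above concrete, here is a minimal, self-contained Java sketch. The `storeLongLE` shape follows the example discussed in this thread; the `ByteBuffer` comparison is only an illustration of the little-endian equivalence the optimization relies on, not code from the patch itself.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class MergeStoresSketch {
    // Eight adjacent byte stores of shifted segments of one long value.
    // In the real pattern the stores are written out explicitly (as in the
    // quoted example); a merge-stores optimization can combine them into
    // a single 8-byte store because the bytes are adjacent and in
    // little-endian order.
    static void storeLongLE(byte[] bytes, int offset, long value) {
        for (int i = 0; i < 8; i++) {
            bytes[offset + i] = (byte) (value >> (8 * i));
        }
    }

    public static void main(String[] args) {
        long v = 0x1122334455667788L;
        byte[] a = new byte[8];
        storeLongLE(a, 0, v);

        // The merged form is equivalent to one little-endian long store:
        byte[] b = new byte[8];
        ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN).putLong(0, v);

        if (!Arrays.equals(a, b)) throw new AssertionError(Arrays.toString(a));
        System.out.println("ok");
    }
}
```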
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: WIP test with out of bounds exception ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16245/files - new: https://git.openjdk.org/jdk/pull/16245/files/9e642aac..638c80f4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=12-13 Stats: 227 lines in 2 files changed: 142 ins; 0 del; 85 mod Patch: https://git.openjdk.org/jdk/pull/16245.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16245/head:pull/16245 PR: https://git.openjdk.org/jdk/pull/16245 From kvn at openjdk.org Mon Mar 4 18:37:50 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 4 Mar 2024 18:37:50 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v14] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 18:33:09 GMT, Emanuel Peter wrote: >> This is a feature requested by @RogerRiggs and @cl4es. >> >> **Idea** >> >> Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. >> >> This patch here supports a few simple use-cases, like these: >> >> Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 >> >> Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e.
shifting and truncation), and directly store the variable: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 >> >> The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 >> >> **Details** >> >> This draft currently implements the optimization in an additional special IGVN phase: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 >> >> We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. We essentially try to establish a chain of mergable stores: >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 >> >> Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either bot... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > WIP test with out of bounds exception This looks good now. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/16245#pullrequestreview-1915022379 From sviswanathan at openjdk.org Mon Mar 4 20:24:56 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Mon, 4 Mar 2024 20:24:56 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> Message-ID: On Fri, 1 Mar 2024 06:09:30 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> Update description of Poly1305 algo > > src/hotspot/cpu/x86/assembler_x86.cpp line 9115: > >> 9113: >> 9114: void Assembler::vpunpcklqdq(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) { >> 9115: assert(UseAVX > 0, "requires some form of AVX"); > > Add appropriate AVX512VL assertion The VL assertion is already being done as part of vex_prefix_and_encode() and vex_prefix() so no need to add it here. That's why we don't have this assertion in any of the AVX instructions which are promotable to EVEX e.g. vpadd, vpsub, etc. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1511747533 From duke at openjdk.org Mon Mar 4 21:40:04 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 4 Mar 2024 21:40:04 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v11] In-Reply-To: References: Message-ID: > The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. > > This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) > > This PR shows up to 19x speedup on buffer sizes of 1MB.
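For context on the IFMA instructions reviewed in this thread, the per-lane semantics of `vpmadd52luq`/`vpmadd52huq` can be modeled in scalar Java. This is an illustrative sketch based on the published instruction descriptions; the names `madd52lo`/`madd52hi` are invented for the example and do not appear in the patch. The 52-bit limbs leave headroom in the 64-bit lanes to accumulate several partial products before carries are propagated, which is what makes these instructions attractive for Poly1305's mod 2^130-5 arithmetic.

```java
public class Ifma52Sketch {
    static final long MASK52 = (1L << 52) - 1;

    // vpmadd52luq: per 64-bit lane, dst += low 52 bits of the 104-bit
    // product of the low 52 bits of each source operand.
    static long madd52lo(long dst, long a, long b) {
        return dst + (((a & MASK52) * (b & MASK52)) & MASK52);
    }

    // vpmadd52huq: per 64-bit lane, dst += bits 52..103 of that product.
    static long madd52hi(long dst, long a, long b) {
        long x = a & MASK52, y = b & MASK52;
        long hi = Math.multiplyHigh(x, y); // bits 64..127 of the full product
        long lo = x * y;                   // bits 0..63
        return dst + ((hi << 12) | (lo >>> 52));
    }

    public static void main(String[] args) {
        // (2^26)^2 = 2^52: the low 52 bits are 0, and bit 52 of the
        // product lands in the "high" half as the value 1.
        if (madd52lo(0, 1L << 26, 1L << 26) != 0) throw new AssertionError();
        if (madd52hi(0, 1L << 26, 1L << 26) != 1) throw new AssertionError();
        System.out.println("ok");
    }
}
```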
Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: update asserts for vpmadd52l/hq ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17881/files - new: https://git.openjdk.org/jdk/pull/17881/files/b869d874..4a74a773 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17881&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17881&range=09-10 Stats: 8 lines in 1 file changed: 4 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/17881.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17881/head:pull/17881 PR: https://git.openjdk.org/jdk/pull/17881 From duke at openjdk.org Mon Mar 4 21:40:05 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 4 Mar 2024 21:40:05 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> Message-ID: <_6dorzq67KAZsTBHBvbQRDi_xW70bFhJudnxbG88m6I=.33e06bd5-d5fc-4ba8-b740-437155d567cf@github.com> On Fri, 1 Mar 2024 17:02:35 GMT, Sandhya Viswanathan wrote: >> src/hotspot/cpu/x86/assembler_x86.cpp line 5148: >> >>> 5146: assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); >>> 5147: InstructionMark im(this); >>> 5148: InstructionAttr attributes(vector_len, /* rex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true); >> >> uses_vl should be false here. >> >> BTW, this assertion looks very fuzzy, you are checking for two target features in one instruction, apparently, instruction is meant to use AVX512_IFMA only for 512 bit vector length, and for narrower vectors its needs AVX_IFMA. 
>> >> Lets either keep this strictly for AVX_IFMA for AVX512_IFMA we already have evpmadd52[l/h]uq, if you truly want to make this generic one then split the assertion >> >> `assert ( (avx_ifma && vector_len <= 256) || (avx512_ifma && (vector_len == 512 || VM_Version::support_vl())); >> ` >> >> And then you may pass uses_vl at true. > > It would be good to make this instruction generic. Please see the updated assert as suggested for vpmadd52[l/h]uq in the latest commit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1511830438 From duke at openjdk.org Mon Mar 4 21:40:05 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 4 Mar 2024 21:40:05 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> Message-ID: On Fri, 1 Mar 2024 08:16:38 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> Update description of Poly1305 algo > > src/hotspot/cpu/x86/assembler_x86.cpp line 5157: > >> 5155: void Assembler::vpmadd52luq(XMMRegister dst, XMMRegister src1, XMMRegister src2, int vector_len) { >> 5156: assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); >> 5157: InstructionAttr attributes(vector_len, /* rex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true); > > uses_vl should be false. Please see the updated assert as suggested for vpmadd52[l/h]uq in the latest commit. > src/hotspot/cpu/x86/assembler_x86.cpp line 5183: > >> 5181: assert(vector_len == AVX_512bit ? 
VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); >> 5182: InstructionMark im(this); >> 5183: InstructionAttr attributes(vector_len, /* rex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true); > > uses_vl should be false. Please see the updated assert as suggested for vpmadd52[l/h]uq in the latest commit. > src/hotspot/cpu/x86/assembler_x86.cpp line 5191: > >> 5189: >> 5190: void Assembler::vpmadd52huq(XMMRegister dst, XMMRegister src1, XMMRegister src2, int vector_len) { >> 5191: assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); > > Same as above. Please see the updated assert as suggested for vpmadd52[l/h]uq in the latest commit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1511830567 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1511830720 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1511830942 From kbarrett at openjdk.org Tue Mar 5 00:00:52 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 5 Mar 2024 00:00:52 GMT Subject: RFR: 8327224: G1: comment in G1BarrierSetC2::post_barrier() refers to nonexistent new_deferred_store_barrier() In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 15:04:57 GMT, Roberto Castañeda Lozano wrote: > This changeset updates a comment in `G1BarrierSetC2::post_barrier()` to point to the relevant code that must be kept in sync. Looks good, and trivial. ------------- Marked as reviewed by kbarrett (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/18108#pullrequestreview-1915593691 From duke at openjdk.org Tue Mar 5 00:08:05 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 5 Mar 2024 00:08:05 GMT Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs [v2] In-Reply-To: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: > The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. > > Below is the performance data on an Intel Tiger Lake machine. > > Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup > -- | -- | -- | -- > MathBench.ceilDouble | 547979 | 2170198 | 3.96 > MathBench.floorDouble | 547979 | 2167459 | 3.96 > MathBench.rintDouble | 547962 | 2130499 | 3.89 Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: unify the implementation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18089/files - new: https://git.openjdk.org/jdk/pull/18089/files/e8e3b9db..0401e18e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18089&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18089&range=00-01 Stats: 26 lines in 2 files changed: 0 ins; 25 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18089/head:pull/18089 PR: https://git.openjdk.org/jdk/pull/18089 From jkarthikeyan at openjdk.org Tue Mar 5 03:32:01
2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 5 Mar 2024 03:32:01 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v5] In-Reply-To: References: Message-ID: > Hi all, I've created this patch which aims to convert common integer minimum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`. While this transform doesn't generally improve code generation on its own, it simplifies control flow and creates new opportunities for vectorization. > > I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: > > Baseline Patch Improvement > Benchmark Mode Cnt Score Error Units Score Error Units > IfMinMax.testReductionInt avgt 15 500.307 ± 16.687 ns/op 509.383 ± 32.645 ns/op (no change)* > IfMinMax.testReductionLong avgt 15 493.184 ± 17.596 ns/op 513.587 ± 28.339 ns/op (no change)* > IfMinMax.testSingleInt avgt 15 3.588 ± 0.540 ns/op 2.965 ± 1.380 ns/op (no change) > IfMinMax.testSingleLong avgt 15 3.673 ± 0.128 ns/op 3.506 ± 0.590 ns/op (no change) > IfMinMax.testVectorInt avgt 15 340.425 ± 13.123 ns/op 59.689 ± 7.509 ns/op + 5.7x > IfMinMax.testVectorLong avgt 15 326.420 ± 15.554 ns/op 117.190 ± 5.622 ns/op + 2.8x > > > * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? > > The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated!
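The two if-statement shapes mentioned in the description can be sketched as plain Java. This is an illustrative example, not code from the patch; `maxBranchy`/`maxStore` are hypothetical names. Both methods compute the same result as `Math.max`, which is what justifies replacing the branchy `Cmp -> If -> Phi` subgraph with a single `Max` node.

```java
public class IfMinMaxSketch {
    // Ternary form: parsed by C2 as a compare, a branch, and a Phi.
    static int maxBranchy(int a, int b) {
        return a > b ? a : b;
    }

    // Conditional-store variant that the patch also recognizes.
    static int maxStore(int a, int b) {
        if (a > b) b = a;
        return b;
    }

    public static void main(String[] args) {
        // Exhaustively check a small range against the library reference.
        for (int a = -2; a <= 2; a++) {
            for (int b = -2; b <= 2; b++) {
                int expected = Math.max(a, b);
                if (maxBranchy(a, b) != expected) throw new AssertionError();
                if (maxStore(a, b) != expected) throw new AssertionError();
            }
        }
        System.out.println("ok");
    }
}
```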
Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Improve single benchmark, increase benchmark loop size ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17574/files - new: https://git.openjdk.org/jdk/pull/17574/files/b368c54d..76424e28 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17574&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17574&range=03-04 Stats: 17 lines in 1 file changed: 7 ins; 0 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/17574.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17574/head:pull/17574 PR: https://git.openjdk.org/jdk/pull/17574 From jkarthikeyan at openjdk.org Tue Mar 5 04:10:47 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 5 Mar 2024 04:10:47 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: <_yIQLmJFXOolbLAS8Wcxgl1juRlQwB0OWkKd8ZMcfmg=.9ed4a52d-9ffb-45eb-a0dc-7b3201974882@github.com> References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> <_yIQLmJFXOolbLAS8Wcxgl1juRlQwB0OWkKd8ZMcfmg=.9ed4a52d-9ffb-45eb-a0dc-7b3201974882@github.com> Message-ID: On Mon, 4 Mar 2024 08:59:46 GMT, Emanuel Peter wrote: > You mean you would be matching for a `Cmp -> CMove` node pattern that is equivalent for `Min/Max`, rather than matching a `Cmp -> If -> Phi` pattern? Yeah, I was thinking it might be better to let the CMove transform happen first, since the conditions guarding both transforms are aiming to do the same thing in essence. My thought was that if the regression in your `testCostDifference` was fixed, it would be better to not have to do that fix in two different locations, since it impacts `is_minmax` as well. > BTW, I watched a fascinating talk about branch-predictors / branchless code yesterday Thank you for linking this talk, it was really insightful! 
I also wonder if it would be possible to capture branch execution patterns somehow, to drive branch flattening optimizations. I figure it could be possible to keep track of the sequence of a branch's history of execution, and then compute some "entropy" value from that sequence to determine if there's a pattern, or if it's random and likely to be mispredicted. However, implementing that in practice sounds pretty difficult. @eme64 I've pushed a commit that fixes the benchmarks and sets the loop iteration count to 10_000. Could you check if this lets it vectorize on your machine? Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1977932187 From dholmes at openjdk.org Tue Mar 5 04:32:50 2024 From: dholmes at openjdk.org (David Holmes) Date: Tue, 5 Mar 2024 04:32:50 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: <-otSCbPTibs2l97RfYxBNnTGUjRnUft6UrIVk33woSQ=.c1604e01-0f8b-46ea-9cd8-809775f669b9@github.com> References: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> <-otSCbPTibs2l97RfYxBNnTGUjRnUft6UrIVk33woSQ=.c1604e01-0f8b-46ea-9cd8-809775f669b9@github.com> Message-ID: On Mon, 4 Mar 2024 17:56:09 GMT, Vladimir Kozlov wrote: > @dholmes-ora asked to disable warning by default and I agree. What I said was that until all the known issues are resolved then the warning should be disabled. If this PR fixes every warning that has been spotted then that is fine - the warning can remain on to detect new problems creeping in. Otherwise issues should be filed to fix all remaining warnings and the warning disabled until they are all addressed. We have been lucky that these unexpected warnings have only caused minimal disruption to our builds.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1977947572 From duke at openjdk.org Tue Mar 5 04:32:52 2024 From: duke at openjdk.org (kuaiwei) Date: Tue, 5 Mar 2024 04:32:52 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: References: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> Message-ID: On Mon, 4 Mar 2024 17:40:13 GMT, Vladimir Kozlov wrote: > > What are latest changes (commented `operand immLRot2()`) in `arm_32.ad` for? I want to comment out immLRot2. It's mentioned in many todo comments. So I want to keep it without warning. I used the wrong comment syntax. It's fixed in the next patch. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1977948363 From xliu at openjdk.org Tue Mar 5 05:14:45 2024 From: xliu at openjdk.org (Xin Liu) Date: Tue, 5 Mar 2024 05:14:45 GMT Subject: RFR: 8325681: C2 inliner rejects to inline a deeper callee because the methoddata of caller is immature. [v2] In-Reply-To: References: <_2EG8caEVf3BpvBIu40dwmD01ylzLNgazCIm90i_1Cc=.668bed4c-2814-4d59-9ced-3b441fe7a57a@github.com> <7v7ujChQujNMlcA4ZJweEzW0JCXx2Y_rloxPebHfvss=.f59e12af-f766-45a0-a7d0-0fd01748aec3@github.com> Message-ID: On Tue, 27 Feb 2024 19:42:25 GMT, Vladimir Ivanov wrote: >>> I don't think '_count field of ciCallProfile = -1' is correct for an immature method. it forces c2 to outline a call. C2 should make judgement based on the information HotSpot has collected. The real frequency is > than MinInlineFrequencyRatio. >> >> Indeed, that does look excessive. As another idea: utilize frequencies, but not type profiles at call sites with immature profiles. But then the next question is how representative the profile data at callee side then... > > Actually, there's a similar problematic scenario, now at allocation site: a local allocation followed by a long running loop. > > A a = factoryA::make(...); // new A(...)
> for (int i = 0; i < large_count; i++) { > // ... a is eligible for scalarization ... > } > > Inlining `make` method is a prerequisite to scalarize `a`, but profiling data is so scarce and hard-to-gather (a sample per long-running loop), so it's impractical to wait until profiling is over. It's straightforward to prove that `make` frequency is 100% of total executions (since it dominates the loop), but absolute counts don't make it evident. I have 2 thoughts on this problem. 1) ArgEscape won't be a problem if we have stack-allocation. Even under profiling, it won't hurt. 2) For your case and mine, we can leverage iterative EA, BCEscapeAnalyzer and late-inlining. After EA analysis, C2 can have a map. An ArgEscape object maps to a list of function calls. As long as compiler still has budget, C2 can do late-inline for the cheapest obj. It will convert an ArgEscape to NonEscape. I feel only 1 is a general solution. For 2), it's hard to have a cost model. In your example, we probably need to inline 100 bytecodes (factoryA::make) to make 'a' NonEscape. Bigger code may lose to one fast allocation. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17957#discussion_r1512142246 From duke at openjdk.org Tue Mar 5 06:40:00 2024 From: duke at openjdk.org (kuaiwei) Date: Tue, 5 Mar 2024 06:40:00 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v4] In-Reply-To: References: Message-ID: <8heX7a8FEHXGS5WM2ryS63uCJmQcsATp-hULdo87IDw=.f8d56906-dd4a-40ac-9538-6affe9b5353c@github.com> > Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. > I tried to clean unused operands for all platforms. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it.
kuaiwei has updated the pull request incrementally with one additional commit since the last revision: 1 check _disable_warnings in adlc 2 Fix error in arm_32.ad ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18075/files - new: https://git.openjdk.org/jdk/pull/18075/files/29514638..5028086a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18075&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18075&range=02-03 Stats: 13 lines in 2 files changed: 4 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/18075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18075/head:pull/18075 PR: https://git.openjdk.org/jdk/pull/18075 From duke at openjdk.org Tue Mar 5 06:47:48 2024 From: duke at openjdk.org (kuaiwei) Date: Tue, 5 Mar 2024 06:47:48 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: On Mon, 4 Mar 2024 18:18:29 GMT, Vladimir Kozlov wrote: > > I'm not clear about m4 file. How do we use it? > > The build script of jdk will combine all ad file into one single file and adlc will compile it. So aarch64_vector.ad will be checked as well. > > Aarch64 m4 files are used to manually update .ad files. My concern was that could be overlapped code in m4 files which may overwrite your changes in .ad when someone do such manual update in a future. Fortunately `aarch64_ad.m4` does not have operand definitions so your changes are fine. But `aarch64_vector_ad.m4` has them so if we need to change `aarch64_vector.ad` we need to modify m4 file too. I checked all 44 removed operands in aarch64 and found none of them in the m4 files. I just grepped for them, and only "indOffI" and "indOffL" appeared, because of "vmemA_indOffI4" and "vmemA_indOffL4". So we need not change the m4 file.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1978064822 From duke at openjdk.org Tue Mar 5 06:47:48 2024 From: duke at openjdk.org (kuaiwei) Date: Tue, 5 Mar 2024 06:47:48 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: References: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> <-otSCbPTibs2l97RfYxBNnTGUjRnUft6UrIVk33woSQ=.c1604e01-0f8b-46ea-9cd8-809775f669b9@github.com> Message-ID: <1xaaqBkBXk0Cor7yFYD4yMht4UOHuR-u5W5lilxr2R0=.9e9fe3c0-03c2-4f83-a0f3-9a06cdd8dacb@github.com> On Tue, 5 Mar 2024 04:29:08 GMT, David Holmes wrote: >> @dholmes-ora asked to disable warning by default and I agree. >> >> We can use `AD._disable_warnings` flag to guard these warnings and add corresponding `-w` flag to `adlc` command in `GensrcAdlc.gmk` > >> @dholmes-ora asked to disable warning by default and I agree. > > What I said was that until all the known issues are resolved then the warning should be disabled. If this PR fixes every warning that has been spotted then that is fine - the warning can remain on to detect new problems creeping in. Otherwise issues should be filed to fix all remaining warnings and the warning disabled until they are all addressed. We have been lucky that these unexpected warnings have only caused minimal disruption to our builds. > @dholmes-ora asked to disable warning by default and I agree. > > We can use `AD._disable_warnings` flag to guard these warnings and add corresponding `-w` flag to `adlc` command in `GensrcAdlc.gmk` I added a check of _disable_warnings in adlc, but did not enable it in the build script.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1978066503 From rcastanedalo at openjdk.org Tue Mar 5 06:59:48 2024 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 5 Mar 2024 06:59:48 GMT Subject: RFR: 8327224: G1: comment in G1BarrierSetC2::post_barrier() refers to nonexistent new_deferred_store_barrier() In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 23:58:34 GMT, Kim Barrett wrote: > Looks good, and trivial. Thanks, Kim! ------------- PR Comment: https://git.openjdk.org/jdk/pull/18108#issuecomment-1978077841 From rcastanedalo at openjdk.org Tue Mar 5 06:59:49 2024 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 5 Mar 2024 06:59:49 GMT Subject: Integrated: 8327224: G1: comment in G1BarrierSetC2::post_barrier() refers to nonexistent new_deferred_store_barrier() In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 15:04:57 GMT, Roberto Castañeda Lozano wrote: > This changeset updates a comment in `G1BarrierSetC2::post_barrier()` to point to the relevant code that must be kept in sync. This pull request has now been integrated. Changeset: 0b959098 Author: Roberto Castañeda Lozano URL: https://git.openjdk.org/jdk/commit/0b959098be452aa2c9b461c921e11b19678138c7 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8327224: G1: comment in G1BarrierSetC2::post_barrier() refers to nonexistent new_deferred_store_barrier() Reviewed-by: kbarrett ------------- PR: https://git.openjdk.org/jdk/pull/18108 From gcao at openjdk.org Tue Mar 5 07:57:58 2024 From: gcao at openjdk.org (Gui Cao) Date: Tue, 5 Mar 2024 07:57:58 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 Message-ID: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Hi, please review this patch that fixes the minimal build failure for riscv.
Error log for minimal build: Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 | ^~~~~~~~~~~~~ | MaxNewSize gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 gmake[3]: *** Waiting for unfinished jobs.... gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 gmake[2]: *** Waiting for unfinished jobs.... ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) === Output from failing command(s) repeated here === * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 | ^~~~~~~~~~~~~ | MaxNewSize * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. === End of repeated output === No indication of failed target found. HELP: Try searching the build log for '] Error'. HELP: Run 'make doctor' to diagnose build problems.
make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 The root cause is that MaxVectorSize is only defined under COMPILER2. We should use VM_Version::_initial_vector_length instead of MaxVectorSize. Testing: - [x] linux-riscv minimal fastdebug native build ------------- Commit messages: - 8327283: RISC-V: Minimal build failed after JDK-8319716 Changes: https://git.openjdk.org/jdk/pull/18114/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18114&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327283 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18114.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18114/head:pull/18114 PR: https://git.openjdk.org/jdk/pull/18114 From bkilambi at openjdk.org Tue Mar 5 08:16:11 2024 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 5 Mar 2024 08:16:11 GMT Subject: RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2] In-Reply-To: References: Message-ID: > Floating-point addition is non-associative, that is, adding floating-point elements in arbitrary order may produce different values. Specifically, the Vector API intentionally does not define the order of reduction, which allows platforms to generate more efficient code [1]. So C2 needs a node to represent non strictly-ordered add-reduction for floating-point types. > > To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value.
> > With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. > > [AArch64] > On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. > > This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. > > No effects on other platforms. > > [Performance] > FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit). > > ADDLanes > > Benchmark Before After Unit > FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms > > > Final code is as below: > > Before: > ` fadda z17.s, p7/m, z17.s, z16.s > ` > After: > > faddp v17.4s, v21.4s, v21.4s > faddp s18, v17.2s > fadd s18, s18, s19 > > > > > [Test] > Full jtreg passed on AArch64 and x86. 
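The non-associativity that motivates the separate ordered and unordered reduction handling above is easy to check in plain Java. The following standalone sketch (illustrative only, not part of the patch; class and variable names are made up) shows that reassociating a floating-point sum, as a non strictly-ordered reduction is allowed to do, can change the result:

```java
// Demo: floating-point addition is not associative, so an unordered
// (reassociated) reduction may differ from a strict left-to-right one.
public class FpReassocDemo {
    public static void main(String[] args) {
        double big = 1e20;
        // Strict left-to-right order, as an ordered reduction computes it:
        double strict = (big + (-big)) + 3.14;   // 0.0 + 3.14
        // Reassociated order, as an unordered reduction may compute it:
        double reassoc = big + ((-big) + 3.14);  // 3.14 is absorbed by -1e20
        System.out.println(strict);   // prints 3.14
        System.out.println(reassoc);  // prints 0.0
        if (strict == reassoc) {
            throw new AssertionError("expected different results");
        }
    }
}
```

This is why auto-vectorization must preserve the source's strict reduction order, while a Vector API reduction is free to pick the faster reassociated instruction sequence.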
> > [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 > [2] https://bugs.openjdk.org/browse/JDK-8275275 > [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Addressed review comments for changes in backend rules and code style ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18034/files - new: https://git.openjdk.org/jdk/pull/18034/files/f8492ece..f8f79ac2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18034&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18034&range=00-01 Stats: 21 lines in 3 files changed: 10 ins; 3 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/18034.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18034/head:pull/18034 PR: https://git.openjdk.org/jdk/pull/18034 From bkilambi at openjdk.org Tue Mar 5 08:23:50 2024 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 5 Mar 2024 08:23:50 GMT Subject: RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2] In-Reply-To: References: Message-ID: On Wed, 28 Feb 2024 08:32:52 GMT, Guoxiong Li wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments for changes in backend rules and code style > > src/hotspot/cpu/aarch64/aarch64_vector.ad line 2891: > >> 2889: predicate((!VM_Version::use_neon_for_vector(Matcher::vector_length_in_bytes(n->in(2))) && >> 2890: !n->as_Reduction()->requires_strict_order()) || >> 2891: n->as_Reduction()->requires_strict_order()); > > This predication looks strange and complex. 
Can it be simplified to `!VM_Version::use_neon_for_vector(Matcher::vector_length_in_bytes(n->in(2))) || n->as_Reduction()->requires_strict_order()`? Hi, thanks for the suggestion. I agree it is a bit cumbersome but I felt it's easier to understand the various conditions on which these SVE instructions can be generated. Nevertheless, the suggested changes feel more compact. I made the changes in the new PS. > src/hotspot/share/opto/vectorIntrinsics.cpp line 1740: > >> 1738: Node* value = nullptr; >> 1739: if (mask == nullptr) { >> 1740: assert(!is_masked_op, "Masked op needs the mask value never null"); > > This assert may be missed after your refactor. But it seems not really matter. Yes, the conditions of `mask != nullptr` should take care of that. > src/hotspot/share/opto/vectornode.hpp line 242: > >> 240: virtual bool requires_strict_order() const { >> 241: return false; >> 242: }; > > The last semicolon is redundant. Done > src/hotspot/share/opto/vectornode.hpp line 265: > >> 263: class AddReductionVFNode : public ReductionNode { >> 264: private: >> 265: bool _requires_strict_order; // false in Vector API. > > The comment `false in Vector API` seems not so clean. We need to state the meaning of the field instead of one of its usages? 
Done > src/hotspot/share/opto/vectornode.hpp line 276: > >> 274: > >> 275: virtual bool cmp(const Node& n) const { >> 276: return Node::cmp(n) && _requires_strict_order== ((ReductionNode&)n).requires_strict_order(); > > Need a space before `==` Done > src/hotspot/share/opto/vectornode.hpp line 297: > >> 295: > >> 296: virtual bool cmp(const Node& n) const { >> 297: return Node::cmp(n) && _requires_strict_order== ((ReductionNode&)n).requires_strict_order(); > > Need a space before `==` Done ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1512315765 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1512318336 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1512319129 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1512319002 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1512318763 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1512318536 From epeter at openjdk.org Tue Mar 5 08:43:04 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 5 Mar 2024 08:43:04 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v15] In-Reply-To: References: Message-ID: > This is a feature requested by @RogerRiggs and @cl4es. > > **Idea** > > Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. > > This patch here supports a few simple use-cases, like these: > > Merge consecutive array stores, with constants.
We can combine the separate constants into a larger constant: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 > > Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 > > The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 > > **Details** > > This draft currently implements the optimization in an additional special IGVN phase: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 > > We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. We essentially try to establish a chain of mergable stores: > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 > > Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either both store constants, or adjacent segments of a larger value ... 
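The kind of hand-written store sequence this optimization targets can be sketched as follows (a standalone illustration; `putLongLE` is a made-up name, not the proposed API or the actual test code):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class MergeStoresDemo {
    // Eight adjacent byte stores of shifted segments of the same long value.
    // With the optimization, C2 can merge them into one 8-byte store.
    static void putLongLE(byte[] a, int offset, long v) {
        a[offset]     = (byte)  v;
        a[offset + 1] = (byte) (v >> 8);
        a[offset + 2] = (byte) (v >> 16);
        a[offset + 3] = (byte) (v >> 24);
        a[offset + 4] = (byte) (v >> 32);
        a[offset + 5] = (byte) (v >> 40);
        a[offset + 6] = (byte) (v >> 48);
        a[offset + 7] = (byte) (v >> 56);
    }

    public static void main(String[] args) {
        byte[] a = new byte[8];
        putLongLE(a, 0, 0x1122334455667788L);
        // Reading the bytes back little-endian recovers the original value.
        long back = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN).getLong(0);
        if (back != 0x1122334455667788L) {
            throw new AssertionError(Long.toHexString(back));
        }
    }
}
```

The separate shift-and-truncate stores are exactly the "variable that was split" shape described above: the merged form undoes the splitting and stores the whole long at once.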
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: fix test for trapping examples ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16245/files - new: https://git.openjdk.org/jdk/pull/16245/files/638c80f4..4a3ee855 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=13-14 Stats: 27 lines in 1 file changed: 1 ins; 13 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/16245.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16245/head:pull/16245 PR: https://git.openjdk.org/jdk/pull/16245 From gli at openjdk.org Tue Mar 5 09:55:47 2024 From: gli at openjdk.org (Guoxiong Li) Date: Tue, 5 Mar 2024 09:55:47 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:46:56 GMT, Emanuel Peter wrote: > Taking this over from @fg1417, she seems to be "on leave" according to her GitHub account. > > I'm taking her regression test (she improved the reproducer that I had originally reported the bug with). > > It is ok to just remove the assert, because the address is "sanitized" after the assert, i.e. we use a `lea` instruction to compute the address. > > But I'm simply removing the assert, as suggested in the comments of the previous [PR](https://github.com/openjdk/jdk/pull/16991#issuecomment-1962307596). Looks good. ------------- Marked as reviewed by gli (Committer). PR Review: https://git.openjdk.org/jdk/pull/18103#pullrequestreview-1916417828 From roberto.castaneda.lozano at oracle.com Tue Mar 5 10:19:40 2024 From: roberto.castaneda.lozano at oracle.com (Roberto Castaneda Lozano) Date: Tue, 5 Mar 2024 10:19:40 +0000 Subject: A case where G1/Shenandoah satb barrier is not optimized? 
In-Reply-To: <4d7f6d11-824b-47d0-8419-06694f695745.yude.lyd@alibaba-inc.com> References: <4d7f6d11-824b-47d0-8419-06694f695745.yude.lyd@alibaba-inc.com> Message-ID: Hi Yude Yin (including hotspot-compiler-dev mailing list), From what I read in the original JBS issue [1], the g1_can_remove_pre_barrier/g1_can_remove_post_barrier optimization targets writes within simple constructors (such as that of Node within java.util.HashMap [2]), and seems to assume that the situation you describe (several writes to the same field) is either uncommon within this scope or can be reduced by the compiler into a form that is optimizable. In your example, one would hope that the compiler proves that 'ref = a' is redundant and optimizes it away (which would lead to removing all barriers), but this optimization is inhibited by the barrier operations inserted by the compiler in its intermediate representation. These limitations will become easier to overcome with the "Late G1 Barrier Expansion" JEP (in draft status), which proposes hiding barrier code from the compiler's transformations and optimizations [3]. In fact, our current "Late G1 Barrier Expansion" prototype does optimize 'ref = a' away, and removes all barriers in your example. Cheers, Roberto [1] https://bugs.openjdk.org/browse/JDK-8057737 [2] https://github.com/openjdk/jdk/blob/e9adcebaf242843fe2004b01747b5a930b62b291/src/java.base/share/classes/java/util/HashMap.java#L287-L292 [3] https://bugs.openjdk.org/browse/JDK-8322295 ________________________________________ From: hotspot-gc-dev on behalf of Yude Lin Sent: Monday, March 4, 2024 11:32 AM To: hotspot-gc-dev Subject: A case where G1/Shenandoah satb barrier is not optimized? Hi Dear GC devs, I found a case where GC barriers cannot be optimized out.
I wonder if anyone could enlighten me on this code: > G1BarrierSetC2::g1_can_remove_pre_barrier (or ShenandoahBarrierSetC2::satb_can_remove_pre_barrier) where there is a condition: > (captured_store == nullptr || captured_store == st_init->zero_memory()) on the store that can be optimized out. The comment says: > The compiler needs to determine that the object in which a field is about > to be written is newly allocated, and that no prior store to the same field > has happened since the allocation. But my understanding is satb barriers of any number of stores immediately (i.e., no in-between safepoints) after an allocation can be optimized out, same field or not. The "no prior store" condition confuses me. What's more, failing to optimize one satb barrier will prevent further barrier optimization that otherwise would be done (maybe due to control flow complexity from the satb barrier). An example would be:

public static class TwoFieldObject {
    public Object ref;
    public Object ref2;

    public TwoFieldObject(Object a) {
        ref = a;
    }
}

public static Object testWrite(Object a, Object b, Object c) {
    TwoFieldObject tfo = new TwoFieldObject(a);
    tfo.ref = b;  // satb barrier of this store cannot be optimized out, and because of its existence, post barrier will also not be optimized out
    tfo.ref2 = c; // because of the previous store's barriers, pre/post barriers of this store will not be optimized out
    return tfo;
}

From rkennke at openjdk.org Tue Mar 5 11:12:57 2024 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 5 Mar 2024 11:12:57 GMT Subject: RFR: 8327361: Update some comments after JDK-8139457 Message-ID: A follow-up to [8139457: Relax alignment of array elements](https://bugs.openjdk.org/browse/JDK-8139457) ([PR](https://git.openjdk.org/jdk/pull/11044)), some comments need an update.
------------- Commit messages: - 8327361: Update some comments after JDK-8139457 Changes: https://git.openjdk.org/jdk/pull/18120/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18120&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327361 Stats: 12 lines in 2 files changed: 0 ins; 0 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/18120.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18120/head:pull/18120 PR: https://git.openjdk.org/jdk/pull/18120 From epeter at openjdk.org Tue Mar 5 11:17:47 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 5 Mar 2024 11:17:47 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> <_yIQLmJFXOolbLAS8Wcxgl1juRlQwB0OWkKd8ZMcfmg=.9ed4a52d-9ffb-45eb-a0dc-7b3201974882@github.com> Message-ID: <-Cwct-5ZBYHEG-67r6xe4by7s0rI7w27ogfdJcIEBrw=.e4c95ed8-6353-46b9-a946-3bf2b2c47765@github.com> On Tue, 5 Mar 2024 04:08:19 GMT, Jasmine Karthikeyan wrote: >> @jaskarth >>> With a bit of further reflection on all this, I think it might be best if this patch was changed so that it acts on CMove directly >> >> You mean you would be matching for a `Cmp -> CMove` node pattern that is equivalent for `Min/Max`, rather than matching a `Cmp -> If -> Phi` pattern? I guess that would allow you to get better types, without having to deal with all the CMove-vs-branch-prediction heuristics. >> >> BTW, I watched a fascinating talk about branch-predictors / branchless code yesterday: >> `Branchless Programming in C++ - Fedor Pikus - CppCon 2021` >> https://www.youtube.com/watch?v=g-WPhYREFjk >> >> My conclusion from that: it is really hard to say ahead of time if the branch-predictor is successful. It depends on how predictable a condition is. The branch-predictor can see patterns (like alternating true-false). 
So even if a probability is 50% on a branch, it may be fully predictable, and branching code is much more efficient than branchless code. But in totally random cases, branchless code may be faster because you will have a large percentage of mispredictions, and mispredictions are expensive. But in both cases you would see `iff->_prob = 0.5`. >> Really what we would need is profiling that checks how much a branch was `mispredicted`, and not how much it was `taken`. But not sure if we can even get that profiling data. > >> You mean you would be matching for a `Cmp -> CMove` node pattern that is equivalent for `Min/Max`, rather than matching a `Cmp -> If -> Phi` pattern? > > Yeah, I was thinking it might be better to let the CMove transform happen first, since the conditions guarding both transforms are aiming to do the same thing in essence. My thought was that if the regression in your `testCostDifference` was fixed, it would be better to not have to do that fix in two different locations, since it impacts `is_minmax` as well. > >> BTW, I watched a fascinating talk about branch-predictors / branchless code yesterday > > Thank you for linking this talk, it was really insightful! I also wonder if it would be possible to capture branch execution patterns somehow, to drive branch flattening optimizations. I figure it could be possible to keep track of the sequence of a branch's history of execution, and then compute some "entropy" value from that sequence to determine if there's a pattern, or if it's random and likely to be mispredicted. However, implementing that in practice sounds pretty difficult. > > @eme64 I've pushed a commit that fixes the benchmarks and sets the loop iteration count to 10_000. Could you check if this lets it vectorize on your machine? Thanks! @jaskarth Why don't you first make the code change starting from a `Cmp -> CMove` pattern rather than the `Cmp -> If -> Phi` pattern.
Then I can look at both things together ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1978523865 From duke at openjdk.org Tue Mar 5 11:32:47 2024 From: duke at openjdk.org (Swati Sharma) Date: Tue, 5 Mar 2024 11:32:47 GMT Subject: RFR: 8326421: Add jtreg test for large arrayCopy disjoint case. In-Reply-To: References: Message-ID: On Fri, 23 Feb 2024 18:56:48 GMT, Swati Sharma wrote: > There is already a large suite of arraycopy tests here: https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/compiler/arraycopy/stress > > Any reason for not extending that one instead? Hi @shipilev, I tried to extend it in the stress framework; below are a few points which I observed: - As all the tests with different primitive types initialize the orig and test arrays with the MAX_SIZE parameter, there is no other way to increase the size of the array. I tried increasing the MAX_SIZE from 128K to 3MB to cover the test points, and that increased the test execution time from 2 minutes to 4 minutes. - The testWith method uses the MAX_SIZE parameter to define the array, so defining a new array of 4MB size requires adding a new test method for all types, which I think would duplicate the code. - The current test takes a few seconds to execute for large sizes and has very pointed length test cases, instead of random lengths with both aligned and unaligned cases for the byte type. Swati ------------- PR Comment: https://git.openjdk.org/jdk/pull/17962#issuecomment-1978548219 From galder at openjdk.org Tue Mar 5 11:36:46 2024 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 5 Mar 2024 11:36:46 GMT Subject: RFR: 8327361: Update some comments after JDK-8139457 In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 11:06:48 GMT, Roman Kennke wrote: > A follow-up to [8139457: Relax alignment of array elements](https://bugs.openjdk.org/browse/JDK-8139457) ([PR](https://git.openjdk.org/jdk/pull/11044)), some comments need an update.
I think the changes look fine, but looking closer to the original PR, src/hotspot/cpu/riscv/c1_MacroAssembler_riscv.hpp might also need adjusting. s390 and ppc are probably just fine. ------------- Changes requested by galder (Author). PR Review: https://git.openjdk.org/jdk/pull/18120#pullrequestreview-1916637704 From rkennke at openjdk.org Tue Mar 5 11:41:20 2024 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 5 Mar 2024 11:41:20 GMT Subject: RFR: 8327361: Update some comments after JDK-8139457 [v2] In-Reply-To: References: Message-ID: > A follow-up to [8139457: Relax alignment of array elements](https://bugs.openjdk.org/browse/JDK-8139457) ([PR](https://git.openjdk.org/jdk/pull/11044)), some comments need an update. Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: RISCV changes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18120/files - new: https://git.openjdk.org/jdk/pull/18120/files/a14c0c9c..2da3ee69 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18120&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18120&range=00-01 Stats: 6 lines in 1 file changed: 0 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/18120.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18120/head:pull/18120 PR: https://git.openjdk.org/jdk/pull/18120 From gli at openjdk.org Tue Mar 5 12:16:49 2024 From: gli at openjdk.org (Guoxiong Li) Date: Tue, 5 Mar 2024 12:16:49 GMT Subject: RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 08:16:11 GMT, Bhavana Kilambi wrote: >> Floating-point addition is non-associative, that is adding floating-point elements in arbitrary order may get different value. Specially, Vector API does not define the order of reduction intentionally, which allows platforms to generate more efficient codes [1]. 
So that needs a node to represent non strictly-ordered add-reduction for floating-point type in C2. >> >> To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value. >> >> With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. >> >> [AArch64] >> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. >> >> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. >> >> No effects on other platforms. >> >> [Performance] >> FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit). >> >> ADDLanes >> >> Benchmark Before After Unit >> FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms >> >> >> Final code is as below: >> >> Before: >> ` fadda z17.s, p7/m, z17.s, z16.s >> ` >> After: >> >> faddp v17.4s, v21.4s, v21.4s >> faddp s18, v17.2s >> fadd s18, s18, s19 >> >> >> >> >> [Test] >> Full jtreg passed on AArch64 and x86. 
>> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 >> [2] https://bugs.openjdk.org/browse/JDK-8275275 >> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments for changes in backend rules and code style Looks good except the comment. src/hotspot/share/opto/vectornode.hpp line 268: > 266: // The value is true when add reduction for floats is auto-vectorized as auto-vectorization > 267: // mandates strict ordering but the value is false when this node is generated through VectorAPI > 268: // as VectorAPI does not impose any such rules on ordering. The comment could be better. But I leave it to a reviewer who is proficient in English to help you improve it. ------------- Marked as reviewed by gli (Committer). PR Review: https://git.openjdk.org/jdk/pull/18034#pullrequestreview-1916715388 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1512719793 From gcao at openjdk.org Tue Mar 5 12:41:45 2024 From: gcao at openjdk.org (Gui Cao) Date: Tue, 5 Mar 2024 12:41:45 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 In-Reply-To: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Tue, 5 Mar 2024 07:41:05 GMT, Gui Cao wrote: > Hi, please review this patch that fix the minimal build failed for riscv.
> > Error log for minimal build: > > Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 > gmake[3]: *** Waiting for unfinished jobs.... > gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 > gmake[2]: *** Waiting for unfinished jobs.... > > ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) > > === Output from failing command(s) repeated here === > * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > > * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. > === End of repeated output === > > No indication of failed target found. > HELP: Try searching the build log for '] Error'. > HELP: Run 'make doctor' to diagnose build problems.
> > make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 > make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 > > > The root cause is that MaxVectorSize is only defined under COMPILER2, we should use VM_Version::_initial_vector_length instead of MaxVectorSize. > > Testing: > > - [x] linux-riscv minimal fastdebug native build @robehn Could you please take a look? ------------- PR Comment: https://git.openjdk.org/jdk/pull/18114#issuecomment-1978686934 From rehn at openjdk.org Tue Mar 5 12:56:45 2024 From: rehn at openjdk.org (Robbin Ehn) Date: Tue, 5 Mar 2024 12:56:45 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 In-Reply-To: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Tue, 5 Mar 2024 07:41:05 GMT, Gui Cao wrote: > Hi, please review this patch that fixes the minimal build failure for riscv. > > Error log for minimal build: > > Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 > gmake[3]: *** Waiting for unfinished jobs.... > gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 > gmake[2]: *** Waiting for unfinished jobs....
> > ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) > > === Output from failing command(s) repeated here === > * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > > * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. > === End of repeated output === > > No indication of failed target found. > HELP: Try searching the build log for '] Error'. > HELP: Run 'make doctor' to diagnose build problems. > > make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 > make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 > > > The root cause is that MaxVectorSize is only defined under COMPILER2, we should use VM_Version::_initial_vector_length instead of MaxVectorSize. > > Testing: > > - [x] linux-riscv minimal fastdebug native build The SHA intrinsics are only used in "LibraryCallKit::inline_digestBase_implCompress" and JVMCI. So I think these (plus md5 and chacha) should be put into an ifdef COMPILER2_OR_JVMCI block.
(I was going to do that but it slipped my mind) The MaxVectorSize is defined if JVMCI and/or C2 is defined: `NOT_COMPILER2(product(intx, MaxVectorSize, 64,` ------------- PR Comment: https://git.openjdk.org/jdk/pull/18114#issuecomment-1978713558 From gcao at openjdk.org Tue Mar 5 13:15:46 2024 From: gcao at openjdk.org (Gui Cao) Date: Tue, 5 Mar 2024 13:15:46 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 In-Reply-To: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Tue, 5 Mar 2024 07:41:05 GMT, Gui Cao wrote: > Hi, please review this patch that fixes the minimal build failure for riscv. > > Error log for minimal build: > > Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 > gmake[3]: *** Waiting for unfinished jobs.... > gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 > gmake[2]: *** Waiting for unfinished jobs....
> > ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) > > === Output from failing command(s) repeated here === > * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > > * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. > === End of repeated output === > > No indication of failed target found. > HELP: Try searching the build log for '] Error'. > HELP: Run 'make doctor' to diagnose build problems. > > make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 > make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 > > > The root cause is that MaxVectorSize is only defined under COMPILER2, we should use VM_Version::_initial_vector_length instead of MaxVectorSize. > > Testing: > > - [x] linux-riscv minimal fastdebug native build > The SHA intrinsics are only used in "LibraryCallKit::inline_digestBase_implCompress" and JVMCI. So I think these (plus md5 and chacha) should be put into an ifdef COMPILER2_OR_JVMCI block.
(I was going to do that but it slipped my mind) > > The MaxVectorSize is defined if JVMCI and/or C2 is defined: `NOT_COMPILER2(product(intx, MaxVectorSize, 64,` Yes, you are right. I've considered this approach of putting the function definitions under an ifdef COMPILER2_OR_JVMCI block. But I find that no other CPU does this. I am not sure if there is any other reason for this. But I can do that if we all think it's better. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18114#issuecomment-1978751082 From jbhateja at openjdk.org Tue Mar 5 13:15:47 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 5 Mar 2024 13:15:47 GMT Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs [v2] In-Reply-To: References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: On Tue, 5 Mar 2024 00:08:05 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. >> >> Below is the performance data on an Intel Tiger Lake machine. >> >> Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup >> -- | -- | -- | -- >> MathBench.ceilDouble | 547979 | 2170198 | 3.96 >> MathBench.floorDouble | 547979 | 2167459 | 3.96 >> MathBench.rintDouble | 547962 | 2130499 | 3.89 > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > unify the implementation Marked as reviewed by jbhateja (Reviewer).
------------- PR Review: https://git.openjdk.org/jdk/pull/18089#pullrequestreview-1916872568 From jbhateja at openjdk.org Tue Mar 5 13:16:48 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 5 Mar 2024 13:16:48 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v11] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 21:40:04 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. >> >> This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) >> >> This PR shows up to 19x speedup on buffer sizes of 1MB. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > update asserts for vpmadd52l/hq Marked as reviewed by jbhateja (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/17881#pullrequestreview-1916876320 From epeter at openjdk.org Tue Mar 5 13:39:51 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 5 Mar 2024 13:39:51 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 18:20:41 GMT, Vladimir Kozlov wrote: >> Taking this over from @fg1417, she seems to be "on leave" according to her GitHub account. >> >> I'm taking her regression test (she improved the reproducer that I had originally reported the bug with). >> >> It is ok to just remove the assert, because the address is "sanitized" after the assert, i.e. we use a `lea` instruction to compute the address. >> >> But I'm simply removing the assert, as suggested in the comments of the previous [PR](https://github.com/openjdk/jdk/pull/16991#issuecomment-1962307596). > > Good.
Thanks @vnkozlov @theRealAph @lgxbslgx for the reviews! Thanks @fg1417 for the original PR and patching up the test. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18103#issuecomment-1978790138 From epeter at openjdk.org Tue Mar 5 13:39:53 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 5 Mar 2024 13:39:53 GMT Subject: Integrated: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:46:56 GMT, Emanuel Peter wrote: > Taking this over from @fg1417, she seems to be "on leave" according to her GitHub account. > > I'm taking her regression test (she improved the reproducer that I had originally reported the bug with). > > It is ok to just remove the assert, because the address is "sanitized" after the assert, i.e. we use a `lea` instruction to compute the address. > > But I'm simply removing the assert, as suggested in the comments of the previous [PR](https://github.com/openjdk/jdk/pull/16991#issuecomment-1962307596). This pull request has now been integrated. 
Changeset: 98f0b866 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/98f0b86641d84048949ed3da1cb14f3820b01c12 Stats: 176 lines in 2 files changed: 172 ins; 4 del; 0 mod 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" Co-authored-by: Fei Gao Reviewed-by: aph, kvn, gli ------------- PR: https://git.openjdk.org/jdk/pull/18103 From jbhateja at openjdk.org Tue Mar 5 14:13:48 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 5 Mar 2024 14:13:48 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: <_6dorzq67KAZsTBHBvbQRDi_xW70bFhJudnxbG88m6I=.33e06bd5-d5fc-4ba8-b740-437155d567cf@github.com> References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> <_6dorzq67KAZsTBHBvbQRDi_xW70bFhJudnxbG88m6I=.33e06bd5-d5fc-4ba8-b740-437155d567cf@github.com> Message-ID: <_5z5emOe-VqjE7REHmk72wtJ-X_MUggxilrkXFUjdPo=.e30bafc3-0fc4-4872-a99c-f22e383301e3@github.com> On Mon, 4 Mar 2024 21:36:36 GMT, Srinivas Vamsi Parasa wrote: >> It would be good to make this instruction generic. > > Please see the updated assert as suggested for vpmadd52[l/h]uq in the latest commit. [poly1305_spr_validation.patch](https://github.com/openjdk/jdk/files/14496404/poly1305_spr_validation.patch) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1512889086 From epeter at openjdk.org Tue Mar 5 14:32:02 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 5 Mar 2024 14:32:02 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v16] In-Reply-To: References: Message-ID: > This is a feature requested by @RogerRiggs and @cl4es . > > **Idea** > > Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup.
Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. > > This patch here supports a few simple use-cases, like these: > > Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 > > Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 > > The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 > > **Details** > > This draft currently implements the optimization in an additional special IGVN phase: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 > > We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. 
We essentially try to establish a chain of mergable stores: > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 > > Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either both store constants, or adjacent segments of a larger value ... Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 46 commits: - Merge branch 'master' into JDK-8318446 - fix test for trapping examples - WIP test with out of bounds exception - allow only array stores of same type as container - mismatched access test - add test300 - make it happen in post_loop_opts - fix invalid case - cosmetic fixes - New version with ArrayPointer - ... and 36 more: https://git.openjdk.org/jdk/compare/98f0b866...07c233fb ------------- Changes: https://git.openjdk.org/jdk/pull/16245/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=15 Stats: 2391 lines in 13 files changed: 2387 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/16245.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16245/head:pull/16245 PR: https://git.openjdk.org/jdk/pull/16245 From epeter at openjdk.org Tue Mar 5 15:29:01 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 5 Mar 2024 15:29:01 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v16] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 14:32:02 GMT, Emanuel Peter wrote: >> This is a feature requested by @RogerRiggs and @cl4es . >> >> **Idea** >> >> Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup.
Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. >> >> This patch here supports a few simple use-cases, like these: >> >> Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 >> >> Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 >> >> The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 >> >> **Details** >> >> This draft currently implements the optimization in an additional special IGVN phase: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 >> >> We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. 
We essentially try to establish a chain of mergable stores: >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 >> >> Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either bot... > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 46 commits: > > - Merge branch 'master' into JDK-8318446 > - fix test for trapping examples > - WIP test with out of bounds exception > - allow only array stores of same type as container > - mismatched access test > - add test300 > - make it happen in post_loop_opts > - fix invalid case > - cosmetic fixes > - New version with ArrayPointer > - ... and 36 more: https://git.openjdk.org/jdk/compare/98f0b866...07c233fb A blocking issue is now integrated and merged: https://github.com/openjdk/jdk/pull/18103 ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1979030281 From epeter at openjdk.org Tue Mar 5 15:55:12 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 5 Mar 2024 15:55:12 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: Message-ID: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> > This is a feature requested by @RogerRiggs and @cl4es . > > **Idea** > > Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`).
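To make the pattern concrete, here is a plain-Java sketch of the kind of split store involved. The helper name `setIntLE` is hypothetical (not taken from the patch or its tests): an int is split with shifts into four adjacent byte stores, which is the shape this optimization could merge into a single 4-byte store, matching what the hand-written `Unsafe.putLongUnaligned` / `ByteArrayLittleEndian.setLong` tricks achieve manually.

```java
public class MergeStoresExample {
    // Hypothetical helper (not from the patch): an int is split with
    // shifts and truncations into four adjacent byte stores, in
    // little-endian order. Each assignment below is a separate StoreB
    // node in C2's IR; the merge-stores optimization aims to combine
    // such a chain into one larger store.
    static void setIntLE(byte[] a, int offset, int v) {
        a[offset    ] = (byte) (v      );
        a[offset + 1] = (byte) (v >>  8);
        a[offset + 2] = (byte) (v >> 16);
        a[offset + 3] = (byte) (v >> 24);
    }

    public static void main(String[] args) {
        byte[] a = new byte[4];
        setIntLE(a, 0, 0x11223344);
        // Bytes land in little-endian order: 44 33 22 11.
        for (byte b : a) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}
```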
They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. > > This patch here supports a few simple use-cases, like these: > > Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 > > Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 > > The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 > > **Details** > > This draft currently implements the optimization in an additional special IGVN phase: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 > > We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. We essentially try to establish a chain of mergable stores: > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 > > Mergable stores must have the same Opcode (implies they have the same element type and hence size). 
Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either both store constants, or adjacent segments of a larger value ... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: a little bit of casting for debug printing code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16245/files - new: https://git.openjdk.org/jdk/pull/16245/files/07c233fb..796d9508 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=16 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=15-16 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/16245.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16245/head:pull/16245 PR: https://git.openjdk.org/jdk/pull/16245 From ddong at openjdk.org Tue Mar 5 15:58:54 2024 From: ddong at openjdk.org (Denghui Dong) Date: Tue, 5 Mar 2024 15:58:54 GMT Subject: RFR: 8327379: Make TimeLinearScan a develop flag Message-ID: <0qF035kq8V0ehMhAdf1_ptnYwXHdtifMIazkoSn5laI=.54fb7a02-ca09-4f25-a920-1b13ff49ff78@github.com> Hi, Please help review this change that makes TimeLinearScan a develop flag. Currently, TimeLinearScan is only used in code guarded by '#ifndef PRODUCT'. 
------------- Commit messages: - 8327379: Make TimeLinearScan a develop flag Changes: https://git.openjdk.org/jdk/pull/18125/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18125&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327379 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18125.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18125/head:pull/18125 PR: https://git.openjdk.org/jdk/pull/18125 From ddong at openjdk.org Tue Mar 5 16:13:09 2024 From: ddong at openjdk.org (Denghui Dong) Date: Tue, 5 Mar 2024 16:13:09 GMT Subject: RFR: 8327379: Make TimeLinearScan a develop flag [v2] In-Reply-To: <0qF035kq8V0ehMhAdf1_ptnYwXHdtifMIazkoSn5laI=.54fb7a02-ca09-4f25-a920-1b13ff49ff78@github.com> References: <0qF035kq8V0ehMhAdf1_ptnYwXHdtifMIazkoSn5laI=.54fb7a02-ca09-4f25-a920-1b13ff49ff78@github.com> Message-ID: > Hi, > > Please help review this change that makes TimeLinearScan a develop flag. > > Currently, TimeLinearScan is only used in code guarded by '#ifndef PRODUCT'. We should move it to develop or maybe notproduct. 
Denghui Dong has updated the pull request incrementally with one additional commit since the last revision: update header ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18125/files - new: https://git.openjdk.org/jdk/pull/18125/files/6706a1e9..a242dc19 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18125&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18125&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18125.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18125/head:pull/18125 PR: https://git.openjdk.org/jdk/pull/18125 From epeter at openjdk.org Tue Mar 5 16:48:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 5 Mar 2024 16:48:58 GMT Subject: RFR: 8327172: C2 SuperWord: data node in loop has no input in loop: replace assert with bailout Message-ID: <4CbSaPG5DEt47GreYegAKBIbf8TJyTio7-y2gBg0WN8=.419560f1-cbda-41da-b68f-76d72ee9c489@github.com> This is a regression fix from https://github.com/openjdk/jdk/pull/17657. I had never encountered an example where a data node in the loop body did not have any input node in the loop. My assumption was that this should never happen; such a node should move out of the loop itself. I now encountered such an example. But I think it shows that there are cases where we compute the ctrl wrong. https://github.com/openjdk/jdk/blob/8835f786b8dc7db1ebff07bbb3dbb61a6c42f6c8/test/hotspot/jtreg/compiler/loopopts/superword/TestNoInputInLoop.java#L65-L73 I now had a few options: 1. Revert to the code before https://github.com/openjdk/jdk/pull/17657: handle such cases with the extra `data_entry` logic. But this would just be extra complexity for patterns that should not exist in the first place. 2. Fix the computation of ctrl. But we know that there are many edge cases that are currently wrong, and I am working on verification and fixing these issues in https://github.com/openjdk/jdk/pull/16558.
So I would rather fix those pre-existing issues separately. 3. Just create a silent bailout from vectorization, with `VStatus::make_failure`. I chose option 3, since it allows simple logic, and only prevents vectorization in cases that are already otherwise broken. ------------- Commit messages: - the fix - 8327172 Changes: https://git.openjdk.org/jdk/pull/18123/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18123&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327172 Stats: 103 lines in 3 files changed: 100 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18123.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18123/head:pull/18123 PR: https://git.openjdk.org/jdk/pull/18123 From chagedorn at openjdk.org Tue Mar 5 17:01:45 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 5 Mar 2024 17:01:45 GMT Subject: RFR: 8327172: C2 SuperWord: data node in loop has no input in loop: replace assert with bailout In-Reply-To: <4CbSaPG5DEt47GreYegAKBIbf8TJyTio7-y2gBg0WN8=.419560f1-cbda-41da-b68f-76d72ee9c489@github.com> References: <4CbSaPG5DEt47GreYegAKBIbf8TJyTio7-y2gBg0WN8=.419560f1-cbda-41da-b68f-76d72ee9c489@github.com> Message-ID: On Tue, 5 Mar 2024 14:53:33 GMT, Emanuel Peter wrote: > This is a regression fix from https://github.com/openjdk/jdk/pull/17657. > > I had never encountered an example where a data node in the loop body did not have any input node in the loop. > My assumption was that this should never happen, such a node should move out of the loop itself. > > I now encountered such an example. But I think it shows that there are cases where we compute the ctrl wrong. > > https://github.com/openjdk/jdk/blob/8835f786b8dc7db1ebff07bbb3dbb61a6c42f6c8/test/hotspot/jtreg/compiler/loopopts/superword/TestNoInputInLoop.java#L65-L73 > > I now had a few options: > 1. Revert to the code before https://github.com/openjdk/jdk/pull/17657: handle such cases with the extra `data_entry` logic. 
But this would just be extra complexity for patterns that should not exist in the first place. > 2. Fix the computation of ctrl. But we know that there are many edge cases that are currently wrong, and I am working on verification and fixing these issues in https://github.com/openjdk/jdk/pull/16558. So I would rather fix those pre-existing issues separately. > 3. Just create a silent bailout from vectorization, with `VStatus::make_failure`. > > I chose option 3, since it allows simple logic, and only prevents vectorization in cases that are already otherwise broken. That looks reasonable. I agree to fix the ctrl issues separately and go with a bailout solution for now. Maybe you want to add a note at [JDK-8307982](https://bugs.openjdk.org/browse/JDK-8307982) to not forget about this case here. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18123#pullrequestreview-1917612473 From kvn at openjdk.org Tue Mar 5 17:02:47 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 5 Mar 2024 17:02:47 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v4] In-Reply-To: <8heX7a8FEHXGS5WM2ryS63uCJmQcsATp-hULdo87IDw=.f8d56906-dd4a-40ac-9538-6affe9b5353c@github.com> References: <8heX7a8FEHXGS5WM2ryS63uCJmQcsATp-hULdo87IDw=.f8d56906-dd4a-40ac-9538-6affe9b5353c@github.com> Message-ID: On Tue, 5 Mar 2024 06:40:00 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platforms. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > 1 check _disable_warnings in adlc 2 Fix error in arm_32.ad This looks good now. ------------- Marked as reviewed by kvn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/18075#pullrequestreview-1917613973 From sviswanathan at openjdk.org Tue Mar 5 18:51:45 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 5 Mar 2024 18:51:45 GMT Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v2] In-Reply-To: References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: On Tue, 5 Mar 2024 00:08:05 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. >> >> Below is the performance data on an Intel Tiger Lake machine. >> >> Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup >> -- | -- | -- | -- >> MathBench.ceilDouble | 547979 | 2170198 | 3.96 >> MathBench.floorDouble | 547979 | 2167459 | 3.96 >> MathBench.rintDouble | 547962 | 2130499 | 3.89 > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > unify the implementation Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer).
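For reference, the rounding semantics these three intrinsics must preserve can be sketched in portable C++ (illustrative only; the PR under review changes the x86 code generation via `roundsd`, not these semantics — the function names below are placeholders):

```cpp
#include <cassert>
#include <cmath>

// Directed roundings: ceil always rounds toward +infinity, floor toward -infinity.
double ceil_d(double x)  { return std::ceil(x); }
double floor_d(double x) { return std::floor(x); }

// Round to nearest, ties to even: the behavior Math.rint specifies, and what
// std::rint produces in the default FE_TONEAREST rounding mode.
double rint_d(double x)  { return std::rint(x); }
```

Note the ties-to-even behavior: `rint_d(0.5)` is `0.0` while `rint_d(1.5)` and `rint_d(2.5)` are both `2.0`, which is what distinguishes rint from the directed roundings.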
PR Review: https://git.openjdk.org/jdk/pull/18089#pullrequestreview-1917873194 From kvn at openjdk.org Tue Mar 5 18:55:46 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 5 Mar 2024 18:55:46 GMT Subject: RFR: 8327172: C2 SuperWord: data node in loop has no input in loop: replace assert with bailout In-Reply-To: <4CbSaPG5DEt47GreYegAKBIbf8TJyTio7-y2gBg0WN8=.419560f1-cbda-41da-b68f-76d72ee9c489@github.com> References: <4CbSaPG5DEt47GreYegAKBIbf8TJyTio7-y2gBg0WN8=.419560f1-cbda-41da-b68f-76d72ee9c489@github.com> Message-ID: On Tue, 5 Mar 2024 14:53:33 GMT, Emanuel Peter wrote: > This is a regression fix from https://github.com/openjdk/jdk/pull/17657. > > I had never encountered an example where a data node in the loop body did not have any input node in the loop. > My assumption was that this should never happen, such a node should move out of the loop itself. > > I now encountered such an example. But I think it shows that there are cases where we compute the ctrl wrong. > > https://github.com/openjdk/jdk/blob/8835f786b8dc7db1ebff07bbb3dbb61a6c42f6c8/test/hotspot/jtreg/compiler/loopopts/superword/TestNoInputInLoop.java#L65-L73 > > I now had a few options: > 1. Revert to the code before https://github.com/openjdk/jdk/pull/17657: handle such cases with the extra `data_entry` logic. But this would just be extra complexity for patterns that should not exist in the first place. > 2. Fix the computation of ctrl. But we know that there are many edge cases that are currently wrong, and I am working on verification and fixing these issues in https://github.com/openjdk/jdk/pull/16558. So I would rather fix those pre-existing issues separately. > 3. Just create a silent bailout from vectorization, with `VStatus::make_failure`. > > I chose option 3, since it allows simple logic, and only prevents vectorization in cases that are already otherwise broken. Looks good. ------------- Marked as reviewed by kvn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/18123#pullrequestreview-1917880586 From dlong at openjdk.org Tue Mar 5 22:40:47 2024 From: dlong at openjdk.org (Dean Long) Date: Tue, 5 Mar 2024 22:40:47 GMT Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v2] In-Reply-To: References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: On Tue, 5 Mar 2024 00:08:05 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. >> >> Below is the performance data on an Intel Tiger Lake machine. >> >> Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup >> -- | -- | -- | -- >> MathBench.ceilDouble | 547979 | 2170198 | 3.96 >> MathBench.floorDouble | 547979 | 2167459 | 3.96 >> MathBench.rintDouble | 547962 | 2130499 | 3.89 > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > unify the implementation So if we can still generate the non-AVX encoding of `roundsd dst, src, mode` isn't there still a false dependency problem with `dst`? ------------- PR Comment: https://git.openjdk.org/jdk/pull/18089#issuecomment-1979755885 From ksakata at openjdk.org Wed Mar 6 00:33:45 2024 From: ksakata at openjdk.org (Koichi Sakata) Date: Wed, 6 Mar 2024 00:33:45 GMT Subject: RFR: 8323242: Remove vestigial DONT_USE_REGISTER_DEFINES Message-ID: This pull request removes an unnecessary directive.
There is no definition of DONT_USE_REGISTER_DEFINES in HotSpot or the build system, so this `#ifndef` conditional directive is always true. We can remove it. I built OpenJDK with Zero VM as a test. It was successful. $ ./configure --with-jvm-variants=zero --enable-debug $ make images $ ./build/macosx-aarch64-zero-fastdebug/jdk/bin/java -version openjdk version "23-internal" 2024-09-17 OpenJDK Runtime Environment (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk) OpenJDK 64-Bit Zero VM (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk, interpreted mode) It may be possible to remove the `#define noreg` as well because the CONSTANT_REGISTER_DECLARATION macro creates a variable named noreg, but I can't be sure. When I tried removing the noreg definition and building the OpenJDK, the build was successful. ------------- Commit messages: - Remove DONT_USE_REGISTER_DEFINES Changes: https://git.openjdk.org/jdk/pull/18115/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18115&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8323242 Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18115.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18115/head:pull/18115 PR: https://git.openjdk.org/jdk/pull/18115 From gli at openjdk.org Wed Mar 6 00:33:45 2024 From: gli at openjdk.org (Guoxiong Li) Date: Wed, 6 Mar 2024 00:33:45 GMT Subject: RFR: 8323242: Remove vestigial DONT_USE_REGISTER_DEFINES In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 08:07:19 GMT, Koichi Sakata wrote: > This pull request removes an unnecessary directive.
> > > $ ./configure --with-jvm-variants=zero --enable-debug > $ make images > $ ./build/macosx-aarch64-zero-fastdebug/jdk/bin/java -version > openjdk version "23-internal" 2024-09-17 > OpenJDK Runtime Environment (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk) > OpenJDK 64-Bit Zero VM (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk, interpreted mode) > > > It may be possible to remove the `#define noreg` as well because the CONSTANT_REGISTER_DECLARATION macro creates a variable named noreg, but I can't be sure. When I tried removing the noreg definition and building the OpenJDK, the build was successful. Looks good. Some related issues: [JDK-8269122](https://bugs.openjdk.org/browse/JDK-8269122) [JDK-8282085](https://bugs.openjdk.org/browse/JDK-8282085) [JDK-8200168](https://bugs.openjdk.org/browse/JDK-8200168) [JDK-8297445](https://bugs.openjdk.org/browse/JDK-8297445) Please fix the title of the issue or this PR. ------------- Marked as reviewed by gli (Committer). PR Review: https://git.openjdk.org/jdk/pull/18115#pullrequestreview-1917283298 PR Comment: https://git.openjdk.org/jdk/pull/18115#issuecomment-1978967341 From duke at openjdk.org Wed Mar 6 02:25:51 2024 From: duke at openjdk.org (Joshua Cao) Date: Wed, 6 Mar 2024 02:25:51 GMT Subject: RFR: 8327201: C2: Uninitialized VLoop::_pre_loop_end after JDK-8324890 Message-ID: As Aleksey pointed out, the issue seems innocuous. It seems that all code that uses `pre_loop_end` are called from the main loop, and the field is always initialized for main loops. But we should still avoid uninitialized fields. Passing hotspot tier1 locally on my Linux machine. 
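The hazard being fixed can be sketched in isolation (hypothetical class and field names, not the actual VLoop code): a member left out of the constructor initializer list holds an indeterminate value in C++, so the safe pattern is to initialize every field up front even when only one analysis path later fills it in.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch (not the actual VLoop code): a field that is only
// meaningful for main loops is still default-initialized, so the object
// never carries an indeterminate value.
class LoopInfo {
 public:
  // The fix: initialize every member in the constructor initializer list.
  LoopInfo() : _is_main_loop(false), _pre_loop_end(nullptr) {}

  void mark_main_loop(const char* pre_loop_end_node) {
    _is_main_loop = true;
    _pre_loop_end = pre_loop_end_node;
  }

  bool is_main_loop() const { return _is_main_loop; }

  const char* pre_loop_end() const {
    assert(_is_main_loop);  // callers only query this on main loops
    return _pre_loop_end;
  }

 private:
  bool        _is_main_loop;
  const char* _pre_loop_end;
};
```

Even if every current caller goes through the main-loop path, the default initialization keeps a future misuse deterministic instead of reading garbage.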
------------- Commit messages: - 8327201: C2: Uninitialized VLoop::_pre_loop_end after JDK-8324890 Changes: https://git.openjdk.org/jdk/pull/18130/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18130&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327201 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18130.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18130/head:pull/18130 PR: https://git.openjdk.org/jdk/pull/18130 From amitkumar at openjdk.org Wed Mar 6 02:42:46 2024 From: amitkumar at openjdk.org (Amit Kumar) Date: Wed, 6 Mar 2024 02:42:46 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: <1xaaqBkBXk0Cor7yFYD4yMht4UOHuR-u5W5lilxr2R0=.9e9fe3c0-03c2-4f83-a0f3-9a06cdd8dacb@github.com> References: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> <-otSCbPTibs2l97RfYxBNnTGUjRnUft6UrIVk33woSQ=.c1604e01-0f8b-46ea-9cd8-809775f669b9@github.com> <1xaaqBkBXk0Cor7yFYD4yMht4UOHuR-u5W5lilxr2R0=.9e9fe3c0-03c2-4f83-a0f3-9a06cdd8dacb@github.com> Message-ID: On Tue, 5 Mar 2024 06:44:44 GMT, kuaiwei wrote: >>> @dholmes-ora asked to disable warning by default and I agree. >> >> What I said was that until all the known issues are resolved then the warning should be disabled. If this PR fixes every warning that has been spotted then that is fine - the warning can remain on to detect new problems creeping in. Otherwise issues should be filed to fix all remaining warnings and the warning disabled until they are all addressed. We have been lucky that these unexpected warnings have only caused minimal disruption to our builds. > >> @dholmes-ora asked to disable warning by default and I agree. >> >> We can use `AD._disable_warnings` flag to guard these warnings and add corresponding `-w` flag to `adlc` command in `GensrcAdlc.gmk` > > I added check of _disable_warnings in adlc. But not enable it in build script. 
@kuaiwei you need one more approval from **R**eviewer, before integrating hotspot change. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1979975733 From duke at openjdk.org Wed Mar 6 04:03:45 2024 From: duke at openjdk.org (kuaiwei) Date: Wed, 6 Mar 2024 04:03:45 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: <1xaaqBkBXk0Cor7yFYD4yMht4UOHuR-u5W5lilxr2R0=.9e9fe3c0-03c2-4f83-a0f3-9a06cdd8dacb@github.com> References: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> <-otSCbPTibs2l97RfYxBNnTGUjRnUft6UrIVk33woSQ=.c1604e01-0f8b-46ea-9cd8-809775f669b9@github.com> <1xaaqBkBXk0Cor7yFYD4yMht4UOHuR-u5W5lilxr2R0=.9e9fe3c0-03c2-4f83-a0f3-9a06cdd8dacb@github.com> Message-ID: On Tue, 5 Mar 2024 06:44:44 GMT, kuaiwei wrote: >>> @dholmes-ora asked to disable warning by default and I agree. >> >> What I said was that until all the known issues are resolved then the warning should be disabled. If this PR fixes every warning that has been spotted then that is fine - the warning can remain on to detect new problems creeping in. Otherwise issues should be filed to fix all remaining warnings and the warning disabled until they are all addressed. We have been lucky that these unexpected warnings have only caused minimal disruption to our builds. > >> @dholmes-ora asked to disable warning by default and I agree. >> >> We can use `AD._disable_warnings` flag to guard these warnings and add corresponding `-w` flag to `adlc` command in `GensrcAdlc.gmk` > > I added check of _disable_warnings in adlc. But not enable it in build script. > @kuaiwei you need one more approval from **R**eviewer, before integrating hotspot change. Ok , I will wait for another review. May I rollback the integrate request? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1980048043 From gli at openjdk.org Wed Mar 6 04:08:46 2024 From: gli at openjdk.org (Guoxiong Li) Date: Wed, 6 Mar 2024 04:08:46 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: References: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> <-otSCbPTibs2l97RfYxBNnTGUjRnUft6UrIVk33woSQ=.c1604e01-0f8b-46ea-9cd8-809775f669b9@github.com> <1xaaqBkBXk0Cor7yFYD4yMht4UOHuR-u5W5lilxr2R0=.9e9fe3c0-03c2-4f83-a0f3-9a06cdd8dacb@github.com> Message-ID: On Wed, 6 Mar 2024 04:01:05 GMT, kuaiwei wrote: > Ok , I will wait for another review. May I rollback the integrate request? You can use the command `/reviewers 2 reviewer` to impose restriction. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1980052009 From vlivanov at openjdk.org Wed Mar 6 04:31:45 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 6 Mar 2024 04:31:45 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v4] In-Reply-To: <8heX7a8FEHXGS5WM2ryS63uCJmQcsATp-hULdo87IDw=.f8d56906-dd4a-40ac-9538-6affe9b5353c@github.com> References: <8heX7a8FEHXGS5WM2ryS63uCJmQcsATp-hULdo87IDw=.f8d56906-dd4a-40ac-9538-6affe9b5353c@github.com> Message-ID: <3v2QNFwDKc2i5JDyiacYG6z4uLu4kUdvdjZGdzmryo4=.e84c917a-4379-4102-856c-b929a0ea384b@github.com> On Tue, 5 Mar 2024 06:40:00 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > 1 check _disable_warnings in adlc 2 Fix error in arm_32.ad Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/18075#pullrequestreview-1918722448 From vlivanov at openjdk.org Wed Mar 6 04:35:46 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 6 Mar 2024 04:35:46 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: <_kIYOZBn9wfHo5YoIxkOul4P0sZJXEZ4fAbsczPxy_Q=.4f13a415-fc2e-4f1f-86c3-e648537c2abf@github.com> On Mon, 4 Mar 2024 18:18:29 GMT, Vladimir Kozlov wrote: >>> For aarch64 do we need also change _.m4 files? Anything in aarch64_vector_ files? >>> >>> I will run our testing with current patch. >> >> I'm not clear about m4 file. How do we use it? >> The build script of jdk will combine all ad file into one single file and adlc will compile it. So aarch64_vector.ad will be checked as well. > >> I'm not clear about m4 file. How do we use it? >> The build script of jdk will combine all ad file into one single file and adlc will compile it. So aarch64_vector.ad will be checked as well. > > Aarch64 m4 files are used to manually update .ad files. My concern was that could be overlapped code in m4 files which may overwrite your changes in .ad when someone do such manual update in a future. > Fortunately `aarch64_ad.m4` does not have operand definitions so your changes are fine. > But `aarch64_vector_ad.m4` has them so if we need to change `aarch64_vector.ad` we need to modify m4 file too. No need to retract integration request. As the bot reported earlier, you need a Committer to sponsor the PR. But, please, wait until @vnkozlov confirms that testing results are good.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1980071400 From gcao at openjdk.org Wed Mar 6 05:15:05 2024 From: gcao at openjdk.org (Gui Cao) Date: Wed, 6 Mar 2024 05:15:05 GMT Subject: RFR: 8327426: RISC-V: Move alignment shim into initialize_header() in C1_MacroAssembler::allocate_array Message-ID: Hi, I noticed that RISC-V missed this change from https://github.com/openjdk/jdk/pull/11044 , comments as follows [1]: `I know @albertnetymk already touched on this but some thoughts on the unclear boundaries between the header and the data. My feeling is that the most pragmatic solution would be to have the header initialization always initialize up to the word aligned (up) header_size_in_bytes. (Similarly to how it is done for the instanceOop where the klass gap gets initialized with the header, even if it may be data.) And have the body initialization do the rest (word aligned to word aligned clear).` `This seems preferable than adding these extra alignment shims in-between the header and body/payload/data initialization. (I also tried moving the alignment fix into the body initialization, but it seems a little bit messier in the implementation.)` After this patch, it will be more consistent with other CPU platforms like X86 and ARM64.
[1] https://github.com/openjdk/jdk/pull/11044#pullrequestreview-1894323275 ### Tests - [x] Run tier1-3 tests on SiFive unmatched (release) ------------- Commit messages: - 8327426: RISC-V: Move alignment shim into initialize_header() in C1_MacroAssembler::allocate_array Changes: https://git.openjdk.org/jdk/pull/18131/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18131&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327426 Stats: 16 lines in 1 file changed: 6 ins; 7 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18131.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18131/head:pull/18131 PR: https://git.openjdk.org/jdk/pull/18131 From jkarthikeyan at openjdk.org Wed Mar 6 06:13:02 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 6 Mar 2024 06:13:02 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v6] In-Reply-To: References: Message-ID: > Hi all, I've created this patch which aims to convert common integer minimum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`. While this transform doesn't generally improve code generation on its own, it simplifies control flow and creates new opportunities for vectorization. > > I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: > > Baseline Patch Improvement > Benchmark Mode Cnt Score Error Units Score Error Units > IfMinMax.testReductionInt avgt 15 500.307 ± 16.687 ns/op 509.383 ± 32.645 ns/op (no change)* > IfMinMax.testReductionLong avgt 15 493.184 ± 17.596 ns/op 513.587 ± 28.339 ns/op (no change)* > IfMinMax.testSingleInt avgt 15 3.588 ± 0.540 ns/op 2.965 ± 1.380 ns/op (no change) > IfMinMax.testSingleLong avgt 15 3.673 ± 0.128 ns/op 3.506 ± 0.590 ns/op (no change) > IfMinMax.testVectorInt avgt 15 340.425 ± 13.123 ns/op 59.689 ± 7.509 ns/op + 5.7x > IfMinMax.testVectorLong avgt 15 326.420 ± 15.554 ns/op 117.190 ± 5.622 ns/op + 2.8x > > > * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? > > The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated! Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Change transform to work on CMoves ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17574/files - new: https://git.openjdk.org/jdk/pull/17574/files/76424e28..2adebb73 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17574&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17574&range=04-05 Stats: 155 lines in 3 files changed: 78 ins; 69 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/17574.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17574/head:pull/17574 PR: https://git.openjdk.org/jdk/pull/17574 From jkarthikeyan at openjdk.org Wed Mar 6 06:13:02 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 6 Mar 2024 06:13:02 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: <-Cwct-5ZBYHEG-67r6xe4by7s0rI7w27ogfdJcIEBrw=.e4c95ed8-6353-46b9-a946-3bf2b2c47765@github.com> References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> <_yIQLmJFXOolbLAS8Wcxgl1juRlQwB0OWkKd8ZMcfmg=.9ed4a52d-9ffb-45eb-a0dc-7b3201974882@github.com> <-Cwct-5ZBYHEG-67r6xe4by7s0rI7w27ogfdJcIEBrw=.e4c95ed8-6353-46b9-a946-3bf2b2c47765@github.com> Message-ID: On Tue, 5 Mar 2024 11:14:51 GMT, Emanuel Peter wrote: >>> You mean you would be matching for a `Cmp -> CMove` node pattern that is equivalent for `Min/Max`, rather than matching a `Cmp -> If -> Phi` pattern?
>> >> Yeah, I was thinking it might be better to let the CMove transform happen first, since the conditions guarding both transforms are aiming to do the same thing in essence. My thought was that if the regression in your `testCostDifference` was fixed, it would be better to not have to do that fix in two different locations, since it impacts `is_minmax` as well. >> >>> BTW, I watched a fascinating talk about branch-predictors / branchless code yesterday >> >> Thank you for linking this talk, it was really insightful! I also wonder if it would be possible to capture branch execution patterns somehow, to drive branch flattening optimizations. I figure it could be possible to keep track of the sequence of a branch's history of execution, and then compute some "entropy" value from that sequence to determine if there's a pattern, or if it's random and likely to be mispredicted. However, implementing that in practice sounds pretty difficult. >> >> @eme64 I've pushed a commit that fixes the benchmarks and sets the loop iteration count to 10_000. Could you check if this lets it vectorize on your machine? Thanks! > > @jaskarth Why don't you first make the code change with starting from a `Cmp -> CMove` pattern rather than the `Cmp -> If -> Phi` pattern. Then I can look at both things together ;) @eme64 Sure, I've updated the patch accordingly :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1980154585 From gli at openjdk.org Wed Mar 6 06:19:49 2024 From: gli at openjdk.org (Guoxiong Li) Date: Wed, 6 Mar 2024 06:19:49 GMT Subject: RFR: 8327379: Make TimeLinearScan a develop flag [v2] In-Reply-To: References: <0qF035kq8V0ehMhAdf1_ptnYwXHdtifMIazkoSn5laI=.54fb7a02-ca09-4f25-a920-1b13ff49ff78@github.com> Message-ID: On Tue, 5 Mar 2024 16:13:09 GMT, Denghui Dong wrote: >> Hi, >> >> Please help review this change that makes TimeLinearScan a develop flag. >> >> Currently, TimeLinearScan is only used in code guarded by '#ifndef PRODUCT'. 
We should move it to develop or maybe notproduct. > > Denghui Dong has updated the pull request incrementally with one additional commit since the last revision: > > update header The patch looks good. But I don't really know whether it deserves to do that. ------------- Marked as reviewed by gli (Committer). PR Review: https://git.openjdk.org/jdk/pull/18125#pullrequestreview-1918830275 From chagedorn at openjdk.org Wed Mar 6 07:07:44 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 6 Mar 2024 07:07:44 GMT Subject: RFR: 8327201: C2: Uninitialized VLoop::_pre_loop_end after JDK-8324890 In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 02:21:10 GMT, Joshua Cao wrote: > As Aleksey pointed out, the issue seems innocuous. It seems that all code that uses `pre_loop_end` are called from the main loop, and the field is always initialized for main loops. But we should still avoid uninitialized fields. > > Passing hotspot tier1 locally on my Linux machine. Looks good and trivial. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18130#pullrequestreview-1918916720 From roberto.castaneda.lozano at oracle.com Wed Mar 6 08:58:36 2024 From: roberto.castaneda.lozano at oracle.com (Roberto Castaneda Lozano) Date: Wed, 6 Mar 2024 08:58:36 +0000 Subject: [External] : Re: A case where G1/Shenandoah satb barrier is not optimized? In-Reply-To: References: <4d7f6d11-824b-47d0-8419-06694f695745.yude.lyd@alibaba-inc.com>, Message-ID: The JEP is still in draft mode, so no targeted JDK release yet. My hope is that it will be accepted as a Candidate JEP in the upcoming weeks. Cheers, Roberto ________________________________________ From: Yude Lin Sent: Wednesday, March 6, 2024 9:22 AM To: Roberto Castaneda Lozano; hotspot-gc-dev; hotspot-compiler-dev at openjdk.org Subject: [External] : Re: A case where G1/Shenandoah satb barrier is not optimized? Thanks Roberto and Thomas. 
By the way, is late barrier expansion aiming at a certain JDK release? Cheers, Yude ------------------------------------------------------------------ From: Roberto Castaneda Lozano Send Time: 2024年3月5日(星期二) 18:19 To: hotspot-gc-dev ; Yude Lin ; hotspot-compiler-dev at openjdk.org Subject: Re: A case where G1/Shenandoah satb barrier is not optimized? Hi Yude Lin (including hotspot-compiler-dev mailing list), From what I read in the original JBS issue [1], the g1_can_remove_pre_barrier/g1_can_remove_post_barrier optimization targets writes within simple constructors (such as that of Node within java.util.HashMap [2]), and seems to assume that the situation you describe (several writes to the same field) is either uncommon within this scope or can be reduced by the compiler into a form that is optimizable. In your example, one would hope that the compiler proves that 'ref = a' is redundant and optimizes it away (which would lead to removing all barriers), but this optimization is inhibited by the barrier operations inserted by the compiler in its intermediate representation. These limitations will become easier to overcome with the "Late G1 Barrier Expansion" JEP (in draft status), which proposes hiding barrier code from the compiler's transformations and optimizations [3]. In fact, our current "Late G1 Barrier Expansion" prototype does optimize 'ref = a' away, and removes all barriers in your example. Cheers, Roberto [1] https://bugs.openjdk.org/browse/JDK-8057737 [2] https://github.com/openjdk/jdk/blob/e9adcebaf242843fe2004b01747b5a930b62b291/src/java.base/share/classes/java/util/HashMap.java#L287-L292 [3] https://bugs.openjdk.org/browse/JDK-8322295 ________________________________________ From: hotspot-gc-dev on behalf of Yude Lin Sent: Monday, March 4, 2024 11:32 AM To: hotspot-gc-dev Subject: A case where G1/Shenandoah satb barrier is not optimized? Hi Dear GC devs, I found a case where GC barriers cannot be optimized out.
I wonder if anyone could enlighten me on this code: > G1BarrierSetC2::g1_can_remove_pre_barrier (or ShenandoahBarrierSetC2::satb_can_remove_pre_barrier) where there is a condition: > (captured_store == nullptr || captured_store == st_init->zero_memory()) on the store that can be optimized out. The comment says: > The compiler needs to determine that the object in which a field is about > to be written is newly allocated, and that no prior store to the same field > has happened since the allocation. But my understanding is satb barriers of any number of stores immediately (i.e., no in-between safepoints) after an allocation can be optimized out, same field or not. The "no prior store" condition confuses me. What's more, failing to optimize one satb barrier will prevent further barrier optimization that otherwise would be done (maybe due to control flow complexity from the satb barrier). An example would be: public static class TwoFieldObject { public Object ref; public Object ref2; public TwoFieldObject(Object a) { ref = a; } } public static Object testWrite(Object a, Object b, Object c) { TwoFieldObject tfo = new TwoFieldObject(a); tfo.ref = b; // satb barrier of this store cannot be optimized out, and because of its existence, post barrier will also not be optimized out tfo.ref2 = c; // because of the previous store's barriers, pre/post barriers of this store will not be optimized out return tfo; } From roland at openjdk.org Wed Mar 6 09:00:57 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 6 Mar 2024 09:00:57 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: On Tue, 5 Mar 2024 15:55:12 GMT, Emanuel Peter wrote: >> This is a feature requested by @RogerRiggs and
@cl4es . >> >> **Idea** >> >> Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. >> >> This patch here supports a few simple use-cases, like these: >> >> Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 >> >> Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 >> >> The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 >> >> **Details** >> >> This draft currently implements the optimization in an additional special IGVN phase: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 >> >> We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). 
During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. We essentially try to establish a chain of mergable stores: >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 >> >> Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either bot... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > a little bit of casting for debug printing code Do you intend to add an IR test case? src/hotspot/share/opto/compile.cpp line 2927: > 2925: } > 2926: > 2927: void Compile::gather_nodes_for_merge_stores(PhaseIterGVN &igvn) { This is going away, right? src/hotspot/share/opto/memnode.cpp line 2802: > 2800: StoreNode* use = can_merge_primitive_array_store_with_use(phase, true); > 2801: if (use != nullptr) { > 2802: return nullptr; Do you want to assert that the use is in the igvn worklist? src/hotspot/share/opto/memnode.cpp line 2971: > 2969: // The goal is to check if two such ArrayPointers are adjacent for a load or store. > 2970: // > 2971: // Note: we accumulate all constant offsets into constant_offset, even the int constant behind Is this really needed? For the patterns of interest, aren't the constant pushed down the chain of `AddP` nodes so the address is `(AddP base (AddP ...) constant)`? src/hotspot/share/opto/memnode.cpp line 3146: > 3144: Node* ctrl_s1 = s1->in(MemNode::Control); > 3145: Node* ctrl_s2 = s2->in(MemNode::Control); > 3146: if (ctrl_s1 != ctrl_s2) { Do you need to check that `ctrl_s1` and `ctrl_s2` are not null? I suppose this could be called on a dying part of the graph during igvn. 
src/hotspot/share/opto/memnode.cpp line 3154: > 3152: } > 3153: ProjNode* other_proj = ctrl_s1->as_IfProj()->other_if_proj(); > 3154: if (other_proj->is_uncommon_trap_proj(Deoptimization::Reason_range_check) == nullptr || This could be a range check for an unrelated array I suppose. Does it matter? src/hotspot/share/opto/memnode.hpp line 578: > 576: > 577: Node* Ideal_merge_primitive_array_stores(PhaseGVN* phase); > 578: StoreNode* can_merge_primitive_array_store_with_use(PhaseGVN* phase, bool check_def); If I understand correctly you need the `check_def ` parameter to avoid having `can_merge_primitive_array_store_with_use` and `can_merge_primitive_array_store_with_def` call each other indefinitely. But if I was to write new code that takes advantage of one of the two methods, I think I would be puzzled that there's a `check_def` parameter. Passing `false` would be wrong then but maybe not immediately obvious. Maybe it would be better to have `can_merge_primitive_array_store_with_def` with no `check_def` parameter and have all the work done in a utility method that takes a `check_def` parameter (always `true` when called from `can_merge_primitive_array_store_with_def`) ------------- PR Review: https://git.openjdk.org/jdk/pull/16245#pullrequestreview-1919069208 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1514032470 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1514033069 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1514071393 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1514057889 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1514062586 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1514051769 From kvn at openjdk.org Wed Mar 6 09:12:46 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 6 Mar 2024 09:12:46 GMT Subject: RFR: 8327379: Make TimeLinearScan a develop flag [v2] In-Reply-To: References: 
<0qF035kq8V0ehMhAdf1_ptnYwXHdtifMIazkoSn5laI=.54fb7a02-ca09-4f25-a920-1b13ff49ff78@github.com> Message-ID: On Tue, 5 Mar 2024 16:13:09 GMT, Denghui Dong wrote: >> Hi, >> >> Please help review this change that makes TimeLinearScan a develop flag. >> >> Currently, TimeLinearScan is only used in code guarded by '#ifndef PRODUCT'. We should move it to develop or maybe notproduct. > > Denghui Dong has updated the pull request incrementally with one additional commit since the last revision: > > update header Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18125#pullrequestreview-1919164251 From shade at openjdk.org Wed Mar 6 09:21:51 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 6 Mar 2024 09:21:51 GMT Subject: RFR: 8327201: C2: Uninitialized VLoop::_pre_loop_end after JDK-8324890 In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 02:21:10 GMT, Joshua Cao wrote: > As Aleksey pointed out, the issue seems innocuous. It seems that all code that uses `pre_loop_end` are called from the main loop, and the field is always initialized for main loops. But we should still avoid uninitialized fields. > > Passing hotspot tier1 locally on my Linux machine. Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/18130#pullrequestreview-1919183962 From duke at openjdk.org Wed Mar 6 09:21:51 2024 From: duke at openjdk.org (Joshua Cao) Date: Wed, 6 Mar 2024 09:21:51 GMT Subject: Integrated: 8327201: C2: Uninitialized VLoop::_pre_loop_end after JDK-8324890 In-Reply-To: References: Message-ID: <92O2-fcIQrWJVcI8wTrbrVe35_gvMchEpoU6QCSAO-A=.10ccbbf1-d387-443b-bc10-672d649473e5@github.com> On Wed, 6 Mar 2024 02:21:10 GMT, Joshua Cao wrote: > As Aleksey pointed out, the issue seems innocuous. It seems that all code that uses `pre_loop_end` are called from the main loop, and the field is always initialized for main loops. But we should still avoid uninitialized fields. 
> > Passing hotspot tier1 locally on my Linux machine. This pull request has now been integrated. Changeset: fbb422ec Author: Joshua Cao Committer: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/fbb422ece7ff61bc10ebafe48ecb7f17ea315682 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod 8327201: C2: Uninitialized VLoop::_pre_loop_end after JDK-8324890 Reviewed-by: chagedorn, shade ------------- PR: https://git.openjdk.org/jdk/pull/18130 From epeter at openjdk.org Wed Mar 6 10:16:51 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Mar 2024 10:16:51 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: On Wed, 6 Mar 2024 08:36:28 GMT, Roland Westrelin wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> a little bit of casting for debug printing code > > src/hotspot/share/opto/compile.cpp line 2927: > >> 2925: } >> 2926: >> 2927: void Compile::gather_nodes_for_merge_stores(PhaseIterGVN &igvn) { > > This is going away, right? Good catch, it is now dead code! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1514201265 From duke at openjdk.org Wed Mar 6 10:52:59 2024 From: duke at openjdk.org (Oussama Louati) Date: Wed, 6 Mar 2024 10:52:59 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v2] In-Reply-To: References: Message-ID: > Completion of the first version of the migration for several tests. > > These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: > > - Generate constant pool entries filled with method handles and method types. 
> - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. > - Produce many invokedynamic instructions with a specific constant pool entry. Oussama Louati has updated the pull request incrementally with 12 additional commits since the last revision: - Refactor generateCPEntryData method signature - Delete HandleType.java as it's not used anymore change method signature - Refactor createThrowRuntimeExceptionCodeHelper method signature - Optimize imports and fix bugs - Refactor GenFullCP.java: Import cleanup and bug fixes - Refactor code to use ClassFile.of().parse() method in GenFullCP.java - Refactor generateCPEntryData method to use ClassModel and ClassFile APIs - refactor to remove unnecessary whitespaces - Refactor createThrowRuntimeExceptionCodeHelper method to use classfile API - Fix indentation in GenManyIndyCorrectBootstrap.java - ... and 2 more: https://git.openjdk.org/jdk/compare/47f24fb6...03a5e325 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17834/files - new: https://git.openjdk.org/jdk/pull/17834/files/47f24fb6..03a5e325 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=00-01 Stats: 494 lines in 7 files changed: 106 ins; 104 del; 284 mod Patch: https://git.openjdk.org/jdk/pull/17834.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834 PR: https://git.openjdk.org/jdk/pull/17834 From fyang at openjdk.org Wed Mar 6 11:37:45 2024 From: fyang at openjdk.org (Fei Yang) Date: Wed, 6 Mar 2024 11:37:45 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 In-Reply-To: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Tue, 5 Mar 2024 07:41:05 GMT, Gui 
Cao wrote: > Hi, please review this patch that fixes the minimal build failure for riscv. > > Error log for minimal build: > > Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) > ^@/home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 > gmake[3]: *** Waiting for unfinished jobs.... > ^@gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 > gmake[2]: *** Waiting for unfinished jobs.... > ^@ > ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) > > === Output from failing command(s) repeated here === > * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > > * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. > === End of repeated output === > > No indication of failed target found. > HELP: Try searching the build log for '] Error'. > HELP: Run 'make doctor' to diagnose build problems.
> > make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 > make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 > > > The root cause is that MaxVectorSize is only defined under COMPILER2, we should use VM_Version::_initial_vector_length instead of MaxVectorSize. > > Testing: > > - [x] linux-riscv minimal fastdebug native build I agree with @robehn ! We can put those function definitions into an ifdef COMPILER2_OR_JVMCI block to avoid such a problem. I don't see other uses of them for now. ------------- PR Review: https://git.openjdk.org/jdk/pull/18114#pullrequestreview-1919503958 From rehn at openjdk.org Wed Mar 6 12:23:45 2024 From: rehn at openjdk.org (Robbin Ehn) Date: Wed, 6 Mar 2024 12:23:45 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 In-Reply-To: References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Tue, 5 Mar 2024 12:52:29 GMT, Robbin Ehn wrote: >> Hi, please review this patch that fixes the minimal build failure for riscv. >> >> Error log for minimal build: >> >> Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) >> ^@/home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 >> gmake[3]: *** Waiting for unfinished jobs....
>> ^@ >> ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) >> >> === Output from failing command(s) repeated here === >> * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> >> * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. >> === End of repeated output === >> >> No indication of failed target found. >> HELP: Try searching the build log for '] Error'. >> HELP: Run 'make doctor' to diagnose build problems. >> >> make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 >> make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 >> >> >> The root cause is that MaxVectorSize is only defined under COMPILER2, we should use VM_Version::_initial_vector_length instead of MaxVectorSize. >> >> Testing: >> >> - [... > The SHA intrinsics are only used in "LibraryCallKit::inline_digestBase_implCompress" and JVMCI. > So I think these (plus md5 and chacha) should be put into an ifdef COMPILER2_OR_JVMCI block. (I was going to do that but it slipped my mind) > > The MaxVectorSize is defined if JVMCI and/or C2 is defined: `NOT_COMPILER2(product(intx, MaxVectorSize, 64,` > I agree with @robehn ! We can put those function definitions into an ifdef COMPILER2_OR_JVMCI block to avoid such a problem. I don't see other uses of them for now.
This also makes it clear that C1/interpreter don't use them, hence if someone needs a speed up there they could try to make use of them. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18114#issuecomment-1980750493 From roland at openjdk.org Wed Mar 6 13:31:52 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 6 Mar 2024 13:31:52 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> Message-ID: On Mon, 26 Feb 2024 15:22:18 GMT, Emanuel Peter wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> 32 bit build fix > > src/hotspot/share/opto/compile.cpp line 2352: > >> 2350: if (failing()) return; >> 2351: >> 2352: inline_scoped_value_get_calls(igvn); > > Suggestion: > > inline_scoped_value_get_calls(igvn); > > Indentation was wrong. Was this a result of an IDE correcting the missing braces around the if above? The indentation in that part of `Compile::Optimize()` is wrong (only a single extra character indentation after opening brace at line 2327) and the indentation for that line is correct... but not in line with the code around it. I changed it but `Compile::Optimize()` is the one that would need to be fixed. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1514485609 From roland at openjdk.org Wed Mar 6 13:38:50 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 6 Mar 2024 13:38:50 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> Message-ID: On Mon, 26 Feb 2024 15:57:10 GMT, Emanuel Peter wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> 32 bit build fix > > src/hotspot/share/opto/loopPredicate.cpp line 1662: > >> 1660: T_ADDRESS, MemNode::unordered); >> 1661: _igvn.register_new_node_with_optimizer(handle_load); >> 1662: set_subtree_ctrl(handle_load, true); > > How impossible is it to share code with the similar code in `GraphKit`? We would need something like what is done for `Phase::gen_subtype_check()`, that is, move the code out of `GraphKit`, and we can't access the `GraphKit` helper methods (`basic_plus_adr()`, `make_load()` etc.) so the result would be less readable than the code in `GraphKit`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1514497015 From roland at openjdk.org Wed Mar 6 13:51:16 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 6 Mar 2024 13:51:16 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v10] In-Reply-To: References: Message-ID: <_AgilPfFs90WJmhoV-wkNSd8Rq6ojwfDNC2SYHgpbWQ=.728d3cb4-bca5-4831-8959-44198905e34f@github.com> > This change implements C2 optimizations for calls to > ScopedValue.get(). Indeed, in: > > > v1 = scopedValue.get(); > ... > v2 = scopedValue.get(); > > > `v2` can be replaced by `v1` and the second call to `get()` can be > optimized out.
That's true whatever is between the 2 calls unless a > new mapping for `scopedValue` is created in between (when that happens > no optimization is performed for the method being compiled). Hoisting > a `get()` call out of a loop for a loop invariant `scopedValue` should > also be legal in most cases. > > `ScopedValue.get()` is implemented in Java code as a 2-step process. A > cache is attached to the current thread object. If the `ScopedValue` > object is in the cache then the result from `get()` is read from > there. Otherwise a slow call is performed that also inserts the > mapping in the cache. The cache itself is lazily allocated. One > `ScopedValue` can be hashed to 2 different indexes in the cache. On a > cache probe, both indexes are checked. As a consequence, the process > of probing the cache is a multi-step process (check if the cache is > present, check first index, check second index if first index > failed). If the cache is populated early on, then when the method that > calls `ScopedValue.get()` is compiled, profile reports the slow path > as never taken and only the read from the cache is compiled. > > To perform the optimizations, I added 3 new node types to C2: > > - the pair > ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for > the cache probe > > - a cfg node ScopedValueGetResultNode to help locate the result of the > `get()` call in the IR graph. > > In pseudo code, once the nodes are inserted, the code of a `get()` is: > > > hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue) > if (hits_in_the_cache) { > res = ScopedValueGetLoadFromCache(hits_in_the_cache); > } else { > res = ..; //slow call possibly inlined. Subgraph can be arbitrarily complex > } > res = ScopedValueGetResult(res) > > > In the snippet: > > > v1 = scopedValue.get(); > ...
> v2 = scopedValue.get(); > > > Replacing `v2` by `v1` is then done by starting from the > `ScopedValueGetResult` node for the second `get()` and looking for a > dominating `ScopedValueGetResult` for the same `ScopedValue` > object. When one is found, it is used as a replacement. Eliminating > the second `get()` call is achieved by making > `ScopedValueGetHitsInCache` always successful if there's a dominating > `ScopedValueGetResult` and replacing its companion > `ScopedValueGetLoadFromCache` by the dominating > `ScopedValueGetResult`. > > Hoisting a `g... Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits: - review - Merge branch 'master' into JDK-8320649 - review - 32 bit build fix - fix & test - Merge branch 'master' into JDK-8320649 - review - review comment - Merge branch 'master' into JDK-8320649 - Update src/hotspot/share/opto/callGenerator.cpp Co-authored-by: Emanuel Peter - ... and 6 more: https://git.openjdk.org/jdk/compare/0583f735...57592601 ------------- Changes: https://git.openjdk.org/jdk/pull/16966/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16966&range=09 Stats: 2656 lines in 39 files changed: 2587 ins; 29 del; 40 mod Patch: https://git.openjdk.org/jdk/pull/16966.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16966/head:pull/16966 PR: https://git.openjdk.org/jdk/pull/16966 From roland at openjdk.org Wed Mar 6 13:51:16 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 6 Mar 2024 13:51:16 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> Message-ID: On Fri, 16 Feb 2024 09:40:17 GMT, Roland Westrelin wrote: >> This change implements C2 optimizations for calls to >> ScopedValue.get(). 
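As an aside for readers, the multi-step cache probe described in the PR text (lazily allocated cache, two candidate indexes per `ScopedValue`, slow call on a miss) can be modelled in plain Java. This is an illustrative sketch with invented hashing and table layout, not the real `java.lang.ScopedValue` implementation:

```java
import java.util.Objects;

// Toy model of the probe sequence: check that the cache exists, then check
// two candidate slots before declaring a miss. Hashing and layout here are
// assumptions made for illustration only.
public class TwoIndexCacheSketch {
    private Object[] table; // lazily allocated: [key0, value0, key1, value1, ...]

    private static int firstSlot(Object key, int len) {
        return Math.floorMod(key.hashCode(), len / 2) * 2;
    }

    private static int secondSlot(Object key, int len) {
        // A real implementation would derive this from a second hash;
        // this just picks a different slot deterministically.
        return Math.floorMod(key.hashCode() + 1, len / 2) * 2;
    }

    public Object probe(Object key) {
        Object[] t = table;
        if (t == null) {
            return slowPath(key);        // cache not allocated yet
        }
        int i = firstSlot(key, t.length);
        if (Objects.equals(t[i], key)) {
            return t[i + 1];             // hit at first index
        }
        int j = secondSlot(key, t.length);
        if (Objects.equals(t[j], key)) {
            return t[j + 1];             // hit at second index
        }
        return slowPath(key);            // miss at both indexes
    }

    private Object slowPath(Object key) {
        if (table == null) {
            table = new Object[32];      // lazy allocation
        }
        Object value = "computed:" + key; // stand-in for the real lookup
        int i = firstSlot(key, table.length);
        table[i] = key;
        table[i + 1] = value;
        return value;
    }
}
```

Once the slow path has populated the cache, every later probe for the same key takes only the fast path — which is the shape that profiling reports to the compiler.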
Indeed, in: >> >> >> v1 = scopedValue.get(); >> ... >> v2 = scopedValue.get(); >> >> >> `v2` can be replaced by `v1` and the second call to `get()` can be >> optimized out. That's true whatever is between the 2 calls unless a >> new mapping for `scopedValue` is created in between (when that happens >> no optimization is performed for the method being compiled). Hoisting >> a `get()` call out of a loop for a loop invariant `scopedValue` should >> also be legal in most cases. >> >> `ScopedValue.get()` is implemented in Java code as a 2-step process. A >> cache is attached to the current thread object. If the `ScopedValue` >> object is in the cache then the result from `get()` is read from >> there. Otherwise a slow call is performed that also inserts the >> mapping in the cache. The cache itself is lazily allocated. One >> `ScopedValue` can be hashed to 2 different indexes in the cache. On a >> cache probe, both indexes are checked. As a consequence, the process >> of probing the cache is a multi-step process (check if the cache is >> present, check first index, check second index if first index >> failed). If the cache is populated early on, then when the method that >> calls `ScopedValue.get()` is compiled, profile reports the slow path >> as never taken and only the read from the cache is compiled. >> >> To perform the optimizations, I added 3 new node types to C2: >> >> - the pair >> ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for >> the cache probe >> >> - a cfg node ScopedValueGetResultNode to help locate the result of the >> `get()` call in the IR graph. >> >> In pseudo code, once the nodes are inserted, the code of a `get()` is: >> >> >> hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue) >> if (hits_in_the_cache) { >> res = ScopedValueGetLoadFromCache(hits_in_the_cache); >> } else { >> res = ..; //slow call possibly inlined.
>> Subgraph can be arbitrarily complex >> } >> res = ScopedValueGetResult(res) >> >> >> In the snippet: >> >> >> v1 = scopedValue.get(); >> ... >> v2 = scopedValue.get(); >> >> >> Replacing `v2` by `v1` is then done by starting from the >> `ScopedValueGetResult` node for the second `get()` and looking for a >> dominating `ScopedValueGetResult` for the same `ScopedValue` >> object. When one is found, it is used as a replacement. Eliminating >> the second `get()` call is achieved by making >> `ScopedValueGetHitsInCache` always successful if there's a dominating >> `Scoped... > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > 32 bit build fix I pushed a new set of changes that: 1) address most of your comments 2) fix the merge conflict. I didn't make the change you suggested to the comments because, for pattern matching, I use the actual Java code from `ScopedValue.get()`. I think it's easier that way to see what's being pattern matched. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-1980916522 From roland at openjdk.org Wed Mar 6 13:51:16 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 6 Mar 2024 13:51:16 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> Message-ID: On Mon, 26 Feb 2024 16:09:09 GMT, Emanuel Peter wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> 32 bit build fix > > src/hotspot/share/opto/loopTransform.cpp line 3790: > >> 3788: phase->do_peeling(this, old_new); >> 3789: return false; >> 3790: } > > Just because I'm curious: why do the other places not already peel these loops? I.e. why do we need this here?
Peeling looks for a loop invariant condition with one branch that exits the loop because then peeling makes the test in the loop body redundant with the one in the peeled iteration. Here, if there's a `ScopedValue.get()` on a loop invariant `ScopedValue` object, peeling one iteration will make `ScopedValue.get()` in the loop body redundant with the one in the peeled iteration. So it's not quite the same, at least, because for `ScopedValue.get()` the optimization applies whether `ScopedValue.get()` causes an exit of the loop or not. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1514512542 From roland at openjdk.org Wed Mar 6 14:00:45 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 6 Mar 2024 14:00:45 GMT Subject: RFR: 8323972: C2 compilation fails with assert(!x->as_Loop()->is_loop_nest_inner_loop()) failed: loop was transformed In-Reply-To: References: Message-ID: On Wed, 28 Feb 2024 08:58:23 GMT, Christian Hagedorn wrote: >> Long counted loops are transformed into a loop nest of 2 "regular" >> loops and in a subsequent loop opts round, the inner loop is >> transformed into a counted loop. The limit for the inner loop is set, >> when the loop nest is created, so it's expected there's no need for a >> loop limit check when the counted loop is created. The assert fires >> because, when the counted loop is created, it is found that it needs a >> loop limit check. The reason for that is that the limit is >> transformed, between nest creation and counted loop creation, in a way >> that the range of values of the inner loop's limit becomes >> unknown.
The limit when the nest is created is: >> >> >> 111 ConL === 0 [[ 112 ]] #long:-9223372034707292158 >> 106 Phi === 105 20 94 [[ 112 ]] #long:9223372034707292160..9223372034707292164:www !orig=72 !jvms: TestInaccurateInnerLoopLimit::test @ bci:12 (line 40) >> 112 AddL === _ 106 111 [[ 122 ]] !orig=[110] >> 122 ConvL2I === _ 112 [[ ]] #int >> >> >> The type of 122 is `2..6` but it is then transformed to: >> >> >> 106 Phi === 105 20 154 [[ 191 130 137 ]] #long:9223372034707292160..9223372034707292164:www !orig=[72] !jvms: TestInaccurateInnerLoopLimit::test @ bci:12 (line 40) >> 191 ConvL2I === _ 106 [[ 196 ]] #int >> 195 ConI === 0 [[ 196 ]] #int:max-1 >> 196 SubI === _ 195 191 [[ 201 127 ]] !orig=[123] >> >> >> That is the `(ConvL2I (AddL ...))` is transformed into a `(SubI >> (ConvL2I ))`. `ConvL2I` for an input that's out of the int range of >> values returns TypeInt::INT and the bounds of the limit are lost. I >> propose adding a `CastII` after the `ConvL2I` so the range of values >> of the limit doesn't get lost. > > src/hotspot/share/opto/loopnode.cpp line 955: > >> 953: // opts pass, an accurate range of values for the limits is found. >> 954: const TypeInt* inner_iters_actual_int_range = TypeInt::make(0, iters_limit, Type::WidenMin); >> 955: inner_iters_actual_int = new CastIINode(outer_head, inner_iters_actual_int, inner_iters_actual_int_range, ConstraintCastNode::UnconditionalDependency); > > The fix idea looks reasonable to me. I have two questions: > - Do we really need to pin the `CastII` here? We have not pinned the `ConvL2I` before. And here I think we just want to ensure that the type is not lost. > - Related to the first question, could we just use a normal dependency instead? > > I was also wondering if we should try to improve the type of `ConvL2I` and of `Add/Sub` (and possibly also `Mul`) nodes in general? For `ConvL2I`, we could set a better type if we know that `(int)lo <= (int)hi` and `abs(hi - lo) <= 2^32`. 
We still have a problem to set a better type if we have a narrow range of inputs that includes `min` and `max` (e.g. `min+1, min, max, max-1`). In this case, `ConvL2I` just uses `int` as type. Then we could go a step further and do the same type optimization for `Add/Sub` nodes by directly looking through a convert/cast node at the input type. The resulting `Add/Sub` range could maybe be represented by something better than `int`: > > Example: > input type to `ConvL2I`: `[2147483647L, 2147483648L]` -> type of `ConvL2I` is `int` since we cannot represent "`[max_int, min_int]`" with two intervals otherwise. > `AddI` = `ConvL2I` + 2 -> type could be improved to `[min_int+1,min_int+2]`. > > > But that might exceed the scope of this fix. Going with `CastII` for now seems to be the least risky. Thanks for reviewing this. > The fix idea looks reasonable to me. I have two questions: > > * Do we really need to pin the `CastII` here? We have not pinned the `ConvL2I` before. And here I think we just want to ensure that the type is not lost. I think it's good practice to set the control of a cast node. It probably doesn't make much of a difference here but we had so many issues with cast nodes that not setting control on cast makes me nervous now. > * Related to the first question, could we just use a normal dependency instead? The problem with a normal dependency is that initially the cast and its non-transformed input have the same types. So, there is a chance the cast is processed by igvn before its input changes and if that happens, the cast would then be removed. > I was also wondering if we should try to improve the type of `ConvL2I` and of `Add/Sub` (and possibly also `Mul`) nodes in general? For `ConvL2I`, we could set a better type if we know that `(int)lo <= (int)hi` and `abs(hi - lo) <= 2^32`. We still have a problem to set a better type if we have a narrow range of inputs that includes `min` and `max` (e.g. `min+1, min, max, max-1`).
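The wrap-around discussed in this exchange can be checked directly in Java, where the narrowing `(int)` cast has the same two's-complement semantics as C2's `ConvL2I` (the class name below is purely for illustration):

```java
// Narrowing a long range that straddles the int boundary: the two casts land
// at opposite ends of the int range, so no single int interval covers them,
// but adding a constant wraps both values back into one narrow interval.
public class ConvL2ISketch {
    public static void main(String[] args) {
        int a = (int) 2147483647L;  // max_int
        int b = (int) 2147483648L;  // wraps to min_int
        System.out.println(a == Integer.MAX_VALUE);          // true
        System.out.println(b == Integer.MIN_VALUE);          // true
        // AddI = ConvL2I + 2: max_int + 2 wraps to min_int + 1
        System.out.println(a + 2 == Integer.MIN_VALUE + 1);  // true
        System.out.println(b + 2 == Integer.MIN_VALUE + 2);  // true
    }
}
```

This is why the `AddI` result above could, in principle, be given the narrow type `[min_int+1, min_int+2]` even though its `ConvL2I` input can only be typed as plain `int`.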
In this case, `ConvL2I` just uses `int` as type. Then we could go a step further and do the same type optimization for `Add/Sub` nodes by directly looking through a convert/cast node at the input type. The resulting `Add/Sub` range could maybe be represented by something better than `int`: > > Example: input type to `ConvL2I`: `[2147483647L, 2147483648L]` -> type of `ConvL2I` is `int` since we cannot represent "`[max_int, min_int]`" with two intervals otherwise. `AddI` = `ConvL2I` + 2 -> type could be improved to `[min_int+1,min_int+2]`. > > But that might exceed the scope of this fix. Going with `CastII` for now seems to be the least risky. I thought about that too (I didn't go as far as you did though) and my conclusion is that the change I propose should be more robust (what if the improved type computation still misses some cases that we later find are required) and less risky. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17965#discussion_r1514532046 From chagedorn at openjdk.org Wed Mar 6 14:22:11 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 6 Mar 2024 14:22:11 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v5] In-Reply-To: References: Message-ID: > In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. > > #### Redo refactoring of `create_bool_from_template_assertion_predicate()` > On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously.
We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). > > #### Share data graph cloning code - start from existing code > This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: > > 1. Collect data nodes to clone by using a node filter > 2. Clone the collected nodes (their data and control inputs still point to the old nodes) > 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. > > #### Shared data graph cloning class > Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: > > 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] > 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] > > `... 
Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: format ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18080/files - new: https://git.openjdk.org/jdk/pull/18080/files/14b46ba6..9a3d97e3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18080.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18080/head:pull/18080 PR: https://git.openjdk.org/jdk/pull/18080 From gcao at openjdk.org Wed Mar 6 14:29:59 2024 From: gcao at openjdk.org (Gui Cao) Date: Wed, 6 Mar 2024 14:29:59 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 [v2] In-Reply-To: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: > Hi, please review this patch that fixes the minimal build failure for riscv. > > Error log for minimal build: > > Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 > gmake[3]: *** Waiting for unfinished jobs....
> gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 > gmake[2]: *** Waiting for unfinished jobs.... > > ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) > > === Output from failing command(s) repeated here === > * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > > * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. > === End of repeated output === > > No indication of failed target found. > HELP: Try searching the build log for '] Error'. > HELP: Run 'make doctor' to diagnose build problems. > > make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 > make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 > > > The root cause is that MaxVectorSize is only defined under COMPILER2, so we should use VM_Version::_initial_vector_length instead of MaxVectorSize.
> > Testing: > > - [x] linux-riscv minimal fastdebug native build Gui Cao has updated the pull request incrementally with one additional commit since the last revision: Fix for robehn comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18114/files - new: https://git.openjdk.org/jdk/pull/18114/files/893da741..e4bc8405 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18114&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18114&range=00-01 Stats: 18 lines in 1 file changed: 17 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18114.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18114/head:pull/18114 PR: https://git.openjdk.org/jdk/pull/18114 From gcao at openjdk.org Wed Mar 6 14:37:01 2024 From: gcao at openjdk.org (Gui Cao) Date: Wed, 6 Mar 2024 14:37:01 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 [v3] In-Reply-To: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: > Hi, please review this patch that fix the minimal build failed for riscv. > > Error log for minimal build: > > Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) > ^@/home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function ?u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)?: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: ?MaxVectorSize? was not declared in this scope; did you mean ?MaxNewSize?? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 > gmake[3]: *** Waiting for unfinished jobs.... 
> ^@gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 > gmake[2]: *** Waiting for unfinished jobs.... > ^@ > ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) > > === Output from failing command(s) repeated here === > * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function ?u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)?: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: ?MaxVectorSize? was not declared in this scope; did you mean ?MaxNewSize?? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > > * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. > === End of repeated output === > > No indication of failed target found. > HELP: Try searching the build log for '] Error'. > HELP: Run 'make doctor' to diagnose build problems. > > make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 > make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 > > > The root cause is that MaxVectorSize is only defined under COMPILER2 , We should use VM_Version::_initial_vector_length instead of MaxVectorSize. > > Testing: > > - [x] linux-riscv minimal fastdebug native build Gui Cao has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. 
The pull request contains one new commit since the last revision: Fix for robehn comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18114/files - new: https://git.openjdk.org/jdk/pull/18114/files/e4bc8405..0c7a6780 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18114&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18114&range=01-02 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18114.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18114/head:pull/18114 PR: https://git.openjdk.org/jdk/pull/18114 From gcao at openjdk.org Wed Mar 6 14:50:47 2024 From: gcao at openjdk.org (Gui Cao) Date: Wed, 6 Mar 2024 14:50:47 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 In-Reply-To: References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Wed, 6 Mar 2024 12:20:43 GMT, Robbin Ehn wrote: >> The SHA intrinsics are only used in "LibraryCallKit::inline_digestBase_implCompress" and JVMCI. >> So I think these (plus md5 and chacha) should be put into an ifdef COMPILER2_OR_JVMCI block. (I was going to do that but it slipped my mind) >> >> The MaxVectorSize is defined if JVMCI and/or C2 is defined: >> `NOT_COMPILER2(product(intx, MaxVectorSize, 64,` > >> I agree with @robehn! We can put those function definitions into an ifdef COMPILER2_OR_JVMCI block to avoid such a problem. I don't see other uses of them for now. > > This also makes it clear that C1/interpreter don't use them, hence if someone needs a speed up there they could try to make use of them. @robehn @RealFYang Thanks for your review, I've put those function definitions into a #if COMPILER2_OR_JVMCI block to avoid such a problem. The reason #ifdef COMPILER2_OR_JVMCI is not used is that the macro might be defined as `#define COMPILER2_OR_JVMCI 0`, in which case #ifdef would still be true. Could you please look at it again?
// COMPILER2 or JVMCI
#if defined(COMPILER2) || INCLUDE_JVMCI
#define COMPILER2_OR_JVMCI 1
#define COMPILER2_OR_JVMCI_PRESENT(code) code
#define NOT_COMPILER2_OR_JVMCI(code)
#define NOT_COMPILER2_OR_JVMCI_RETURN        /* next token must be ; */
#define NOT_COMPILER2_OR_JVMCI_RETURN_(code) /* next token must be ; */
#else
#define COMPILER2_OR_JVMCI 0
#define COMPILER2_OR_JVMCI_PRESENT(code)
#define NOT_COMPILER2_OR_JVMCI(code) code
#define NOT_COMPILER2_OR_JVMCI_RETURN {}
#define NOT_COMPILER2_OR_JVMCI_RETURN_(code) { return code; }
#endif

------------- PR Comment: https://git.openjdk.org/jdk/pull/18114#issuecomment-1981034579 From chagedorn at openjdk.org Wed Mar 6 14:52:46 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 6 Mar 2024 14:52:46 GMT Subject: RFR: 8323972: C2 compilation fails with assert(!x->as_Loop()->is_loop_nest_inner_loop()) failed: loop was transformed In-Reply-To: References: Message-ID: On Thu, 22 Feb 2024 14:36:52 GMT, Roland Westrelin wrote: > Long counted loops are transformed into a loop nest of 2 "regular" > loops and in a subsequent loop opts round, the inner loop is > transformed into a counted loop. The limit for the inner loop is set, > when the loop nest is created, so it's expected there's no need for a > loop limit check when the counted loop is created. The assert fires > because, when the counted loop is created, it is found that it needs a > loop limit check. The reason for that is that the limit is > transformed, between nest creation and counted loop creation, in a way > that the range of values of the inner loop's limit becomes > unknown.
The limit when the nest is created is:
>
> 111 ConL === 0 [[ 112 ]] #long:-9223372034707292158
> 106 Phi === 105 20 94 [[ 112 ]] #long:9223372034707292160..9223372034707292164:www !orig=72 !jvms: TestInaccurateInnerLoopLimit::test @ bci:12 (line 40)
> 112 AddL === _ 106 111 [[ 122 ]] !orig=[110]
> 122 ConvL2I === _ 112 [[ ]] #int
>
> The type of 122 is `2..6` but it is then transformed to:
>
> 106 Phi === 105 20 154 [[ 191 130 137 ]] #long:9223372034707292160..9223372034707292164:www !orig=[72] !jvms: TestInaccurateInnerLoopLimit::test @ bci:12 (line 40)
> 191 ConvL2I === _ 106 [[ 196 ]] #int
> 195 ConI === 0 [[ 196 ]] #int:max-1
> 196 SubI === _ 195 191 [[ 201 127 ]] !orig=[123]
>
> That is, the `(ConvL2I (AddL ...))` is transformed into a `(SubI (ConvL2I ))`. `ConvL2I` for an input that's out of the int range of values returns TypeInt::INT and the bounds of the limit are lost. I propose adding a `CastII` after the `ConvL2I` so the range of values of the limit doesn't get lost. Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/17965#pullrequestreview-1919967526 From chagedorn at openjdk.org Wed Mar 6 14:52:47 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 6 Mar 2024 14:52:47 GMT Subject: RFR: 8323972: C2 compilation fails with assert(!x->as_Loop()->is_loop_nest_inner_loop()) failed: loop was transformed In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 13:57:53 GMT, Roland Westrelin wrote: > I think it's good practice to set the control of a cast node. It probably doesn't make much of a difference here but we had so many issues with cast nodes that not setting control on cast makes me nervous now. That is indeed a general problem. The situation certainly got better by removing the code that optimized cast nodes that were pinned at If Projections (https://github.com/openjdk/jdk/commit/7766785098816cfcdae3479540cdc866c1ed18ad).
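[Editor's note: the type loss Roland describes above can be reproduced numerically in plain Java, whose long-to-int narrowing matches the wraparound semantics of `ConvL2I`. A standalone sketch (class name is mine, not JDK code) using the constants from the quoted graph dump:

```java
public class ConvL2ITypeLoss {
    public static void main(String[] args) {
        long c = -9223372034707292158L; // the ConL constant in the quoted graph
        // The Phi's long range from the quoted dump:
        for (long x = 9223372034707292160L; x <= 9223372034707292164L; x++) {
            int narrow  = (int) (x + c);      // ConvL2I(AddL): value is in [2, 6]
            int reassoc = (int) x + (int) c;  // after reassociation: same value,
                                              // but ConvL2I(x) alone only has type int
            System.out.println(narrow + " " + (narrow == reassoc));
        }
    }
}
```

Both expressions always agree (narrowing distributes over wraparound addition), which is why the transform is value-correct; what disappears is the `2..6` type bound, since `ConvL2I` of the out-of-int-range Phi by itself is just `int`.]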
By pinning the casts now, you presumably want to prevent the cast nodes from being pushed through other nodes such that they float "too high", causing unforeseeable data graph folding while control does not? > The problem with a normal dependency is that initially the cast and its non-transformed input have the same types. So, there is a chance the cast is processed by igvn before its input changes and if that happens, the cast would then be removed. I see, thanks for the explanation. Then it makes sense to keep the cast node no matter what. > I thought about that too (I didn't go as far as you did though) and my conclusion is that the change I propose should be more robust (what if the improved type computation still misses some cases that we later find are required) and less risky. I agree, this fix should use casts. Would be interesting to follow this idea in a separate RFE. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17965#discussion_r1514615913 From kvn at openjdk.org Wed Mar 6 17:04:51 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 6 Mar 2024 17:04:51 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v4] In-Reply-To: <8heX7a8FEHXGS5WM2ryS63uCJmQcsATp-hULdo87IDw=.f8d56906-dd4a-40ac-9538-6affe9b5353c@github.com> References: <8heX7a8FEHXGS5WM2ryS63uCJmQcsATp-hULdo87IDw=.f8d56906-dd4a-40ac-9538-6affe9b5353c@github.com> Message-ID: On Tue, 5 Mar 2024 06:40:00 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platforms. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > 1 check _disable_warnings in adlc 2 Fix error in arm_32.ad Latest version v03 testing passed.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1981360465 From duke at openjdk.org Wed Mar 6 17:04:52 2024 From: duke at openjdk.org (kuaiwei) Date: Wed, 6 Mar 2024 17:04:52 GMT Subject: Integrated: 8326983: Unused operands reported after JDK-8326135 In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 05:46:57 GMT, kuaiwei wrote: > Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. > I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. This pull request has now been integrated. Changeset: e92ecd97 Author: Kuai Wei Committer: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/e92ecd9703e0a4f71d52a159516785a3eab5195a Stats: 993 lines in 10 files changed: 17 ins; 966 del; 10 mod 8326983: Unused operands reported after JDK-8326135 Reviewed-by: kvn, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/18075 From sviswanathan at openjdk.org Wed Mar 6 22:04:56 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 6 Mar 2024 22:04:56 GMT Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v2] In-Reply-To: References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: On Tue, 5 Mar 2024 22:37:49 GMT, Dean Long wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> unify the implementation > > So if we can still generate the non-AVX encoding of > > `roundsd dst, src, mode` > > isn't there still a false dependency problem with `dst`? @dean-long You bring up a very good point. The SSE instruction (roundsd dst, src, mode) also has a false dependency problem. 
This can be demonstrated by adding the following benchmark to MathBench.java:

diff --git a/test/micro/org/openjdk/bench/java/lang/MathBench.java b/test/micro/org/openjdk/bench/java/lang/MathBench.java
index c7dde019154..feb472bba3d 100644
--- a/test/micro/org/openjdk/bench/java/lang/MathBench.java
+++ b/test/micro/org/openjdk/bench/java/lang/MathBench.java
@@ -141,6 +141,11 @@ public double ceilDouble() {
         return Math.ceil(double4Dot1);
     }
 
+    @Benchmark
+    public double useAfterCeilDouble() {
+        return Math.ceil(double4Dot1) + Math.floor(double4Dot1);
+    }
+
     @Benchmark
     public double copySignDouble() {
         return Math.copySign(double81, doubleNegative12);

The fix would be to do a pxor on dst before the SSE roundsd instruction, something like below:

diff --git a/src/hotspot/cpu/x86/x86.ad b/src/hotspot/cpu/x86/x86.ad
index cf4aef83df2..eb6701f82a7 100644
--- a/src/hotspot/cpu/x86/x86.ad
+++ b/src/hotspot/cpu/x86/x86.ad
@@ -3874,6 +3874,9 @@ instruct roundD_reg(legRegD dst, legRegD src, immU8 rmode) %{
   ins_cost(150);
   ins_encode %{
     assert(UseSSE >= 4, "required");
+    if ((UseAVX == 0) && ($dst$$XMMRegister != $src$$XMMRegister)) {
+      __ pxor($dst$$XMMRegister, $dst$$XMMRegister);
+    }
     __ roundsd($dst$$XMMRegister, $src$$XMMRegister, $rmode$$constant);
   %}
   ins_pipe(pipe_slow);

------------- PR Comment: https://git.openjdk.org/jdk/pull/18089#issuecomment-1981879809 From fyang at openjdk.org Thu Mar 7 02:32:53 2024 From: fyang at openjdk.org (Fei Yang) Date: Thu, 7 Mar 2024 02:32:53 GMT Subject: RFR: 8327426: RISC-V: Move alignment shim into initialize_header() in C1_MacroAssembler::allocate_array In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 04:06:47 GMT, Gui Cao wrote: > Hi, I noticed that RISC-V missed this change from #11044 [1]: > > `I know @albertnetymk already touched on this but some thoughts on the unclear boundaries between the header and the data.
My feeling is that the most pragmatic solution would be to have the header initialization always initialize up to the word aligned (up) header_size_in_bytes. (Similarly to how it is done for the instanceOop where the klass gap gets initialized with the header, even if it may be data.) And have the body initialization do the rest (word aligned to word aligned clear).` > > `This seems preferable than adding these extra alignment shims in-between the header and body/payload/data initialization. (I also tried moving the alignment fix into the body initialization, but it seems a little bit messier in the implementation.)` > > > After this patch, it will be more consistent with other CPU platforms like X86 and ARM64. > > [1] https://github.com/openjdk/jdk/pull/11044#pullrequestreview-1894323275 > > ### Tests > > - [x] Run tier1-3 tests on SiFive unmatched (release) Thanks! ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18131#pullrequestreview-1921298462 From ddong at openjdk.org Thu Mar 7 03:00:00 2024 From: ddong at openjdk.org (Denghui Dong) Date: Thu, 7 Mar 2024 03:00:00 GMT Subject: RFR: 8327379: Make TimeLinearScan a develop flag [v2] In-Reply-To: References: <0qF035kq8V0ehMhAdf1_ptnYwXHdtifMIazkoSn5laI=.54fb7a02-ca09-4f25-a920-1b13ff49ff78@github.com> Message-ID: <3t8sJ2UoFZRS_8XbDrlNRawPW7ZgpPNwqgbyVUcJNiI=.d1f693bf-e8d6-4ce5-8596-019cae1918a1@github.com> On Wed, 6 Mar 2024 06:16:54 GMT, Guoxiong Li wrote: >> Denghui Dong has updated the pull request incrementally with one additional commit since the last revision: >> >> update header > > The patch looks good. But I don't really know whether it deserves to do that. @lgxbslgx @vnkozlov Thanks for the review. 
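[Editor's note: the word-aligned header-initialization idea quoted in the allocate_array thread above can be sketched with plain arithmetic. This is a standalone illustration with hypothetical sizes (class name, method name, and the 12-byte header are my assumptions, not JDK code):

```java
public class HeaderAlign {
    // Round x up to the next multiple of a (a must be a power of two).
    static long alignUp(long x, long a) {
        return (x + a - 1) & ~(a - 1);
    }

    public static void main(String[] args) {
        // Hypothetical example: a 12-byte array header on a VM with 8-byte
        // words. Initializing the header up to the word-aligned header size
        // covers bytes [0, 16), so the body clear can start word-aligned at
        // byte 16 with no separate alignment shim in between.
        long headerSizeInBytes = 12;
        long bytesPerWord = 8;
        System.out.println(alignUp(headerSizeInBytes, bytesPerWord));
    }
}
```

This prints `16`: header initialization ends exactly where a word-aligned body clear can begin, which is the "word aligned (up) header_size_in_bytes" boundary the quoted comment proposes.]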
------------- PR Comment: https://git.openjdk.org/jdk/pull/18125#issuecomment-1982245847 From ddong at openjdk.org Thu Mar 7 03:00:00 2024 From: ddong at openjdk.org (Denghui Dong) Date: Thu, 7 Mar 2024 03:00:00 GMT Subject: Integrated: 8327379: Make TimeLinearScan a develop flag In-Reply-To: <0qF035kq8V0ehMhAdf1_ptnYwXHdtifMIazkoSn5laI=.54fb7a02-ca09-4f25-a920-1b13ff49ff78@github.com> References: <0qF035kq8V0ehMhAdf1_ptnYwXHdtifMIazkoSn5laI=.54fb7a02-ca09-4f25-a920-1b13ff49ff78@github.com> Message-ID: <9qY_rjrToWVdaI3scSYxILx8nCnPsxvPTPWQFgjGFTI=.6d23a92d-a168-4267-9406-2612907f54f1@github.com> On Tue, 5 Mar 2024 15:54:34 GMT, Denghui Dong wrote: > Hi, > > Please help review this change that makes TimeLinearScan a develop flag. > > Currently, TimeLinearScan is only used in code guarded by '#ifndef PRODUCT'. We should move it to develop or maybe notproduct. This pull request has now been integrated. Changeset: 40183412 Author: Denghui Dong URL: https://git.openjdk.org/jdk/commit/401834122dc3afb3feb9f7b31fc785de82ba2e58 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8327379: Make TimeLinearScan a develop flag Reviewed-by: gli, kvn ------------- PR: https://git.openjdk.org/jdk/pull/18125 From dlong at openjdk.org Thu Mar 7 03:15:55 2024 From: dlong at openjdk.org (Dean Long) Date: Thu, 7 Mar 2024 03:15:55 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v6] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:12:12 GMT, Galder Zamarre?o wrote: >> Adding C1 intrinsic for primitive array clone invocations for aarch64 and x86 architectures. >> >> The intrinsic includes a change to avoid zeroing the newly allocated array because its contents are copied over within the same intrinsic with arraycopy. This means that the performance of primitive array clone exceeds that of primitive array copy. 
As an example, here are the microbenchmark results on darwin/aarch64:
>>
>>
>> $ make test TEST="micro:java.lang.ArrayClone" MICRO="JAVA_OPTIONS=-XX:TieredStopAtLevel=1"
>> Benchmark (size) Mode Cnt Score Error Units
>> ArrayClone.byteArraycopy 0 avgt 15 3.476 ± 0.018 ns/op
>> ArrayClone.byteArraycopy 10 avgt 15 3.740 ± 0.017 ns/op
>> ArrayClone.byteArraycopy 100 avgt 15 7.124 ± 0.010 ns/op
>> ArrayClone.byteArraycopy 1000 avgt 15 39.301 ± 0.106 ns/op
>> ArrayClone.byteClone 0 avgt 15 3.478 ± 0.008 ns/op
>> ArrayClone.byteClone 10 avgt 15 3.562 ± 0.007 ns/op
>> ArrayClone.byteClone 100 avgt 15 5.888 ± 0.206 ns/op
>> ArrayClone.byteClone 1000 avgt 15 25.762 ± 0.203 ns/op
>> ArrayClone.intArraycopy 0 avgt 15 3.199 ± 0.016 ns/op
>> ArrayClone.intArraycopy 10 avgt 15 4.521 ± 0.008 ns/op
>> ArrayClone.intArraycopy 100 avgt 15 17.429 ± 0.039 ns/op
>> ArrayClone.intArraycopy 1000 avgt 15 178.432 ± 0.777 ns/op
>> ArrayClone.intClone 0 avgt 15 3.406 ± 0.016 ns/op
>> ArrayClone.intClone 10 avgt 15 4.272 ± 0.006 ns/op
>> ArrayClone.intClone 100 avgt 15 13.110 ± 0.122 ns/op
>> ArrayClone.intClone 1000 avgt 15 113.196 ± 13.400 ns/op
>>
>>
>> It also includes an optimization to avoid instantiating the array copy stub in scenarios like this.
>>
>> I ran hotspot compiler tests successfully limiting them to C1 compilation darwin/aarch64, linux/x86_64 and linux/686. E.g.
>>
>>
>> $ make test TEST="hotspot_compiler" JTREG="JAVA_OPTIONS=-XX:TieredStopAtLevel=1"
>> ...
>> TEST TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg:hotspot_compiler 1234 1234 0 0
>>
>>
>> One question I had is what to do about non-primitive object arrays, see my [question](https://bugs.openjdk.org/browse/JDK-8302850?focusedId=14634879&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14634879) on the issue. @cl4es any thoughts?
>>...

> Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains ten commits:
>
> - Merge branch 'master' into topic.0131.c1-array-clone
> - Reserve necessary frame map space for clone use cases
> - 8302850: C1 primitive array clone intrinsic in graph
>
> * Combine array length, new type array and arraycopy for clone in c1 graph.
> * Add OmitCheckFlags to skip arraycopy checks.
> * Instantiate ArrayCopyStub only if necessary.
> * Avoid zeroing newly created arrays for clone.
> * Add array null after c1 clone compilation test.
> * Pass force reexecute to intrinsic via value stack.
> This is needed to be able to deoptimize correctly this intrinsic.
> * When new type array or array copy are used for the clone intrinsic,
> their state needs to be based on the state before for deoptimization
> to work as expected.
> - Revert "8302850: Primitive array copy C1 intrinsic for aarch64 and x86"
>
> This reverts commit fe5d916724614391a685bbef58ea939c84197d07.
> - 8302850: Link code emit infos for null check and alloc array
> - 8302850: Null check array before getting its length
>
> * Added a jtreg test to verify the null check works.
> Without the fix this test fails with a SEGV crash.
> - 8302850: Force reexecuting clone in case of a deoptimization
>
> * Copy state including locals for clone
> so that reexecution works as expected.
> - 8302850: Avoid instantiating array copy stub for clone use cases
> - 8302850: Primitive array copy C1 intrinsic for aarch64 and x86
>
> * Clone calls that involve Phi nodes are not supported.
> * Add unimplemented stubs for other platforms.

I'm looking at it again, and I'm trying to figure out how we can minimize platform-specific changes. I'm hoping we can move some of the set_force_reexecute boiler-plate code into shared code. We probably don't need _force_reexecute in CodeEmitInfo anymore.
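[Editor's note: the two code shapes the ArrayClone benchmark above compares can be written out in plain Java. A standalone sketch (class and method names are mine, not the benchmark's); the point of the intrinsic is that the `clone()` shape can skip zeroing the fresh array because every element is immediately overwritten by the copy:

```java
import java.util.Arrays;

public class CloneShapes {
    // Shape 1: primitive array clone, the case the C1 intrinsic targets.
    static int[] byClone(int[] src) {
        return src.clone();
    }

    // Shape 2: explicit allocate-then-arraycopy, which must first zero
    // the new array and then overwrite it.
    static int[] byArraycopy(int[] src) {
        int[] dst = new int[src.length];
        System.arraycopy(src, 0, dst, 0, src.length);
        return dst;
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4};
        System.out.println(Arrays.equals(byClone(src), byArraycopy(src)));
    }
}
```

Both shapes produce equal arrays (the check prints `true`); the benchmark numbers quoted earlier measure the cost difference once C1 compiles each shape.]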
------------- PR Comment: https://git.openjdk.org/jdk/pull/17667#issuecomment-1982266410 From fyang at openjdk.org Thu Mar 7 04:24:56 2024 From: fyang at openjdk.org (Fei Yang) Date: Thu, 7 Mar 2024 04:24:56 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 [v3] In-Reply-To: References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: <-t4-oX3IpX98aNrYJrhKGFypWKcRZ6hLFmacEircyM4=.41bb78d2-80e2-40d4-b489-3ca99f3b297e@github.com> On Wed, 6 Mar 2024 14:37:01 GMT, Gui Cao wrote: >> Hi, please review this patch that fix the minimal build failed for riscv. >> >> Error log for minimal build: >> >> Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) >> ^@/home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function ?u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)?: >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: ?MaxVectorSize? was not declared in this scope; did you mean ?MaxNewSize?? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 >> gmake[3]: *** Waiting for unfinished jobs.... >> ^@gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 >> gmake[2]: *** Waiting for unfinished jobs.... 
>> ^@ >> ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) >> >> === Output from failing command(s) repeated here === >> * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function ?u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)?: >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: ?MaxVectorSize? was not declared in this scope; did you mean ?MaxNewSize?? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> >> * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. >> === End of repeated output === >> >> No indication of failed target found. >> HELP: Try searching the build log for '] Error'. >> HELP: Run 'make doctor' to diagnose build problems. >> >> make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 >> make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 >> >> >> The root cause is that MaxVectorSize is only defined under COMPILER2 , We should use VM_Version::_initial_vector_length instead of MaxVectorSize. >> >> Testing: >> >> - [... > > Gui Cao has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > Fix for robehn comment Hi, I think we can simply move the sha256/512 part together with code for md5, chacha20 and sha1 and add put them into a single #if COMPILER2_OR_JVMCI block. Thanks. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18114#issuecomment-1982321606 From gcao at openjdk.org Thu Mar 7 06:19:03 2024 From: gcao at openjdk.org (Gui Cao) Date: Thu, 7 Mar 2024 06:19:03 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 [v4] In-Reply-To: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: > Hi, please review this patch that fix the minimal build failed for riscv. > > Error log for minimal build: > > Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) > ^@/home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function ?u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)?: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: ?MaxVectorSize? was not declared in this scope; did you mean ?MaxNewSize?? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 > gmake[3]: *** Waiting for unfinished jobs.... > ^@gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 > gmake[2]: *** Waiting for unfinished jobs.... 
> ^@ > ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) > > === Output from failing command(s) repeated here === > * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function ?u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)?: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: ?MaxVectorSize? was not declared in this scope; did you mean ?MaxNewSize?? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > > * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. > === End of repeated output === > > No indication of failed target found. > HELP: Try searching the build log for '] Error'. > HELP: Run 'make doctor' to diagnose build problems. > > make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 > make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 > > > The root cause is that MaxVectorSize is only defined under COMPILER2 , We should use VM_Version::_initial_vector_length instead of MaxVectorSize. 
> > Testing: > > - [x] linux-riscv minimal fastdebug native build Gui Cao has updated the pull request incrementally with two additional commits since the last revision: - Move the sha256/512 part together with code for md5, chacha20 and sha1 and add put them into a single #if COMPILER2_OR_JVMCI block - Revert use VM_Version::_initial_vector_length instead of MaxVectorSize ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18114/files - new: https://git.openjdk.org/jdk/pull/18114/files/0c7a6780..a2199a46 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18114&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18114&range=02-03 Stats: 685 lines in 1 file changed: 317 ins; 334 del; 34 mod Patch: https://git.openjdk.org/jdk/pull/18114.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18114/head:pull/18114 PR: https://git.openjdk.org/jdk/pull/18114 From epeter at openjdk.org Thu Mar 7 06:55:59 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 06:55:59 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: <4xUS7qBreZ6-cAbHSsVRB0u8Nr_MQa3SdrGiG33Nkw4=.6dad9868-ca3d-4617-bfc1-911df4ed7c2d@github.com> On Wed, 6 Mar 2024 08:36:57 GMT, Roland Westrelin wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> a little bit of casting for debug printing code > > src/hotspot/share/opto/memnode.cpp line 2802: > >> 2800: StoreNode* use = can_merge_primitive_array_store_with_use(phase, true); >> 2801: if (use != nullptr) { >> 2802: return nullptr; > > Do you want to assert that the use is in the igvn worklist? Hmm. I think that would not be a good assert. Let's assume we have 4 stores that would merge. 
Then the last one of them does the merging, and replaces itself with the merged store. If `this` is the second store, then it could merge with its def (the first store). But since it has a use that could also be merged with, we delegate the merging down. But it is not done by the 3rd store, rather the 4th. So we cannot assert that the 3rd would be in the worklist. The 3rd may have been processed before, and determined that it does not want to idealize itself, and be removed from the worklist. Maybe I can improve the comment: // Merging is done by the last store in a chain. We have a use that could be merged with, so we // are not the last store, and hence must wait for some (recursive) use to do the merge. > src/hotspot/share/opto/memnode.cpp line 3146: > >> 3144: Node* ctrl_s1 = s1->in(MemNode::Control); >> 3145: Node* ctrl_s2 = s2->in(MemNode::Control); >> 3146: if (ctrl_s1 != ctrl_s2) { > > Do you need to check that `ctrl_s1` and `ctrl_s2` are not null? I suppose this could be called on a dying part of the graph during igvn. @rwestrel but then would they not be `TOP` rather than `nullptr`? > src/hotspot/share/opto/memnode.hpp line 578: > >> 576: >> 577: Node* Ideal_merge_primitive_array_stores(PhaseGVN* phase); >> 578: StoreNode* can_merge_primitive_array_store_with_use(PhaseGVN* phase, bool check_def); > > If I understand correctly you need the `check_def ` parameter to avoid having `can_merge_primitive_array_store_with_use` and `can_merge_primitive_array_store_with_def` call each other indefinitely. But if I was to write new code that takes advantage of one of the two methods, I think I would be puzzled that there's a `check_def` parameter. Passing `false` would be wrong then but maybe not immediately obvious. 
Maybe it would be better to have `can_merge_primitive_array_store_with_def` with no `check_def` parameter and have all the work done in a utility method that takes a `check_def` parameter (always `true` when called from `can_merge_primitive_array_store_with_def`) You are right, this is not the best code pattern. I'll refactor it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1515627795 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1515629804 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1515628308 From epeter at openjdk.org Thu Mar 7 06:58:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 06:58:58 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: On Wed, 6 Mar 2024 08:52:16 GMT, Roland Westrelin wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> a little bit of casting for debug printing code > > src/hotspot/share/opto/memnode.cpp line 3154: > >> 3152: } >> 3153: ProjNode* other_proj = ctrl_s1->as_IfProj()->other_if_proj(); >> 3154: if (other_proj->is_uncommon_trap_proj(Deoptimization::Reason_range_check) == nullptr || > > This could be a range check for an unrelated array I suppose. Does it matter? I don't think it matters, no. Do you see a scenario where it would matter? My argument: It is safe to do the stores after the RC rather than before it. And if the RC trap relies on the memory state of the stores that were before the RC, then those stores simply don't lose all their uses, and stay in the graph. After all, we only remove the "last" store by replacing it with the merged store, so the other stores only disappear if they have no other use. 
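To make the store-merging pattern discussed in this thread concrete, here is a minimal Java sketch (illustrative only, not code from the patch): four adjacent byte stores of the shape the optimization targets, next to the semantically equivalent single wide store it can be replaced with.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class MergeStoresSketch {
    // Four adjacent byte stores of one int value (little-endian layout) --
    // the kind of chain the optimization can merge into one wider store.
    static byte[] storeBytes(int v) {
        byte[] a = new byte[4];
        a[0] = (byte) (v      );
        a[1] = (byte) (v >>  8);
        a[2] = (byte) (v >> 16);
        a[3] = (byte) (v >> 24);
        return a;
    }

    // The equivalent single four-byte store, written explicitly.
    static byte[] storeInt(int v) {
        byte[] a = new byte[4];
        ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN).putInt(v);
        return a;
    }
}
```

Both methods produce the same array contents for every input, which is what allows a compiler to substitute one for the other when the platform permits unaligned accesses.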
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1515632976 From gcao at openjdk.org Thu Mar 7 07:11:55 2024 From: gcao at openjdk.org (Gui Cao) Date: Thu, 7 Mar 2024 07:11:55 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 [v4] In-Reply-To: References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Thu, 7 Mar 2024 06:19:03 GMT, Gui Cao wrote: >> Hi, please review this patch that fixes the minimal build failure for riscv. >> >> Error log for minimal build: >> >> Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 >> gmake[3]: *** Waiting for unfinished jobs.... >> gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 >> gmake[2]: *** Waiting for unfinished jobs.... >> >> ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) >> >> === Output from failing command(s) repeated here === >> * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' 
was not declared in this scope; did you mean 'MaxNewSize'? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> >> * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. >> === End of repeated output === >> >> No indication of failed target found. >> HELP: Try searching the build log for '] Error'. >> HELP: Run 'make doctor' to diagnose build problems. >> >> make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 >> make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 >> >> >> The root cause is that MaxVectorSize is only defined under COMPILER2, so we should use VM_Version::_initial_vector_length instead of MaxVectorSize. >> >> Testing: >> >> - [... > > Gui Cao has updated the pull request incrementally with two additional commits since the last revision: > > - Move the sha256/512 part together with code for md5, chacha20 and sha1 and add put them into a single #if COMPILER2_OR_JVMCI block > - Revert use VM_Version::_initial_vector_length instead of MaxVectorSize Hi, I've made the changes for the review, and the minimal/server build was successful. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18114#issuecomment-1982666432 From fyang at openjdk.org Thu Mar 7 07:14:54 2024 From: fyang at openjdk.org (Fei Yang) Date: Thu, 7 Mar 2024 07:14:54 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 [v4] In-Reply-To: References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Thu, 7 Mar 2024 06:19:03 GMT, Gui Cao wrote: 
>> >> Error log for minimal build: >> >> Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) >> ^@/home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function ?u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)?: >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: ?MaxVectorSize? was not declared in this scope; did you mean ?MaxNewSize?? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 >> gmake[3]: *** Waiting for unfinished jobs.... >> ^@gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 >> gmake[2]: *** Waiting for unfinished jobs.... >> ^@ >> ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) >> >> === Output from failing command(s) repeated here === >> * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function ?u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)?: >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: ?MaxVectorSize? was not declared in this scope; did you mean ?MaxNewSize?? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> >> * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. >> === End of repeated output === >> >> No indication of failed target found. >> HELP: Try searching the build log for '] Error'. >> HELP: Run 'make doctor' to diagnose build problems. 
>> >> make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 >> make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 >> >> >> The root cause is that MaxVectorSize is only defined under COMPILER2 , We should use VM_Version::_initial_vector_length instead of MaxVectorSize. >> >> Testing: >> >> - [... > > Gui Cao has updated the pull request incrementally with two additional commits since the last revision: > > - Move the sha256/512 part together with code for md5, chacha20 and sha1 and add put them into a single #if COMPILER2_OR_JVMCI block > - Revert use VM_Version::_initial_vector_length instead of MaxVectorSize Marked as reviewed by fyang (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/18114#pullrequestreview-1921613004 From epeter at openjdk.org Thu Mar 7 07:47:57 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 07:47:57 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v6] In-Reply-To: References: Message-ID: On Tue, 30 Jan 2024 14:43:41 GMT, Roland Westrelin wrote: >> @shipilev >> You are right, I need to guard the optimization with `UseUnalignedAccesses`. Just added it. Thanks you ? >> Probably my tests would have run into the `SIGBUS` you mentioned. >> >> About `InitializeNode::coalesce_subword_stores`: >> It only works on raw-stores, which write fields before the initialization of an object. It only works with constants. >> Hence, the pattern is quite different. >> Merging the two would be a lot of work. Too much for me for now. >> But maybe one day we can cover all these cases in a single optimization, that merges/coalesces all sorts of loads and stores, and essencially vectorizes any straingt-line code, at least for loads and stores. >> For now, I just wanted to add the feature that @cl4es and @RogerRiggs were specifically asking for, which is merging array stores for constants and variables (using shift to split). 
>> >> @rwestrel >> Ok. Well in that case I might have to make a more intelligent pointer-analysis, and parse past `ConvI2L` and `CastII` nodes. > >> Ok. Well in that case I might have to make a more intelligent pointer-analysis, and parse past ConvI2L and CastII nodes. > > Do you still need a traversal of the graph to find the Stores or can you enqueue them for post loop opts then? @rwestrel > Do you intend to add an IR test case? I already have IR tests that also do result verification: `test/hotspot/jtreg/compiler/c2/TestMergeStores.java` ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1982785244 From epeter at openjdk.org Thu Mar 7 07:47:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 07:47:58 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: On Wed, 6 Mar 2024 08:58:01 GMT, Roland Westrelin wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> a little bit of casting for debug printing code > > src/hotspot/share/opto/memnode.cpp line 2971: > >> 2969: // The goal is to check if two such ArrayPointers are adjacent for a load or store. >> 2970: // >> 2971: // Note: we accumulate all constant offsets into constant_offset, even the int constant behind > > Is this really needed? For the patterns of interest, aren't the constant pushed down the chain of `AddP` nodes so the address is `(AddP base (AddP ...) constant)`? No, they are not pushed down. 
Consider the access on an int array: `a[invar + 1]` -> `adr = base + ARRAY_INT_BASE_OFFSET + 4 * ConvI2L(invar + 1)` We cannot just push the constant `1` out of the `ConvI2L`, after all `invar + 1` could overflow in the int domain ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1515686654 From epeter at openjdk.org Thu Mar 7 07:51:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 07:51:58 GMT Subject: RFR: 8327172: C2 SuperWord: data node in loop has no input in loop: replace assert with bailout In-Reply-To: References: <4CbSaPG5DEt47GreYegAKBIbf8TJyTio7-y2gBg0WN8=.419560f1-cbda-41da-b68f-76d72ee9c489@github.com> Message-ID: On Tue, 5 Mar 2024 16:59:03 GMT, Christian Hagedorn wrote: >> This is a regression fix from https://github.com/openjdk/jdk/pull/17657. >> >> I had never encountered an example where a data node in the loop body did not have any input node in the loop. >> My assumption was that this should never happen, such a node should move out of the loop itself. >> >> I now encountered such an example. But I think it shows that there are cases where we compute the ctrl wrong. >> >> https://github.com/openjdk/jdk/blob/8835f786b8dc7db1ebff07bbb3dbb61a6c42f6c8/test/hotspot/jtreg/compiler/loopopts/superword/TestNoInputInLoop.java#L65-L73 >> >> I now had a few options: >> 1. Revert to the code before https://github.com/openjdk/jdk/pull/17657: handle such cases with the extra `data_entry` logic. But this would just be extra complexity for patterns that should not exist in the first place. >> 2. Fix the computation of ctrl. But we know that there are many edge cases that are currently wrong, and I am working on verification and fixing these issues in https://github.com/openjdk/jdk/pull/16558. So I would rather fix those pre-existing issues separately. >> 3. Just create a silent bailout from vectorization, with `VStatus::make_failure`. 
>> >> I chose option 3, since it allows simple logic, and only prevents vectorization in cases that are already otherwise broken. > That looks reasonable. I agree to fix the ctrl issues separately and go with a bailout solution for now. Maybe you want to add a note at [JDK-8307982](https://bugs.openjdk.org/browse/JDK-8307982) to not forget about this case here. Thanks @chhagedorn @vnkozlov for the reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/18123#issuecomment-1982793118 From epeter at openjdk.org Thu Mar 7 07:51:59 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 07:51:59 GMT Subject: Integrated: 8327172: C2 SuperWord: data node in loop has no input in loop: replace assert with bailout In-Reply-To: <4CbSaPG5DEt47GreYegAKBIbf8TJyTio7-y2gBg0WN8=.419560f1-cbda-41da-b68f-76d72ee9c489@github.com> References: <4CbSaPG5DEt47GreYegAKBIbf8TJyTio7-y2gBg0WN8=.419560f1-cbda-41da-b68f-76d72ee9c489@github.com> Message-ID: On Tue, 5 Mar 2024 14:53:33 GMT, Emanuel Peter wrote: > This is a regression fix from https://github.com/openjdk/jdk/pull/17657. > > I had never encountered an example where a data node in the loop body did not have any input node in the loop. > My assumption was that this should never happen, such a node should move out of the loop itself. > > I now encountered such an example. But I think it shows that there are cases where we compute the ctrl wrong. > > https://github.com/openjdk/jdk/blob/8835f786b8dc7db1ebff07bbb3dbb61a6c42f6c8/test/hotspot/jtreg/compiler/loopopts/superword/TestNoInputInLoop.java#L65-L73 > > I now had a few options: > 1. Revert to the code before https://github.com/openjdk/jdk/pull/17657: handle such cases with the extra `data_entry` logic. But this would just be extra complexity for patterns that should not exist in the first place. > 2. Fix the computation of ctrl. 
But we know that there are many edge cases that are currently wrong, and I am working on verification and fixing these issues in https://github.com/openjdk/jdk/pull/16558. So I would rather fix those pre-existing issues separately. > 3. Just create a silent bailout from vectorization, with `VStatus::make_failure`. > > I chose option 3, since it allows simple logic, and only prevents vectorization in cases that are already otherwise broken. This pull request has now been integrated. Changeset: f54e5983 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/f54e59835492e86b9178b2050901579707f41100 Stats: 103 lines in 3 files changed: 100 ins; 1 del; 2 mod 8327172: C2 SuperWord: data node in loop has no input in loop: replace assert with bailout Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/18123 From epeter at openjdk.org Thu Mar 7 07:56:03 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 07:56:03 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> Message-ID: <0RKnLUgc6UBtyxSyezCMWsSbP50hu6fQ6UJPHpGlgSU=.9fafa10f-62ee-4ec8-9093-4e204fcbe504@github.com> On Wed, 6 Mar 2024 13:48:36 GMT, Roland Westrelin wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> 32 bit build fix > > I pushed a new set of changes that: > 1) address most of your comments > 2) fix the merge conflict. > I didn't make the change you suggested to the comments because, for pattern matching, I use the actual java code from `ScopedValue.get()`. I think it's easier that way to see what's being pattern matched. @rwestrel nice! I'll run our testing again, now that it is merged. 
FYI: you have some whitespace issues in: `src/hotspot/share/opto/callGenerator.cpp` ------------- PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-1982809402 From rehn at openjdk.org Thu Mar 7 07:56:54 2024 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 7 Mar 2024 07:56:54 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 [v4] In-Reply-To: References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Thu, 7 Mar 2024 06:19:03 GMT, Gui Cao wrote: >> Hi, please review this patch that fixes the minimal build failure for riscv. >> >> Error log for minimal build: >> >> Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 >> gmake[3]: *** Waiting for unfinished jobs.... >> gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 >> gmake[2]: *** Waiting for unfinished jobs.... 
>> >> ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) >> >> === Output from failing command(s) repeated here === >> * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> >> * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. >> === End of repeated output === >> >> No indication of failed target found. >> HELP: Try searching the build log for '] Error'. >> HELP: Run 'make doctor' to diagnose build problems. >> >> make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 >> make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 >> >> >> The root cause is that MaxVectorSize is only defined under COMPILER2, so we should use VM_Version::_initial_vector_length instead of MaxVectorSize. >> >> Testing: >> >> - [... > > Gui Cao has updated the pull request incrementally with two additional commits since the last revision: > > - Move the sha256/512 part together with code for md5, chacha20 and sha1 and add put them into a single #if COMPILER2_OR_JVMCI block > - Revert use VM_Version::_initial_vector_length instead of MaxVectorSize Thanks! ------------- Marked as reviewed by rehn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/18114#pullrequestreview-1921686927 From roland at openjdk.org Thu Mar 7 08:15:17 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 7 Mar 2024 08:15:17 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: <0RKnLUgc6UBtyxSyezCMWsSbP50hu6fQ6UJPHpGlgSU=.9fafa10f-62ee-4ec8-9093-4e204fcbe504@github.com> References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> <0RKnLUgc6UBtyxSyezCMWsSbP50hu6fQ6UJPHpGlgSU=.9fafa10f-62ee-4ec8-9093-4e204fcbe504@github.com> Message-ID: On Thu, 7 Mar 2024 07:53:02 GMT, Emanuel Peter wrote: > FYI: you have some whitespace issues in: `src/hotspot/share/opto/callGenerator.cpp` Thanks. I missed it. Fixed now. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-1982870654 From roland at openjdk.org Thu Mar 7 08:15:17 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 7 Mar 2024 08:15:17 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v11] In-Reply-To: References: Message-ID: > This change implements C2 optimizations for calls to > ScopedValue.get(). Indeed, in: > > > v1 = scopedValue.get(); > ... > v2 = scopedValue.get(); > > > `v2` can be replaced by `v1` and the second call to `get()` can be > optimized out. That's true whatever is between the 2 calls unless a > new mapping for `scopedValue` is created in between (when that happens > no optimizations is performed for the method being compiled). Hoisting > a `get()` call out of loop for a loop invariant `scopedValue` should > also be legal in most cases. > > `ScopedValue.get()` is implemented in java code as a 2 step process. A > cache is attached to the current thread object. If the `ScopedValue` > object is in the cache then the result from `get()` is read from > there. Otherwise a slow call is performed that also inserts the > mapping in the cache. The cache itself is lazily allocated. 
One > `ScopedValue` can be hashed to 2 different indexes in the cache. On a > cache probe, both indexes are checked. As a consequence, the process > of probing the cache is a multi step process (check if the cache is > present, check first index, check second index if first index > failed). If the cache is populated early on, then when the method that > calls `ScopedValue.get()` is compiled, profile reports the slow path > as never taken and only the read from the cache is compiled. > > To perform the optimizations, I added 3 new node types to C2: > > - the pair > ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for > the cache probe > > - a cfg node ScopedValueGetResultNode to help locate the result of the > `get()` call in the IR graph. > > In pseudo code, once the nodes are inserted, the code of a `get()` is: > > > hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue) > if (hits_in_the_cache) { > res = ScopedValueGetLoadFromCache(hits_in_the_cache); > } else { > res = ..; //slow call possibly inlined. Subgraph can be arbitray complex > } > res = ScopedValueGetResult(res) > > > In the snippet: > > > v1 = scopedValue.get(); > ... > v2 = scopedValue.get(); > > > Replacing `v2` by `v1` is then done by starting from the > `ScopedValueGetResult` node for the second `get()` and looking for a > dominating `ScopedValueGetResult` for the same `ScopedValue` > object. When one is found, it is used as a replacement. Eliminating > the second `get()` call is achieved by making > `ScopedValueGetHitsInCache` always successful if there's a dominating > `ScopedValueGetResult` and replacing its companion > `ScopedValueGetLoadFromCache` by the dominating > `ScopedValueGetResult`. > > Hoisting a `g... 
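The two-index cache probe described in the quoted summary can be sketched in plain Java. The slot count and the way the two candidate indexes are derived from the hash are illustrative assumptions here, not the actual `java.lang.ScopedValue` implementation; the point is only the shape of the fast path that C2 sees (check cache, check first index, check second index, else slow path).

```java
class TwoIndexCacheSketch {
    static final int SLOTS = 16;                          // assumed cache size
    static final Object[] cache = new Object[2 * SLOTS];  // flat key/value pairs

    // Probe both candidate indexes for `key`. A null return is a miss:
    // the caller would take the slow path and install the mapping.
    static Object probe(Object key, int hash) {
        int i1 = (hash & (SLOTS - 1)) * 2;                // first candidate slot
        if (cache[i1] == key) {
            return cache[i1 + 1];
        }
        int i2 = ((hash >>> 4) & (SLOTS - 1)) * 2;        // second candidate slot
        if (cache[i2] == key) {
            return cache[i2 + 1];
        }
        return null;
    }

    // Slow path helper: install the mapping at the first candidate slot.
    static void install(Object key, int hash, Object value) {
        int i1 = (hash & (SLOTS - 1)) * 2;
        cache[i1] = key;
        cache[i1 + 1] = value;
    }
}
```

Once the cache is populated, `probe` is a handful of loads and compares, which matches the observation that profiling reports the slow path as never taken.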
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: whitespaces ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16966/files - new: https://git.openjdk.org/jdk/pull/16966/files/57592601..361a6ab7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16966&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16966&range=09-10 Stats: 24 lines in 1 file changed: 0 ins; 0 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/16966.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16966/head:pull/16966 PR: https://git.openjdk.org/jdk/pull/16966 From epeter at openjdk.org Thu Mar 7 08:26:53 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 08:26:53 GMT Subject: RFR: 8325674: Constant fold across compares [v3] In-Reply-To: References: Message-ID: <-CsTGyIK4TUjYf3tHEdBQqivYrm3oju-J7rFgb9IvEw=.968e0f43-02eb-4491-9a37-31fc72be2445@github.com> On Mon, 26 Feb 2024 23:23:57 GMT, Joshua Cao wrote: >> For example, `x + 1 < 2` -> `x < 2 - 1` iff we can prove that `x + 1` does not overflow and `2 - 1` does not overflow. We can always fold if it is an `==` or `!=` since overflow will not affect the result of the comparison. >> >> Consider this more practical example: >> >> >> public void foo(int[] arr) { >> for (i = arr.length - 1; i >= 0; --i) { >> blackhole(arr[i]); >> } >> } >> >> >> C2 emits a loop guard that looks `arr.length - 1 < 0`. We know `arr.length - 1` does not overflow because `arr.length` is positive. We can fold the comparison into `arr.length < 1`. We have to compute `arr.length - 1` computation if we enter the loop anyway, but we can avoid the subtraction computation if we never enter the loop. I believe the simplification can also help with stronger integer range analysis in https://bugs.openjdk.org/browse/JDK-8275202. >> >> Some additional notes: >> * there is various overflow checking code across `src/hotspot/share/opto`. 
I separated out the functions from convertnode.cpp into `type.hpp`. Maybe the functions belong somewhere else? >> * there is a change in Parse::do_if() to repeatedly apply GVN until the test is canonical. We need multiple iterations in the case of `C1 > C2 - X` -> `C2 - X < C1` -> `C2 < X` -> `X > C2`. This fails the assertion if `BoolTest(btest).is_canonical()`. We can avoid this by applying GVN one more time to get `C2 < X`. >> * we should not transform loop backedge conditions. For example, if we have `for (i = 0; i < 10; ++i) {}`, the backedge condition is `i + 1 < 10`. If we transform it into `i < 9`, it messes with CountedLoop's recognition of induction variables and strides. >> * this change optimizes some of the equality checks in `TestUnsignedComparison.java` and breaks the IR checks. I removed those tests. > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > comments with explanations and style changes Ok. I discussed it quickly with @vnkozlov. He said we should be careful, and I'll have to run high tier testing on our side and some performance testing as well. But we can go ahead. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17853#issuecomment-1982905864 From epeter at openjdk.org Thu Mar 7 08:31:54 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 08:31:54 GMT Subject: RFR: 8325674: Constant fold across compares [v3] In-Reply-To: References: Message-ID: On Mon, 26 Feb 2024 23:23:57 GMT, Joshua Cao wrote: >> For example, `x + 1 < 2` -> `x < 2 - 1` iff we can prove that `x + 1` does not overflow and `2 - 1` does not overflow. We can always fold if it is an `==` or `!=` since overflow will not affect the result of the comparison. >> >> Consider this more practical example: >> >> >> public void foo(int[] arr) { >> for (i = arr.length - 1; i >= 0; --i) { >> blackhole(arr[i]); >> } >> } >> >> >> C2 emits a loop guard that looks like `arr.length - 1 < 0`. 
We know `arr.length - 1` does not overflow because `arr.length` is positive. We can fold the comparison into `arr.length < 1`. We have to compute `arr.length - 1` anyway if we enter the loop, but we can avoid the subtraction computation if we never enter the loop. I believe the simplification can also help with stronger integer range analysis in https://bugs.openjdk.org/browse/JDK-8275202. >> >> Some additional notes: >> * there is various overflow checking code across `src/hotspot/share/opto`. I separated out the functions from convertnode.cpp into `type.hpp`. Maybe the functions belong somewhere else? >> * there is a change in Parse::do_if() to repeatedly apply GVN until the test is canonical. We need multiple iterations in the case of `C1 > C2 - X` -> `C2 - X < C1` -> `C2 < X` -> `X > C2`. This fails the assertion if `BoolTest(btest).is_canonical()`. We can avoid this by applying GVN one more time to get `C2 < X`. >> * we should not transform loop backedge conditions. For example, if we have `for (i = 0; i < 10; ++i) {}`, the backedge condition is `i + 1 < 10`. If we transform it into `i < 9`, it messes with CountedLoop's recognition of induction variables and strides. >> * this change optimizes some of the equality checks in `TestUnsignedComparison.java` and breaks the IR checks. I removed those tests. > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > comments with explanations and style changes src/hotspot/share/opto/subnode.cpp line 1586: > 1584: } > 1585: } > 1586: } This looks like heavy code duplication. Can you refactor this? Maybe a helper method? src/hotspot/share/opto/type.cpp line 1761: > 1759: } > 1760: return true; > 1761: } Do you maybe want to assert that no other opcode comes in? Or is there a need for non add/sub opcodes to be passed in? 
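The overflow caveat from the PR summary is easy to demonstrate in plain Java: `x + 1 < 2` and the folded `x < 1` disagree exactly when `x + 1` wraps, while the `==` form may always be folded because wrapping addition is a bijection (illustrative sketch, not code from the patch):

```java
class FoldCompareSketch {
    // Original test: x + 1 < 2. The add wraps at Integer.MAX_VALUE.
    static boolean original(int x) { return x + 1 < 2; }

    // Folded test: x < 2 - 1. Only equivalent when x + 1 cannot overflow.
    static boolean folded(int x) { return x < 1; }

    // For == the fold is always safe: wrapping increment is a bijection,
    // so x + 1 == 2 holds exactly when x == 1, for every int x.
    static boolean originalEq(int x) { return x + 1 == 2; }
    static boolean foldedEq(int x)   { return x == 1; }
}
```

At `x = Integer.MAX_VALUE`, `original` is true (the sum wraps to `Integer.MIN_VALUE`) while `folded` is false, which is why the transform needs the no-overflow proof for `<`-style comparisons.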
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17853#discussion_r1515739956 PR Review Comment: https://git.openjdk.org/jdk/pull/17853#discussion_r1515741084 From epeter at openjdk.org Thu Mar 7 08:51:57 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 08:51:57 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v6] In-Reply-To: References: Message-ID: <0FmAElIq0aU0FRlcmmkflJN3Xtq4ZlgqfiBIm0L0Hko=.457cb674-37c3-4059-b13c-59e4421dd1b2@github.com> On Wed, 6 Mar 2024 06:13:02 GMT, Jasmine Karthikeyan wrote: >> Hi all, I've created this patch which aims to convert common integer minimum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`. While this transform doesn't generally improve code generation on its own, it simplifies control flow and creates new opportunities for vectorization. >> >> I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> IfMinMax.testReductionInt avgt 15 500.307 ± 16.687 ns/op 509.383 ± 32.645 ns/op (no change)* >> IfMinMax.testReductionLong avgt 15 493.184 ± 17.596 ns/op 513.587 ± 28.339 ns/op (no change)* >> IfMinMax.testSingleInt avgt 15 3.588 ± 0.540 ns/op 2.965 ± 1.380 ns/op (no change) >> IfMinMax.testSingleLong avgt 15 3.673 ± 0.128 ns/op 3.506 ± 0.590 ns/op (no change) >> IfMinMax.testVectorInt avgt 15 340.425 ± 13.123 ns/op 59.689 ± 7.509 ns/op + 5.7x >> IfMinMax.testVectorLong avgt 15 326.420 ± 15.554 ns/op 117.190 ± 5.622 ns/op + 2.8x >> >> >> * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? 
>> >> The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Change transform to work on CMoves Nice work, I think this looks much better now! I'm currently a bit tight on time, I'll run the benchmark on my next pass ;) src/hotspot/share/opto/movenode.cpp line 189: > 187: > 188: // Try to identify min/max patterns in CMoves > 189: static Node* is_minmax(PhaseGVN* phase, Node* cmov) { I'm not a fan of `is_...` methods that do more than a check, but actually have a side-effect. I also suggest that `cmov` should already have a `CMoveNode` type, and there should be an assert here. I would probably do it similar to `AddNode::IdealIL` and `AddPNode::Ideal_base_and_offset`: call it `CMoveNode::IdealIL_minmax`. But add an assert to check for int or long. src/hotspot/share/opto/movenode.cpp line 322: > 320: if (phase->C->post_loop_opts_phase()) { > 321: return nullptr; > 322: } Putting the condition here would prevent any future optimizations further down from being executed. I think you should rather put this into the `is_minmax` method. Maybe this condition is now only relevant for `long`, but I think it would not hurt to also have it for `int`, right? test/hotspot/jtreg/compiler/c2/irTests/TestIfMinMax.java line 139: > 137: public long testMaxL2E(long a, long b) { > 138: return a <= b ? b : a; > 139: } I assume some of the `long` patterns should also have become MaxL/MinL in some phase, right? Is there maybe some phase where the IR would actually show that? You can target the IR rule to a phase, I think. Would be worth a try. ------------- Changes requested by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/17574#pullrequestreview-1921796112 PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1515754879 PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1515765294 PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1515768176 From epeter at openjdk.org Thu Mar 7 08:51:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 08:51:58 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v6] In-Reply-To: <0FmAElIq0aU0FRlcmmkflJN3Xtq4ZlgqfiBIm0L0Hko=.457cb674-37c3-4059-b13c-59e4421dd1b2@github.com> References: <0FmAElIq0aU0FRlcmmkflJN3Xtq4ZlgqfiBIm0L0Hko=.457cb674-37c3-4059-b13c-59e4421dd1b2@github.com> Message-ID: <-qQPtIEWrm7eliUVSp6Jhzk4MQMrHQ8Y1zVViYKQ7w8=.6d7a66f4-9cf1-43e4-bbc3-482ee79f77dd@github.com> On Thu, 7 Mar 2024 08:38:42 GMT, Emanuel Peter wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Change transform to work on CMoves > > src/hotspot/share/opto/movenode.cpp line 189: > >> 187: >> 188: // Try to identify min/max patterns in CMoves >> 189: static Node* is_minmax(PhaseGVN* phase, Node* cmov) { > > I'm not a fan of `is_...` methods that do more than a check, but actually have a side-effect. > I also suggest that `cmov` should already have a `CMoveNode` type, and there should be an assert here. > > I would probably do it similar to `AddNode::IdealIL` and `AddPNode::Ideal_base_and_offset`: call it `CMoveNode::IdealIL_minmax`. But add an assert to check for int or long. And then you could actually move the call to `CMoveNode::Ideal`.
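[Editorial aside] For readers following this thread, the if/ternary shapes the patch aims to canonicalize (taken from the PR description quoted above) can be written out in plain Java. This only illustrates the source patterns, not the node transformation itself:

```java
class IfMinMaxShapes {
    // Ternary shape: a > b ? a : b computes a maximum.
    static int maxTernary(int a, int b) {
        return a > b ? a : b;
    }

    // If-store shape: if (a > b) b = a; also leaves the maximum in b.
    static int maxIfStore(int a, int b) {
        if (a > b) {
            b = a;
        }
        return b;
    }
}
```

Both shapes are semantically `Math.max(a, b)`, which is what makes replacing the control flow with a Max node legal; the analogous `<`/`<=` variants correspond to Min.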
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1515761304 From epeter at openjdk.org Thu Mar 7 08:51:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 08:51:58 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v6] In-Reply-To: <-qQPtIEWrm7eliUVSp6Jhzk4MQMrHQ8Y1zVViYKQ7w8=.6d7a66f4-9cf1-43e4-bbc3-482ee79f77dd@github.com> References: <0FmAElIq0aU0FRlcmmkflJN3Xtq4ZlgqfiBIm0L0Hko=.457cb674-37c3-4059-b13c-59e4421dd1b2@github.com> <-qQPtIEWrm7eliUVSp6Jhzk4MQMrHQ8Y1zVViYKQ7w8=.6d7a66f4-9cf1-43e4-bbc3-482ee79f77dd@github.com> Message-ID: <7vRLiWJ_2IIkKnFbdwNqg_fKT3WYuvj7YZCXcKx1cFE=.d4c1fd64-24cc-40e0-8707-4eed20a26135@github.com> On Thu, 7 Mar 2024 08:42:42 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/movenode.cpp line 189: >> >>> 187: >>> 188: // Try to identify min/max patterns in CMoves >>> 189: static Node* is_minmax(PhaseGVN* phase, Node* cmov) { >> >> I'm not a fan of `is_...` methods that do more than a check, but actually have a side-effect. >> I also suggest that `cmov` should already have a `CMoveNode` type, and there should be an assert here. >> >> I would probably do it similar to `AddNode::IdealIL` and `AddPNode::Ideal_base_and_offset`: call it `CMoveNode::IdealIL_minmax`. But add an assert to check for int or long. > > And then you could actually move the call to `CMoveNode::Ideal`.
Who knows, maybe we one day extend this to other types. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1515761693 From gcao at openjdk.org Thu Mar 7 09:06:53 2024 From: gcao at openjdk.org (Gui Cao) Date: Thu, 7 Mar 2024 09:06:53 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 In-Reply-To: References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Wed, 6 Mar 2024 12:20:43 GMT, Robbin Ehn wrote: >> The SHA intrinsics are only used in "LibraryCallKit::inline_digestBase_implCompress" and JVMCI. >> So I think these (plus md5 and chacha) should be put into an ifdef COMPILER2_OR_JVMCI block. (I was going to do that but it slipped my mind) >> >> The MaxVectorSize is defined if JVMCI and/or C2 is defined: >> `NOT_COMPILER2(product(intx, MaxVectorSize, 64,` > >> I agree with @robehn ! We can put those function definitions into an ifdef COMPILER2_OR_JVMCI block to avoid such a problem. I don't see other uses of them for now. > > This also makes it clear that C1/interpreter don't use them, hence if someone needs a speed up there they could try to make use of them. @robehn @RealFYang : Thanks all for the review ------------- PR Comment: https://git.openjdk.org/jdk/pull/18114#issuecomment-1983032275 From gcao at openjdk.org Thu Mar 7 09:16:57 2024 From: gcao at openjdk.org (Gui Cao) Date: Thu, 7 Mar 2024 09:16:57 GMT Subject: Integrated: 8327283: RISC-V: Minimal build failed after JDK-8319716 In-Reply-To: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Tue, 5 Mar 2024 07:41:05 GMT, Gui Cao wrote: > Hi, please review this patch that fixes the minimal build failure for riscv.
> > Error log for minimal build: > > Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 > gmake[3]: *** Waiting for unfinished jobs.... > gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 > gmake[2]: *** Waiting for unfinished jobs.... > > ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) > > === Output from failing command(s) repeated here === > * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > > * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. > === End of repeated output === > > No indication of failed target found. > HELP: Try searching the build log for '] Error'. > HELP: Run 'make doctor' to diagnose build problems.
> > make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 > make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 > > > The root cause is that MaxVectorSize is only defined under COMPILER2. We should use VM_Version::_initial_vector_length instead of MaxVectorSize. > > Testing: > > - [x] linux-riscv minimal fastdebug native build This pull request has now been integrated. Changeset: 12617405 Author: Gui Cao Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/1261740521e364cf40ca7ee160fc10c608d9ab71 Stats: 506 lines in 1 file changed: 253 ins; 252 del; 1 mod 8327283: RISC-V: Minimal build failed after JDK-8319716 Reviewed-by: fyang, rehn ------------- PR: https://git.openjdk.org/jdk/pull/18114 From duke at openjdk.org Thu Mar 7 09:42:07 2024 From: duke at openjdk.org (Oussama Louati) Date: Thu, 7 Mar 2024 09:42:07 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v3] In-Reply-To: References: Message-ID: > Completion of the first version of the migration for several tests. > > These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: > > - Generate constant pool entries filled with method handles and method types. > - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. > - Produce many invokedynamic instructions with a specific constant pool entry.
Oussama Louati has updated the pull request incrementally with one additional commit since the last revision: Fix bytecode length calculation in GenFullCP.java and add new imports in ClassWriterExt.java ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17834/files - new: https://git.openjdk.org/jdk/pull/17834/files/03a5e325..c8315dea Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=01-02 Stats: 6 lines in 2 files changed: 3 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/17834.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834 PR: https://git.openjdk.org/jdk/pull/17834 From enikitin at openjdk.org Thu Mar 7 12:36:11 2024 From: enikitin at openjdk.org (Evgeny Nikitin) Date: Thu, 7 Mar 2024 12:36:11 GMT Subject: RFR: 8327390: JitTester: Implement temporary folder functionality Message-ID: The JITTester relies on standard OS / Java library functionality to create temporary folders and never cleans them. This creates problems in CI machines and also complicates problem investigation. We need to have a dedicated TempDir entity that we could adjust during problem investigations and development. It can also be a good place for various file-related activities, like executing FailureHandler.
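[Editorial aside] A helper of the kind described in this RFR could look roughly like the following sketch. The class name and API here are hypothetical, not the actual JITTester code; it just shows the "create a dedicated temp dir, clean it up deterministically" idea:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical self-cleaning temp-dir helper: creates a dedicated directory
// and removes it recursively on close, so CI machines are not littered.
class TempDirSketch implements AutoCloseable {
    private final Path root;

    TempDirSketch(String prefix) throws IOException {
        root = Files.createTempDirectory(prefix);
    }

    Path root() {
        return root;
    }

    @Override
    public void close() throws IOException {
        // Walk depth-first (reverse order) so children are deleted
        // before their parent directories.
        try (Stream<Path> paths = Files.walk(root)) {
            paths.sorted(Comparator.reverseOrder())
                 .forEach(p -> p.toFile().delete());
        }
    }
}
```

During a failure investigation, the `close()` call could be skipped or made conditional so the generated files stay around for inspection, which matches the motivation given in the RFR.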
------------- Commit messages: - 8327390: JitTester: Implement temporary folder functionality Changes: https://git.openjdk.org/jdk/pull/18128/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18128&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327390 Stats: 76 lines in 4 files changed: 63 ins; 4 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/18128.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18128/head:pull/18128 PR: https://git.openjdk.org/jdk/pull/18128 From gli at openjdk.org Thu Mar 7 12:36:11 2024 From: gli at openjdk.org (Guoxiong Li) Date: Thu, 7 Mar 2024 12:36:11 GMT Subject: RFR: 8327390: JitTester: Implement temporary folder functionality In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 19:58:04 GMT, Evgeny Nikitin wrote: > The JITTester relies on standard OS / Java library functionality to create temporary folders and never cleans them. > > This creates problems in CI machines and also complicates problem investigation. We need to have a dedicated TempDir entity that we could adjust during problem investigations and development. It can also be a good place for various file-related activities, like executing FailureHandler. Looks good. And the issue is not a publicly visible issue. Please mark the issue as non-secret. ------------- Marked as reviewed by gli (Committer). PR Review: https://git.openjdk.org/jdk/pull/18128#pullrequestreview-1919323835 From lmesnik at openjdk.org Thu Mar 7 12:36:11 2024 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Thu, 7 Mar 2024 12:36:11 GMT Subject: RFR: 8327390: JitTester: Implement temporary folder functionality In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 19:58:04 GMT, Evgeny Nikitin wrote: > The JITTester relies on standard OS / Java library functionality to create temporary folders and never cleans them. > > This creates problems in CI machines and also complicates problem investigation.
We need to have a dedicated TempDir entity that we could adjust during problem investigations and development. It can also be a good place for various file-related activities, like executing FailureHandler. Marked as reviewed by lmesnik (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/18128#pullrequestreview-1920425205 From duke at openjdk.org Thu Mar 7 13:56:07 2024 From: duke at openjdk.org (Oussama Louati) Date: Thu, 7 Mar 2024 13:56:07 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v4] In-Reply-To: References: Message-ID: > Completion of the first version of the migration for several tests. > > These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: > > - Generate constant pool entries filled with method handles and method types. > - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. > - Produce many invokedynamic instructions with a specific constant pool entry.
Oussama Louati has updated the pull request incrementally with one additional commit since the last revision: Update imports in GenManyIndyCorrectBootstrap.java ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17834/files - new: https://git.openjdk.org/jdk/pull/17834/files/c8315dea..0ef0b28f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=02-03 Stats: 6 lines in 1 file changed: 0 ins; 3 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/17834.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834 PR: https://git.openjdk.org/jdk/pull/17834 From duke at openjdk.org Thu Mar 7 14:04:07 2024 From: duke at openjdk.org (Oussama Louati) Date: Thu, 7 Mar 2024 14:04:07 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v5] In-Reply-To: References: Message-ID: > Completion of the first version of the migration for several tests. > > These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: > > - Generate constant pool entries filled with method handles and method types. > - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. > - Produce many invokedynamic instructions with a specific constant pool entry. 
Oussama Louati has updated the pull request incrementally with one additional commit since the last revision: Fix typo in error message in GenManyIndyIncorrectBootstrap.java ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17834/files - new: https://git.openjdk.org/jdk/pull/17834/files/0ef0b28f..89292423 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=03-04 Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/17834.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834 PR: https://git.openjdk.org/jdk/pull/17834 From duke at openjdk.org Thu Mar 7 14:07:56 2024 From: duke at openjdk.org (Oussama Louati) Date: Thu, 7 Mar 2024 14:07:56 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v5] In-Reply-To: References: Message-ID: On Thu, 7 Mar 2024 14:04:07 GMT, Oussama Louati wrote: >> Completion of the first version of the migration for several tests. >> >> These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: >> >> - Generate constant pool entries filled with method handles and method types. >> - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. >> - Produce many invokedynamic instructions with a specific constant pool entry. > > Oussama Louati has updated the pull request incrementally with one additional commit since the last revision: > > Fix typo in error message in GenManyIndyIncorrectBootstrap.java I ran the JTreg tests on this PR head after the full conversion of these tests, and nothing unusual happened; the failures seen aren't related to these changes.
------------- PR Review: https://git.openjdk.org/jdk/pull/17834#pullrequestreview-1922528839 From rkennke at openjdk.org Thu Mar 7 14:39:53 2024 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 7 Mar 2024 14:39:53 GMT Subject: RFR: 8327361: Update some comments after JDK-8139457 [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 11:33:56 GMT, Galder Zamarreño wrote: >> Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: >> >> RISCV changes > > I think the changes look fine, but looking closer at the original PR, src/hotspot/cpu/riscv/c1_MacroAssembler_riscv.hpp might also need adjusting. s390 and ppc are probably just fine. @galderz is it ok now? I assume it counts as trivial, too? ------------- PR Comment: https://git.openjdk.org/jdk/pull/18120#issuecomment-1983641324 From roland at openjdk.org Thu Mar 7 14:51:58 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 7 Mar 2024 14:51:58 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v6] In-Reply-To: References: Message-ID: <8W7bn8q19_y3Jan9YSHlX_pvi6q_jllpLpTuHXhSjFw=.0b9fceb7-0a3b-4587-8969-23a997a2dd74@github.com> On Tue, 30 Jan 2024 14:43:41 GMT, Roland Westrelin wrote: >> @shipilev >> You are right, I need to guard the optimization with `UseUnalignedAccesses`. Just added it. Thank you! >> Probably my tests would have run into the `SIGBUS` you mentioned. >> >> About `InitializeNode::coalesce_subword_stores`: >> It only works on raw-stores, which write fields before the initialization of an object. It only works with constants. >> Hence, the pattern is quite different. >> Merging the two would be a lot of work. Too much for me for now. >> But maybe one day we can cover all these cases in a single optimization, that merges/coalesces all sorts of loads and stores, and essentially vectorizes any straight-line code, at least for loads and stores.
>> For now, I just wanted to add the feature that @cl4es and @RogerRiggs were specifically asking for, which is merging array stores for constants and variables (using shift to split). >> >> @rwestrel >> Ok. Well in that case I might have to make a more intelligent pointer-analysis, and parse past `ConvI2L` and `CastII` nodes. > >> Ok. Well in that case I might have to make a more intelligent pointer-analysis, and parse past ConvI2L and CastII nodes. > > Do you still need a traversal of the graph to find the Stores or can you enqueue them for post loop opts then? > @rwestrel > > > Do you intend to add an IR test case? > > I already have IR tests that also do result verification: `test/hotspot/jtreg/compiler/c2/TestMergeStores.java` I missed it. I expected it in `irTests` subdirectory. Why isn't that the case BTW? ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1983668344 From roland at openjdk.org Thu Mar 7 14:58:57 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 7 Mar 2024 14:58:57 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: On Thu, 7 Mar 2024 07:45:13 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/memnode.cpp line 2971: >> >>> 2969: // The goal is to check if two such ArrayPointers are adjacent for a load or store. >>> 2970: // >>> 2971: // Note: we accumulate all constant offsets into constant_offset, even the int constant behind >> >> Is this really needed? For the patterns of interest, aren't the constant pushed down the chain of `AddP` nodes so the address is `(AddP base (AddP ...) constant)`? > > No, they are not pushed down. 
> Consider the access on an int array: > `a[invar + 1]` -> `adr = base + ARRAY_INT_BASE_OFFSET + 4 * ConvI2L(invar + 1)` > We cannot just push the constant `1` out of the `ConvI2L`, after all `invar + 1` could overflow in the int domain ;) That's not quite right, I think. For instance, in this method: private static int test(int[] array, int i) { return array[i + 1]; } the final IR will have the `(AddP base (AddP ...) constant)` because `ConvI2LNode::Ideal` does more than checking for overflow. The actual transformation to that final shape must be delayed until after the CastII nodes are removed though. Why that's the case is puzzling actually because `CastIINode::Ideal()` has logic to push the AddI thru the `CastII` but it's disabled for range check `CastII` nodes. I noticed this while working on 8324517. My recollection was that `ConvI2LNode::Ideal` would push thru both the `CastII` and `ConvI2L` in one go so I wonder if it got broken at some point. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1516290785 From epeter at openjdk.org Thu Mar 7 15:41:20 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 15:41:20 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v3] In-Reply-To: References: Message-ID: <7T93QS_MjoovUHDvfnq9az88QJ64dRcdZPpE9HUj5sw=.1d9563bb-e010-4b32-b635-f13f88c4f683@github.com> > Subtask of https://github.com/openjdk/jdk/pull/16620. > > **Goal** > > - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. > - Refactoring: replace linked-list edges with a compact array for each node. > - No behavioral change to vectorization. > > **Benchmark** > > I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). > All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, > ensuring that we spend a lot of time on the dependency graph compared to other components. 
> > Measured on `linux-x64` and turbo disabled. > > Measuring Compile time difference: > `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` > > TestGraph.java > > public class TestGraph { > static int RANGE = 100_000; > > public static void main(String[] args) { > int[] a = new int[RANGE]; > int[] b = new int[RANGE]; > for (int i = 0; i < 10_000; i++) { > test1(a, b, i % 100); > } > } > > static void test1(int[] a, int[] b, int offset) { > for (int i = 0; i < RANGE/16-200; i++) { > a[i * 16 + 0] = b[i * 16 + 0 + offset]; > a[i * 16 + 1] = b[i * 16 + 1 + offset]; > a[i * 16 + 2] = b[i * 16 + 2 + offset]; > a[i * 16 + 3] = b[i * 16 + 3 + offset]; > a[i * 16 + 4] = b[i * 16 + 4 + offset]; > a[i * 16 + 5] = b[i * 16 + 5 + offset]; > a[i * 16 + 6] = b[i * 16 + 6 + offset]; > a[i * 16 + 7] = b[i * 16 + 7 + offset]; > a[i * 16 + 8] = b[i * 16 + 8 + offset]; > a[i * 16 + 9] = b[i * 16 + 9 + offset]; > a[i * 16 + 10] = b[i * 16 + 10 + offset]; > a[i * 16 + 11] = b[i * 16 + 11 + offset]; > a[i * 16 + 12] = b[i * 16 + 12 + offset]; > a[i * 16 + 13] = b[i * 16 + 13 + offset]; > a[i * 16 + 14] = b[i * 16 + 14 + offset]; > a[i * 16 + 15] = b[i * 16 + 15 + offset]; > } > } > } > > > > Before: > > C2 Compile Time: 14.588 s > ... > IdealLoop: 13.670 s > AutoVectorize: 11.703 s``` > > After: > > C2 Compile Time: 14.468 s > ... > IdealLoop: 13.595 s > AutoVectorize: 11.539 s > > > Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print` > Before: `8_225_680 - 8_258_408` byt... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: missing string Extra -> Memory change ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17812/files - new: https://git.openjdk.org/jdk/pull/17812/files/67def8d2..31b65c6c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/17812.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17812/head:pull/17812 PR: https://git.openjdk.org/jdk/pull/17812 From roland at openjdk.org Thu Mar 7 15:29:00 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 7 Mar 2024 15:29:00 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: <4xUS7qBreZ6-cAbHSsVRB0u8Nr_MQa3SdrGiG33Nkw4=.6dad9868-ca3d-4617-bfc1-911df4ed7c2d@github.com> References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> <4xUS7qBreZ6-cAbHSsVRB0u8Nr_MQa3SdrGiG33Nkw4=.6dad9868-ca3d-4617-bfc1-911df4ed7c2d@github.com> Message-ID: On Thu, 7 Mar 2024 06:53:21 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/memnode.cpp line 3146: >> >>> 3144: Node* ctrl_s1 = s1->in(MemNode::Control); >>> 3145: Node* ctrl_s2 = s2->in(MemNode::Control); >>> 3146: if (ctrl_s1 != ctrl_s2) { >> >> Do you need to check that `ctrl_s1` and `ctrl_s2` are not null? I suppose this could be called on a dying part of the graph during igvn. > > @rwestrel but then would they not be `TOP` rather than `nullptr`? Maybe. I think the current practice is to be extra careful and assume any input can be null during igvn. What do you think @vnkozlov ? 
>> src/hotspot/share/opto/memnode.cpp line 3154: >> >>> 3152: } >>> 3153: ProjNode* other_proj = ctrl_s1->as_IfProj()->other_if_proj(); >>> 3154: if (other_proj->is_uncommon_trap_proj(Deoptimization::Reason_range_check) == nullptr || >> >> This could be a range check for an unrelated array I suppose. Does it matter? > > I don't think it matters, no. Do you see a scenario where it would matter? > > My argument: > It is safe to do the stores after the RC rather than before it. And if the RC trap relies on the memory state of the stores that were before the RC, then those stores simply don't lose all their uses, and stay in the graph. > After all, we only remove the "last" store by replacing it with the merged store, so the other stores only disappear if they have no other use. Is there a chance then that we store to the same element twice (once with the store that we wanted to remove but haven't and the merged store)? I don't think repeated stores like this happen anywhere else as a result of some transformation. Would it be legal wrt the java specs? Can it be observed from some other thread? I think it would be better to not have to answer these questions and find a way to do the transformation in a way that guarantees the same element is not stored to twice. Can the transformation be delayed until range check smearing has done its job? 
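[Editorial aside] The "variable split with shifts" store pattern being merged in this thread can be sketched in plain Java. This only illustrates the byte-level correspondence the merge relies on (the names are made up, and it deliberately says nothing about the concurrency questions raised above):

```java
class MergeStoresSketch {
    // The per-byte store chain C2 would recognize: eight adjacent byte
    // stores writing the shifted slices of one long (little-endian order).
    static void storeLongAsBytes(byte[] a, int offset, long v) {
        for (int i = 0; i < 8; i++) {
            a[offset + i] = (byte) (v >> (8 * i));
        }
    }

    // Reading the eight bytes back reconstructs the original value, which
    // is what makes replacing the chain with a single 8-byte store of v
    // semantically equivalent in single-threaded code.
    static long readLongFromBytes(byte[] a, int offset) {
        long v = 0;
        for (int i = 7; i >= 0; i--) {
            v = (v << 8) | (a[offset + i] & 0xFFL);
        }
        return v;
    }
}
```

As the discussion notes, single-threaded equivalence is not the whole story: under the Java Memory Model, the merged form must not introduce or repeat writes that the original program did not perform.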
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1516355973 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1516352600 From chagedorn at openjdk.org Thu Mar 7 15:57:02 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 7 Mar 2024 15:57:02 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v3] In-Reply-To: <7T93QS_MjoovUHDvfnq9az88QJ64dRcdZPpE9HUj5sw=.1d9563bb-e010-4b32-b635-f13f88c4f683@github.com> References: <7T93QS_MjoovUHDvfnq9az88QJ64dRcdZPpE9HUj5sw=.1d9563bb-e010-4b32-b635-f13f88c4f683@github.com> Message-ID: On Thu, 7 Mar 2024 15:41:20 GMT, Emanuel Peter wrote: >> Subtask of https://github.com/openjdk/jdk/pull/16620. >> >> **Goal** >> >> - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. >> - Refactoring: replace linked-list edges with a compact array for each node. >> - No behavioral change to vectorization. >> >> **Benchmark** >> >> I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). >> All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, >> ensuring that we spend a lot of time on the dependency graph compared to other components. >> >> Measured on `linux-x64` and turbo disabled. 
>> >> Measuring Compile time difference: >> `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` >> >> TestGraph.java >> >> public class TestGraph { >> static int RANGE = 100_000; >> >> public static void main(String[] args) { >> int[] a = new int[RANGE]; >> int[] b = new int[RANGE]; >> for (int i = 0; i < 10_000; i++) { >> test1(a, b, i % 100); >> } >> } >> >> static void test1(int[] a, int[] b, int offset) { >> for (int i = 0; i < RANGE/16-200; i++) { >> a[i * 16 + 0] = b[i * 16 + 0 + offset]; >> a[i * 16 + 1] = b[i * 16 + 1 + offset]; >> a[i * 16 + 2] = b[i * 16 + 2 + offset]; >> a[i * 16 + 3] = b[i * 16 + 3 + offset]; >> a[i * 16 + 4] = b[i * 16 + 4 + offset]; >> a[i * 16 + 5] = b[i * 16 + 5 + offset]; >> a[i * 16 + 6] = b[i * 16 + 6 + offset]; >> a[i * 16 + 7] = b[i * 16 + 7 + offset]; >> a[i * 16 + 8] = b[i * 16 + 8 + offset]; >> a[i * 16 + 9] = b[i * 16 + 9 + offset]; >> a[i * 16 + 10] = b[i * 16 + 10 + offset]; >> a[i * 16 + 11] = b[i * 16 + 11 + offset]; >> a[i * 16 + 12] = b[i * 16 + 12 + offset]; >> a[i * 16 + 13] = b[i * 16 + 13 + offset]; >> a[i * 16 + 14] = b[i * 16 + 14 + offset]; >> a[i * 16 + 15] = b[i * 16 + 15 + offset]; >> } >> } >> } >> >> >> >> Before: >> >> C2 Compile Time: 14.588 s >> ... >> IdealLoop: 13.670 s >> AutoVectorize: 11.703 s``` >> >> After: >> >> C2 Compile Time: 14.468 s >> ... >> IdealLoop: 13.595 s >> ... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > missing string Extra -> Memory change That's a nice refactoring. I only have some small comments. src/hotspot/share/opto/vectorization.hpp line 456: > 454: class VLoopDependencyGraph : public StackObj { > 455: private: > 456: class DependencyNode; I'm not sure if we should declare classes in the middle of another class. 
Should we move the forward declaration to the top of the file as done in other places as well? src/hotspot/share/opto/vectorization.hpp line 467: > 465: > 466: // Node depth in DAG: bb_idx -> depth > 467: GrowableArray _depth; Suggestion: GrowableArray _depths; src/hotspot/share/opto/vectorization.hpp line 469: > 467: GrowableArray _depth; > 468: > 469: protected: Why is this protected? src/hotspot/share/opto/vectorization.hpp line 545: > 543: void next(); > 544: bool done() const { return _current == nullptr; } > 545: Node* current() const { assert(!done(), "not done yet"); return _current; } For two statements, I suggest going with multiple lines: Suggestion:

    Node* current() const {
      assert(!done(), "not done yet");
      return _current;
    }

------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17812#pullrequestreview-1922553044 PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1516224804 PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1516230789 PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1516231365 PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1516412346 From epeter at openjdk.org Thu Mar 7 15:56:04 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 15:56:04 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: <1j-xpH8yy_BR50jGFKAU1bGQP2M8nlnN4kTQ70xq-7M=.bb619092-7db8-4cbd-bed9-8ebfa92a8fb4@github.com> On Tue, 5 Mar 2024 15:55:12 GMT, Emanuel Peter wrote: >> This is a feature requested by @RogerRiggs and @cl4es . >> >> **Idea** >> >> Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g.
one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. >> >> This patch here supports a few simple use-cases, like these: >> >> Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 >> >> Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 >> >> The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 >> >> **Details** >> >> This draft currently implements the optimization in an additional special IGVN phase: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 >> >> We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. 
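The store shapes described above can be sketched in plain Java (illustrative helpers, not code from the patch):

```java
// Illustrative only: the two kinds of store sequences a merge-stores
// optimization can combine. Helper names here are hypothetical.
class MergeStoresSketch {

    // A value split with shifts into four adjacent byte stores.
    // C2 can undo the split and emit a single 4-byte store instead.
    static void putIntLE(byte[] a, int offset, int v) {
        a[offset + 0] = (byte) (v >>  0);
        a[offset + 1] = (byte) (v >>  8);
        a[offset + 2] = (byte) (v >> 16);
        a[offset + 3] = (byte) (v >> 24);
    }

    // Four adjacent constant byte stores: the constants can be combined
    // into one larger constant written with a single store.
    static void putConstants(byte[] a, int offset) {
        a[offset + 0] = (byte) 0x01;
        a[offset + 1] = (byte) 0x02;
        a[offset + 2] = (byte) 0x03;
        a[offset + 3] = (byte) 0x04;
    }
}
```

Either method writes the same bytes that a single unaligned 4-byte store would; the point of the optimization is that library code can stay in this plain form and still get the wide store.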
We essentially try to establish a chain of mergable stores: >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 >> >> Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either bot... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > a little bit of casting for debug printing code > > Is there a chance then that we store to the same element twice ... Would it be legal wrt the java specs? > > AFAIU, introducing writes that do not exist in original program is an easy way to break JMM conformance. If we merge the writes, we have to make sure the old writes are not done. You _need_ to run jcstress on this change, at very least. Ok. I have never heard of jcstress. But will look into it. Maybe I need to do some more careful checks to ensure that on the merged path there is only the merged store, and the other stores sink into the other paths. More complicated than I thought... ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1983815904 From chagedorn at openjdk.org Thu Mar 7 15:57:04 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 7 Mar 2024 15:57:04 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v2] In-Reply-To: <0ngHbfu0p0-3CdGMe9393YGxCsR9w2vpuqa4WdtZc3s=.ec178db2-87de-47c0-aa8f-2bd1d2e818ef@github.com> References: <0ngHbfu0p0-3CdGMe9393YGxCsR9w2vpuqa4WdtZc3s=.ec178db2-87de-47c0-aa8f-2bd1d2e818ef@github.com> Message-ID: On Thu, 7 Mar 2024 15:33:21 GMT, Emanuel Peter wrote: >> Subtask of https://github.com/openjdk/jdk/pull/16620. >> >> **Goal** >> >> - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. 
>> - Refactoring: replace linked-list edges with a compact array for each node. >> - No behavioral change to vectorization. >> >> **Benchmark** >> >> I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). >> All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, >> ensuring that we spend a lot of time on the dependency graph compared to other components. >> >> Measured on `linux-x64` and turbo disabled. >> >> Measuring Compile time difference: >> `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` >> >> TestGraph.java >> >> public class TestGraph { >> static int RANGE = 100_000; >> >> public static void main(String[] args) { >> int[] a = new int[RANGE]; >> int[] b = new int[RANGE]; >> for (int i = 0; i < 10_000; i++) { >> test1(a, b, i % 100); >> } >> } >> >> static void test1(int[] a, int[] b, int offset) { >> for (int i = 0; i < RANGE/16-200; i++) { >> a[i * 16 + 0] = b[i * 16 + 0 + offset]; >> a[i * 16 + 1] = b[i * 16 + 1 + offset]; >> a[i * 16 + 2] = b[i * 16 + 2 + offset]; >> a[i * 16 + 3] = b[i * 16 + 3 + offset]; >> a[i * 16 + 4] = b[i * 16 + 4 + offset]; >> a[i * 16 + 5] = b[i * 16 + 5 + offset]; >> a[i * 16 + 6] = b[i * 16 + 6 + offset]; >> a[i * 16 + 7] = b[i * 16 + 7 + offset]; >> a[i * 16 + 8] = b[i * 16 + 8 + offset]; >> a[i * 16 + 9] = b[i * 16 + 9 + offset]; >> a[i * 16 + 10] = b[i * 16 + 10 + offset]; >> a[i * 16 + 11] = b[i * 16 + 11 + offset]; >> a[i * 16 + 12] = b[i * 16 + 12 + offset]; >> a[i * 16 + 13] = b[i * 16 + 13 + offset]; >> a[i * 16 + 14] = b[i * 16 + 14 + offset]; >> a[i * 16 + 15] = b[i * 16 + 15 + offset]; >> } >> } >> } >> >> >> >> Before: >> >> C2 Compile Time: 14.588 s >> ... >> IdealLoop: 13.670 s >> AutoVectorize: 11.703 s``` >> >> After: >> >> C2 Compile Time: 14.468 s >> ... 
>> IdealLoop: 13.595 s >> ... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > rename extra -> memory src/hotspot/share/opto/vectorization.cpp line 225: > 223: DependencyNode* dn = new (_arena) DependencyNode(n, memory_pred_edges, _arena); > 224: _dependency_nodes.at_put_grow(_body.bb_idx(n), dn, nullptr); > 225: } The call to `add_node()` suggests that we add a node no matter what. I therefore suggest to either change `add_node` to something like `maybe_add_node` or do the check like that: if (memory_pred_edges.is_nonempty()) { add_node(n1, memory_pred_edges); } src/hotspot/share/opto/vectorization.cpp line 285: > 283: _memory_pred_edges(nullptr) > 284: { > 285: assert(memory_pred_edges.length() > 0, "not empty"); Suggestion: assert(memory_pred_edges.is_nonempty(), "not empty"); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1516366524 PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1516376416 From epeter at openjdk.org Thu Mar 7 15:33:21 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 15:33:21 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v2] In-Reply-To: References: Message-ID: <0ngHbfu0p0-3CdGMe9393YGxCsR9w2vpuqa4WdtZc3s=.ec178db2-87de-47c0-aa8f-2bd1d2e818ef@github.com> > Subtask of https://github.com/openjdk/jdk/pull/16620. > > **Goal** > > - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. > - Refactoring: replace linked-list edges with a compact array for each node. > - No behavioral change to vectorization. > > **Benchmark** > > I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). > All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, > ensuring that we spend a lot of time on the dependency graph compared to other components. 
> > Measured on `linux-x64` and turbo disabled. > > Measuring Compile time difference: > `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` > > TestGraph.java > > public class TestGraph { > static int RANGE = 100_000; > > public static void main(String[] args) { > int[] a = new int[RANGE]; > int[] b = new int[RANGE]; > for (int i = 0; i < 10_000; i++) { > test1(a, b, i % 100); > } > } > > static void test1(int[] a, int[] b, int offset) { > for (int i = 0; i < RANGE/16-200; i++) { > a[i * 16 + 0] = b[i * 16 + 0 + offset]; > a[i * 16 + 1] = b[i * 16 + 1 + offset]; > a[i * 16 + 2] = b[i * 16 + 2 + offset]; > a[i * 16 + 3] = b[i * 16 + 3 + offset]; > a[i * 16 + 4] = b[i * 16 + 4 + offset]; > a[i * 16 + 5] = b[i * 16 + 5 + offset]; > a[i * 16 + 6] = b[i * 16 + 6 + offset]; > a[i * 16 + 7] = b[i * 16 + 7 + offset]; > a[i * 16 + 8] = b[i * 16 + 8 + offset]; > a[i * 16 + 9] = b[i * 16 + 9 + offset]; > a[i * 16 + 10] = b[i * 16 + 10 + offset]; > a[i * 16 + 11] = b[i * 16 + 11 + offset]; > a[i * 16 + 12] = b[i * 16 + 12 + offset]; > a[i * 16 + 13] = b[i * 16 + 13 + offset]; > a[i * 16 + 14] = b[i * 16 + 14 + offset]; > a[i * 16 + 15] = b[i * 16 + 15 + offset]; > } > } > } > > > > Before: > > C2 Compile Time: 14.588 s > ... > IdealLoop: 13.670 s > AutoVectorize: 11.703 s``` > > After: > > C2 Compile Time: 14.468 s > ... > IdealLoop: 13.595 s > AutoVectorize: 11.539 s > > > Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print` > Before: `8_225_680 - 8_258_408` byt... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: rename extra -> memory ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17812/files - new: https://git.openjdk.org/jdk/pull/17812/files/08b8df3f..67def8d2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=00-01 Stats: 31 lines in 2 files changed: 0 ins; 0 del; 31 mod Patch: https://git.openjdk.org/jdk/pull/17812.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17812/head:pull/17812 PR: https://git.openjdk.org/jdk/pull/17812 From epeter at openjdk.org Thu Mar 7 16:05:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 16:05:58 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: On Thu, 7 Mar 2024 14:55:53 GMT, Roland Westrelin wrote: >> No, they are not pushed down. >> Consider the access on an int array: >> `a[invar + 1]` -> `adr = base + ARRAY_INT_BASE_OFFSET + 4 * ConvI2L(invar + 1)` >> We cannot just push the constant `1` out of the `ConvI2L`, after all `invar + 1` could overflow in the int domain ;) > > That's not quite right, I think. For instance, in this method: > > private static int test(int[] array, int i) { > return array[i + 1]; > } > > the final IR will have the `(AddP base (AddP ...) constant)` because `ConvI2LNode::Ideal` does more than checking for overflow. The actual transformation to that final shape must be delayed until after the CastII nodes are removed though. Why that's the case is puzzling actually because `CastIINode::Ideal()` has logic to push the AddI thru the `CastII` but it's disabled for range check `CastII` nodes. I noticed this while working on 8324517. 
My recollection was that `ConvI2LNode::Ideal` would push thru both the `CastII` and `ConvI2L` in one go so I wonder if it got broken at some point. Thanks for info, I'll look into this :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1516427023 From shade at openjdk.org Thu Mar 7 15:38:57 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 7 Mar 2024 15:38:57 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> <4xUS7qBreZ6-cAbHSsVRB0u8Nr_MQa3SdrGiG33Nkw4=.6dad9868-ca3d-4617-bfc1-911df4ed7c2d@github.com> Message-ID: On Thu, 7 Mar 2024 15:23:45 GMT, Roland Westrelin wrote: >> I don't think it matters, no. Do you see a scenario where it would matter? >> >> My argument: >> It is safe to do the stores after the RC rather than before it. And if the RC trap relies on the memory state of the stores that were before the RC, then those stores simply don't lose all their uses, and stay in the graph. >> After all, we only remove the "last" store by replacing it with the merged store, so the other stores only disappear if they have no other use. > > Is there a chance then that we store to the same element twice (once with the store that we wanted to remove but haven't and the merged store)? I don't think repeated stores like this happen anywhere else as a result of some transformation. Would it be legal wrt the java specs? Can it be observed from some other thread? I think it would be better to not have to answer these questions and find a way to do the transformation in a way that guarantees the same element is not stored to twice. > Can the transformation be delayed until range check smearing has done its job? > Is there a chance then that we store to the same element twice ... Would it be legal wrt the java specs? 
AFAIU, introducing writes that do not exist in original program is an easy way to break JMM conformance. If we merge the writes, we have to make sure the old writes are not done. You _need_ to run jcstress on this change, at very least. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1516374382 From chagedorn at openjdk.org Thu Mar 7 16:17:59 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 7 Mar 2024 16:17:59 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v2] In-Reply-To: References: <0ngHbfu0p0-3CdGMe9393YGxCsR9w2vpuqa4WdtZc3s=.ec178db2-87de-47c0-aa8f-2bd1d2e818ef@github.com> Message-ID: On Thu, 7 Mar 2024 15:31:23 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> rename extra -> memory > > src/hotspot/share/opto/vectorization.cpp line 225: > >> 223: DependencyNode* dn = new (_arena) DependencyNode(n, memory_pred_edges, _arena); >> 224: _dependency_nodes.at_put_grow(_body.bb_idx(n), dn, nullptr); >> 225: } > > The call to `add_node()` suggests that we add a node no matter what. I therefore suggest to either change `add_node` to something like `maybe_add_node` or do the check like that: > > if (memory_pred_edges.is_nonempty()) { > add_node(n1, memory_pred_edges); > } For completeness, should we also add a comment here or/and at `DependencyNode` that such a node is only created when there is no direct connection in the C2 memory graph since we would visit direct connections in `PredsIterator` anyways? 
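The layout discussed in this review — a compact per-node array of memory predecessors, added only when non-empty (the "maybe_add_node" suggestion), plus a separate depth array — can be modeled in miniature (illustrative Java, not the C++ code):

```java
import java.util.HashMap;
import java.util.Map;

// Miniature model of the discussed layout; illustrative only.
class TinyDependencyGraph {
    private final Map<Integer, int[]> memoryPreds = new HashMap<>();
    private final int[] depth; // node index -> depth in the DAG

    TinyDependencyGraph(int numNodes) { depth = new int[numNodes]; }

    // Only record an entry when there are extra memory edges; nodes
    // without them are handled through the ordinary memory graph.
    void maybeAddNode(int node, int[] preds) {
        if (preds.length > 0) { memoryPreds.put(node, preds); }
    }

    int[] predsOf(int node) { return memoryPreds.getOrDefault(node, new int[0]); }

    void setDepth(int node, int d) { depth[node] = d; }
    int depthOf(int node)          { return depth[node]; }
}
```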
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1516443249 From epeter at openjdk.org Thu Mar 7 17:07:59 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 17:07:59 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: On Thu, 7 Mar 2024 16:03:20 GMT, Emanuel Peter wrote: >> That's not quite right, I think. For instance, in this method: >> >> private static int test(int[] array, int i) { >> return array[i + 1]; >> } >> >> the final IR will have the `(AddP base (AddP ...) constant)` because `ConvI2LNode::Ideal` does more than checking for overflow. The actual transformation to that final shape must be delayed until after the CastII nodes are removed though. Why that's the case is puzzling actually because `CastIINode::Ideal()` has logic to push the AddI thru the `CastII` but it's disabled for range check `CastII` nodes. I noticed this while working on 8324517. My recollection was that `ConvI2LNode::Ideal` would push thru both the `CastII` and `ConvI2L` in one go so I wonder if it got broken at some point. > > Thanks for info, I'll look into this :) Ah, I see what you are saying. The `AddI` can be pushed through the `ConvI2L`, but only because we know that the types are constrained. The types are constrained because of the `CastII` after the `RangeCheck`. 
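The overflow concern behind this constraint is easy to demonstrate in plain Java (illustrative snippet, not from the patch): the int addition wraps before the widening conversion unless the compiler can prove otherwise.

```java
// Why ConvI2L(AddI(i, 1)) is not the same as AddL(ConvI2L(i), 1)
// unless the type of i is known not to overflow (e.g. via the CastII
// inserted after a range check).
class ConvI2LOverflow {
    // AddI first, then widen: wraps around at Integer.MAX_VALUE.
    static long addThenWiden(int i) { return (long) (i + 1); }

    // Widen first, then AddL: never wraps.
    static long widenThenAdd(int i) { return (long) i + 1L; }
}
```

For `i = Integer.MAX_VALUE` the two differ by 2^32; once a range check has bounded `i`, the AddI cannot wrap and the two shapes become equivalent.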
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1516525702 From enikitin at openjdk.org Thu Mar 7 17:15:02 2024 From: enikitin at openjdk.org (Evgeny Nikitin) Date: Thu, 7 Mar 2024 17:15:02 GMT Subject: Integrated: 8327390: JitTester: Implement temporary folder functionality In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 19:58:04 GMT, Evgeny Nikitin wrote: > The JITTester relies on standard OS / Java library functionality to create temporary folders and never cleans them. > > This creates problems in CI machines and also complicates problems investigation. We need to have a dedicated TempDir entity that we could adjust during problems investigations and development. It can also be a good place for various file-related activities, like executing FailureHandler. This pull request has now been integrated. Changeset: 5aae8030 Author: Evgeny Nikitin Committer: Leonid Mesnik URL: https://git.openjdk.org/jdk/commit/5aae80304c0b1b49341777b9da103638183877d5 Stats: 76 lines in 4 files changed: 63 ins; 4 del; 9 mod 8327390: JitTester: Implement temporary folder functionality Reviewed-by: gli, lmesnik ------------- PR: https://git.openjdk.org/jdk/pull/18128 From duke at openjdk.org Thu Mar 7 17:25:06 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 7 Mar 2024 17:25:06 GMT Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v3] In-Reply-To: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: > The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. > > Below is the performance data on an Intel Tiger Lake machine. 
>
> Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
> -- | -- | -- | --
> MathBench.ceilDouble | 547979 | 2170198 | 3.96
> MathBench.floorDouble | 547979 | 2167459 | 3.96
> MathBench.rintDouble | 547962 | 2130499 | 3.89

Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:

  update implementation for avx=0

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/18089/files
  - new: https://git.openjdk.org/jdk/pull/18089/files/0401e18e..15b36013

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18089&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18089&range=01-02

Stats: 8 lines in 2 files changed: 8 ins; 0 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/18089.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/18089/head:pull/18089

PR: https://git.openjdk.org/jdk/pull/18089

From sviswanathan at openjdk.org Thu Mar 7 18:12:54 2024
From: sviswanathan at openjdk.org (Sandhya Viswanathan)
Date: Thu, 7 Mar 2024 18:12:54 GMT
Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v3]
In-Reply-To:
References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
Message-ID:

On Thu, 7 Mar 2024 17:25:06 GMT, Srinivas Vamsi Parasa wrote:

>> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs.
>>
>> Below is the performance data on an Intel Tiger Lake machine.
>>
>> Benchmark (UseAVX=3) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
>> -- | -- | -- | --
>> MathBench.ceilDouble | 547979 | 2170198 | 3.96
>> MathBench.floorDouble | 547979 | 2167459 | 3.96
>> MathBench.rintDouble | 547962 | 2130499 | 3.89
>> MathBench.addCeilFloorDouble | 501366 | 1754260 | 3.50
>>
>> Benchmark (UseAVX=0) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
>> -- | -- | -- | --
>> MathBench.ceilDouble | 548492 | 2193497 | 4.00
>> MathBench.floorDouble | 548485 | 2192813 | 4.00
>> MathBench.rintDouble | 548488 | 2192578 | 4.00
>> MathBench.addCeilFloorDouble | 501761 | 1644714 | 3.28

> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   update implementation for avx=0

src/hotspot/cpu/x86/x86.ad line 3878:

> 3876:     assert(UseSSE >= 4, "required");
> 3877:     if ((UseAVX == 0) && ($dst$$XMMRegister != $src$$XMMRegister)) {
> 3878:       __ pxor($dst$$XMMRegister, $dst$$XMMRegister);

Please fix the indentation here.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18089#discussion_r1516612031 From jkarthikeyan at openjdk.org Thu Mar 7 18:17:55 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 7 Mar 2024 18:17:55 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v6] In-Reply-To: <0FmAElIq0aU0FRlcmmkflJN3Xtq4ZlgqfiBIm0L0Hko=.457cb674-37c3-4059-b13c-59e4421dd1b2@github.com> References: <0FmAElIq0aU0FRlcmmkflJN3Xtq4ZlgqfiBIm0L0Hko=.457cb674-37c3-4059-b13c-59e4421dd1b2@github.com> Message-ID: On Thu, 7 Mar 2024 08:45:39 GMT, Emanuel Peter wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Change transform to work on CMoves > > src/hotspot/share/opto/movenode.cpp line 322: > >> 320: if (phase->C->post_loop_opts_phase()) { >> 321: return nullptr; >> 322: } > > Putting the condition here would prevent any future optimization further down not to be executed. I think you should rather put this into the `is_minmax` method. Maybe this condition is now only relevant for `long`, but I think it would not hurt to also have it also for `int`, right? That's a good point, I think this will make the logic cleaner. I don't think it'll hurt it to have it for int either. > test/hotspot/jtreg/compiler/c2/irTests/TestIfMinMax.java line 139: > >> 137: public long testMaxL2E(long a, long b) { >> 138: return a <= b ? b : a; >> 139: } > > I assume some of the `long` patterns should also have become MaxL/MinL in some phase, right? Is there maybe some phase where the IR would actually show that? You can target the IR rule to a phase, I think. Would be worth a try. Oh true, I think we can identify MinL/MaxL before macro expansion is done. 
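For reference, the ternary shapes under discussion are the ones that are semantically `Math.min`/`Math.max` (a sketch mirroring the style of the jtreg test, not the test itself):

```java
// Illustrative shapes the transform should recognize as Min/Max nodes.
class IfMinMaxSketch {
    static int maxI(int a, int b) { return a <= b ? b : a; }

    static int minI(int a, int b) { return a <= b ? a : b; }

    // The long variants become MinL/MaxL macro nodes, which is why the
    // transform must run before (or be aware of) macro expansion.
    static long maxL(long a, long b) { return a <= b ? b : a; }
}
```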
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1516615194
PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1516616923

From duke at openjdk.org Thu Mar 7 18:25:23 2024
From: duke at openjdk.org (Srinivas Vamsi Parasa)
Date: Thu, 7 Mar 2024 18:25:23 GMT
Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v4]
In-Reply-To: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
Message-ID:

> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs.
>
> Below is the performance data on an Intel Tiger Lake machine.
>
> Benchmark (UseAVX=3) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
> -- | -- | -- | --
> MathBench.ceilDouble | 547979 | 2170198 | 3.96
> MathBench.floorDouble | 547979 | 2167459 | 3.96
> MathBench.rintDouble | 547962 | 2130499 | 3.89
> MathBench.addCeilFloorDouble | 501366 | 1754260 | 3.50
>
> Benchmark (UseAVX=0) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
> -- | -- | -- | --
> MathBench.ceilDouble | 548492 | 2193497 | 4.00
> MathBench.floorDouble | 548485 | 2192813 | 4.00 >
MathBench.rintDouble | 548488 | 2192578 | 4.00
> MathBench.addCeilFloorDouble | 501761 | 1644714 | 3.28

Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:

  fix indendation

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/18089/files
  - new: https://git.openjdk.org/jdk/pull/18089/files/15b36013..d35951f6

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18089&range=03
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18089&range=02-03

Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/18089.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/18089/head:pull/18089

PR: https://git.openjdk.org/jdk/pull/18089

From jkarthikeyan at openjdk.org Thu Mar 7 18:31:00 2024
From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan)
Date: Thu, 7 Mar 2024 18:31:00 GMT
Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v6]
In-Reply-To: <7vRLiWJ_2IIkKnFbdwNqg_fKT3WYuvj7YZCXcKx1cFE=.d4c1fd64-24cc-40e0-8707-4eed20a26135@github.com>
References: <0FmAElIq0aU0FRlcmmkflJN3Xtq4ZlgqfiBIm0L0Hko=.457cb674-37c3-4059-b13c-59e4421dd1b2@github.com> <-qQPtIEWrm7eliUVSp6Jhzk4MQMrHQ8Y1zVViYKQ7w8=.6d7a66f4-9cf1-43e4-bbc3-482ee79f77dd@github.com> <7vRLiWJ_2IIkKnFbdwNqg_fKT3WYuvj7YZCXcKx1cFE=.d4c1fd64-24cc-40e0-8707-4eed20a26135@github.com>
Message-ID:

On Thu, 7 Mar 2024 08:43:00 GMT, Emanuel Peter wrote:

>> And then you could actually move the call to `CMoveNode::Ideal`.
>
> Who knows, maybe we one day extend this to other types

I think moving the call to `CMoveNode::Ideal` would be a good idea, since it de-duplicates the call site. Would it still assert on non-supported types, then? I think it may make more sense if it simply filtered out the cmov types that it doesn't (currently) support.
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1516631304

From sviswanathan at openjdk.org Thu Mar 7 19:06:54 2024
From: sviswanathan at openjdk.org (Sandhya Viswanathan)
Date: Thu, 7 Mar 2024 19:06:54 GMT
Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v4]
In-Reply-To:
References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
Message-ID:

On Thu, 7 Mar 2024 18:25:23 GMT, Srinivas Vamsi Parasa wrote:

>> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs.
>>
>> Below is the performance data on an Intel Tiger Lake machine.
>>
>> Benchmark (UseAVX=3) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
>> -- | -- | -- | --
>> MathBench.ceilDouble | 547979 | 2170198 | 3.96
>> MathBench.floorDouble | 547979 | 2167459 | 3.96
>> MathBench.rintDouble | 547962 | 2130499 | 3.89
>> MathBench.addCeilFloorDouble | 501366 | 1754260 | 3.50
>>
>> Benchmark (UseAVX=0) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
>> -- | -- | -- | --
>> MathBench.ceilDouble | 548492 | 2193497 | 4.00
>> MathBench.floorDouble | 548485 | 2192813 | 4.00
>> MathBench.rintDouble | 548488 |
2192578 | 4.00
>> MathBench.addCeilFloorDouble | 501761 | 1644714 | 3.28

> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   fix indendation

@vamsi-parasa Thanks for these additional changes, it looks good to me.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18089#issuecomment-1984235237

From dlong at openjdk.org Thu Mar 7 19:32:55 2024
From: dlong at openjdk.org (Dean Long)
Date: Thu, 7 Mar 2024 19:32:55 GMT
Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v4]
In-Reply-To:
References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
Message-ID:

On Thu, 7 Mar 2024 18:25:23 GMT, Srinivas Vamsi Parasa wrote:

>> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs.
>>
>> Below is the performance data on an Intel Tiger Lake machine.
>>
>> Benchmark (UseAVX=3) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
>> -- | -- | -- | --
>> MathBench.ceilDouble | 547979 | 2170198 | 3.96
>> MathBench.floorDouble | 547979 | 2167459 | 3.96
>> MathBench.rintDouble | 547962 | 2130499 | 3.89
>> MathBench.addCeilFloorDouble | 501366 | 1754260 | 3.50
>>
>> Benchmark (UseAVX=0) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
>> -- | -- | -- | --
>> MathBench.ceilDouble | 548492 | 2193497 | 4.00
>> MathBench.floorDouble | 548485 | 2192813 | 4.00
>> MathBench.rintDouble | 548488 | 2192578 | 4.00
>> MathBench.addCeilFloorDouble | 501761 | 1644714 | 3.28

> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   fix indendation

Marked as reviewed by dlong (Reviewer).

-------------

PR Review: https://git.openjdk.org/jdk/pull/18089#pullrequestreview-1923352759

From duke at openjdk.org Thu Mar 7 21:47:57 2024
From: duke at openjdk.org (Srinivas Vamsi Parasa)
Date: Thu, 7 Mar 2024 21:47:57 GMT
Subject: Integrated: 8327147: Improve performance of Math ceil, floor, and rint for x86
In-Reply-To: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
Message-ID:

On Fri, 1 Mar 2024 19:11:58 GMT, Srinivas Vamsi Parasa wrote:

> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs.
>
> Below is the performance data on an Intel Tiger Lake machine.
> > Benchmark (UseAVX=3) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup > -- | -- | -- | -- > MathBench.ceilDouble | 547979 | 2170198 | 3.96 > MathBench.floorDouble | 547979 | 2167459 | 3.96 > MathBench.rintDouble | 547962 | 2130499 | 3.89 > MathBench.addCeilFloorDouble | 501366 | 1754260 | 3.50 > > Benchmark (UseAVX=0) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup > -- | -- | -- | -- > MathBench.ceilDouble | 548492 | 2193497 | 4.00 > MathBench.floorDouble | 548485 | 2192813 | 4.00 > MathBench.rintDouble | 548488 | 2192578 | 4.00 > MathBench.addCeilFloorDouble | 501761 | 1644714 | 3.28 This pull request has now been integrated.
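[Archive note: the three operations being intrinsified have exact, spec-defined semantics; `Math.ceil` rounds toward positive infinity, `Math.floor` toward negative infinity, and `Math.rint` rounds half-way cases to even. A standalone sketch in plain Java, independent of the PR's code, that can be run to confirm the behavior:]

```java
public class RoundingDemo {
    public static void main(String[] args) {
        // Math.ceil rounds toward positive infinity, Math.floor toward negative infinity.
        System.out.println(Math.ceil(4.1));   // 5.0
        System.out.println(Math.floor(4.1));  // 4.0
        // Math.rint rounds to the nearest integer, with ties going to the even neighbor.
        System.out.println(Math.rint(2.5));   // 2.0 (tie rounds down to even)
        System.out.println(Math.rint(3.5));   // 4.0 (tie rounds up to even)
        // The same directional semantics hold for negative inputs.
        System.out.println(Math.ceil(-4.1));  // -4.0
        System.out.println(Math.floor(-4.1)); // -5.0
    }
}
```

[Per the discussion in this thread, all three map to a single `roundsd` instruction with a different rounding-mode immediate on x86_64, which is what makes the speedup uniform across the benchmarks above.]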
Changeset: 7c5e6e74 Author: vamsi-parasa Committer: Sandhya Viswanathan URL: https://git.openjdk.org/jdk/commit/7c5e6e74c8f559be919cea63ebf7004cda80ae75 Stats: 20 lines in 3 files changed: 8 ins; 11 del; 1 mod 8327147: Improve performance of Math ceil, floor, and rint for x86 Reviewed-by: jbhateja, sviswanathan, dlong ------------- PR: https://git.openjdk.org/jdk/pull/18089 From jbhateja at openjdk.org Fri Mar 8 01:14:58 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 8 Mar 2024 01:14:58 GMT Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v2] In-Reply-To: References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: On Tue, 5 Mar 2024 22:37:49 GMT, Dean Long wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> unify the implementation > > So if we can still generate the non-AVX encoding of > > `roundsd dst, src, mode` > > isn't there still a false dependency problem with `dst`? > @dean-long You bring up a very good point. The SSE instruction (roundsd dst, src, mode) also has a false dependency problem. 
This can be demonstrated by adding the following benchmark to MathBench.java: > > ``` > diff --git a/test/micro/org/openjdk/bench/java/lang/MathBench.java b/test/micro/org/openjdk/bench/java/lang/MathBench.java > index c7dde019154..feb472bba3d 100644 > --- a/test/micro/org/openjdk/bench/java/lang/MathBench.java > +++ b/test/micro/org/openjdk/bench/java/lang/MathBench.java > @@ -141,6 +141,11 @@ public double ceilDouble() { > return Math.ceil(double4Dot1); > } > > + @Benchmark > + public double useAfterCeilDouble() { > + return Math.ceil(double4Dot1) + Math.floor(double4Dot1); > + } > + > @Benchmark > public double copySignDouble() { > return Math.copySign(double81, doubleNegative12); > ``` > > The fix would be to do a pxor on dst before the SSE roundsd instruction, something like below: > > ``` > diff --git a/src/hotspot/cpu/x86/x86.ad b/src/hotspot/cpu/x86/x86.ad > index cf4aef83df2..eb6701f82a7 100644 > --- a/src/hotspot/cpu/x86/x86.ad > +++ b/src/hotspot/cpu/x86/x86.ad > @@ -3874,6 +3874,9 @@ instruct roundD_reg(legRegD dst, legRegD src, immU8 rmode) %{ > ins_cost(150); > ins_encode %{ > assert(UseSSE >= 4, "required"); > + if ((UseAVX == 0) && ($dst$$XMMRegister != $src$$XMMRegister)) { > + __ pxor($dst$$XMMRegister, $dst$$XMMRegister); > + } > __ roundsd($dst$$XMMRegister, $src$$XMMRegister, $rmode$$constant); > %} > ins_pipe(pipe_slow); > ``` FTR following link for more details on above issue https://github.com/openjdk/jdk/pull/16701#issuecomment-1815645570 ------------- PR Comment: https://git.openjdk.org/jdk/pull/18089#issuecomment-1984873081 From dlong at openjdk.org Fri Mar 8 01:44:54 2024 From: dlong at openjdk.org (Dean Long) Date: Fri, 8 Mar 2024 01:44:54 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v6] In-Reply-To: References: Message-ID: <7-xJ8ujbaK_K90zgAFMgdWGkpnN6u8o088Wdt-YCh88=.230e0470-32e0-4da6-a185-65682d4713bb@github.com> On Mon, 4 Mar 2024 09:12:12 GMT, Galder Zamarre?o wrote: >> 
Adding C1 intrinsic for primitive array clone invocations for aarch64 and x86 architectures. >> >> The intrinsic includes a change to avoid zeroing the newly allocated array because its contents are copied over within the same intrinsic with arraycopy. This means that the performance of primitive array clone exceeds that of primitive array copy. As an example, here are the microbenchmark results on darwin/aarch64: >> >> >> $ make test TEST="micro:java.lang.ArrayClone" MICRO="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 3.476 ? 0.018 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 3.740 ? 0.017 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 7.124 ? 0.010 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 39.301 ? 0.106 ns/op >> ArrayClone.byteClone 0 avgt 15 3.478 ? 0.008 ns/op >> ArrayClone.byteClone 10 avgt 15 3.562 ? 0.007 ns/op >> ArrayClone.byteClone 100 avgt 15 5.888 ? 0.206 ns/op >> ArrayClone.byteClone 1000 avgt 15 25.762 ? 0.203 ns/op >> ArrayClone.intArraycopy 0 avgt 15 3.199 ? 0.016 ns/op >> ArrayClone.intArraycopy 10 avgt 15 4.521 ? 0.008 ns/op >> ArrayClone.intArraycopy 100 avgt 15 17.429 ? 0.039 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 178.432 ? 0.777 ns/op >> ArrayClone.intClone 0 avgt 15 3.406 ? 0.016 ns/op >> ArrayClone.intClone 10 avgt 15 4.272 ? 0.006 ns/op >> ArrayClone.intClone 100 avgt 15 13.110 ? 0.122 ns/op >> ArrayClone.intClone 1000 avgt 15 113.196 ? 13.400 ns/op >> >> >> It also includes an optimization to avoid instantiating the array copy stub in scenarios like this. >> >> I run hotspot compiler tests successfully limiting them to C1 compilation darwin/aarch64, linux/x86_64 and linux/686. E.g. >> >> >> $ make test TEST="hotspot_compiler" JTREG="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> ... 
>> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:hotspot_compiler 1234 1234 0 0 >> >> >> One question I had is what to do about non-primitive object arrays, see my [question](https://bugs.openjdk.org/browse/JDK-8302850?focusedId=14634879&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14634879) on the issue. @cl4es any thoughts? >> >>... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits: > > - Merge branch 'master' into topic.0131.c1-array-clone > - Reserve necessary frame map space for clone use cases > - 8302850: C1 primitive array clone intrinsic in graph > > * Combine array length, new type array and arraycopy for clone in c1 graph. > * Add OmitCheckFlags to skip arraycopy checks. > * Instantiate ArrayCopyStub only if necessary. > * Avoid zeroing newly created arrays for clone. > * Add array null after c1 clone compilation test. > * Pass force reexecute to intrinsic via value stack. > This is needed to be able to deoptimize correctly this intrinsic. > * When new type array or array copy are used for the clone intrinsic, > their state needs to be based on the state before for deoptimization > to work as expected. > - Revert "8302850: Primitive array copy C1 intrinsic for aarch64 and x86" > > This reverts commit fe5d916724614391a685bbef58ea939c84197d07. > - 8302850: Link code emit infos for null check and alloc array > - 8302850: Null check array before getting its length > > * Added a jtreg test to verify the null check works. > Without the fix this test fails with a SEGV crash. > - 8302850: Force reexecuting clone in case of a deoptimization > > * Copy state including locals for clone > so that reexecution works as expected. > - 8302850: Avoid instantiating array copy stub for clone use cases > - 8302850: Primitive array copy C1 intrinsic for aarch64 and x86 > > * Clone calls that involve Phi nodes are not supported. 
> * Add unimplemented stubs for other platforms. Your front-end changes require back-end changes, which are only implemented for x86 and aarch64. So you need a way to disable this for other platforms, or port the fix to all platforms. Minimizing the amount of platform-specific code required would also help. ------------- Changes requested by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17667#pullrequestreview-1923880793 From gcao at openjdk.org Fri Mar 8 02:54:56 2024 From: gcao at openjdk.org (Gui Cao) Date: Fri, 8 Mar 2024 02:54:56 GMT Subject: RFR: 8327426: RISC-V: Move alignment shim into initialize_header() in C1_MacroAssembler::allocate_array In-Reply-To: References: Message-ID: On Thu, 7 Mar 2024 02:30:20 GMT, Fei Yang wrote: >> Hi, I noticed that RISC-V missed this change from #11044 [1]: >> >> `I know @albertnetymk already touched on this but some thoughts on the unclear boundaries between the header and the data. My feeling is that the most pragmatic solution would be to have the header initialization always initialize up to the word aligned (up) header_size_in_bytes. (Similarly to how it is done for the instanceOop where the klass gap gets initialized with the header, even if it may be data.) And have the body initialization do the rest (word aligned to word aligned clear).` >> >> `This seems preferable than adding these extra alignment shims in-between the header and body/payload/data initialization. (I also tried moving the alignment fix into the body initialization, but it seems a little bit messier in the implementation.)` >> >> >> After this patch, it will be more consistent with other CPU platforms like X86 and ARM64. >> >> [1] https://github.com/openjdk/jdk/pull/11044#pullrequestreview-1894323275 >> >> ### Tests >> >> - [x] Run tier1-3 tests on SiFive unmatched (release) > > Thanks! @RealFYang Thanks for your review. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18131#issuecomment-1984954773 From gcao at openjdk.org Fri Mar 8 02:54:55 2024 From: gcao at openjdk.org (Gui Cao) Date: Fri, 8 Mar 2024 02:54:55 GMT Subject: RFR: 8327426: RISC-V: Move alignment shim into initialize_header() in C1_MacroAssembler::allocate_array In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 04:06:47 GMT, Gui Cao wrote: > Hi, I noticed that RISC-V missed this change from #11044 [1]: > > `I know @albertnetymk already touched on this but some thoughts on the unclear boundaries between the header and the data. My feeling is that the most pragmatic solution would be to have the header initialization always initialize up to the word aligned (up) header_size_in_bytes. (Similarly to how it is done for the instanceOop where the klass gap gets initialized with the header, even if it may be data.) And have the body initialization do the rest (word aligned to word aligned clear).` > > `This seems preferable than adding these extra alignment shims in-between the header and body/payload/data initialization. (I also tried moving the alignment fix into the body initialization, but it seems a little bit messier in the implementation.)` > > > After this patch, it will be more consistent with other CPU platforms like X86 and ARM64. > > [1] https://github.com/openjdk/jdk/pull/11044#pullrequestreview-1894323275 > > ### Tests > > - [x] Run tier1-3 tests on SiFive unmatched (release) linux-riscv64 builds fine locally. 
GHA failure is infrastructural: https://bugs.openjdk.org/browse/JDK-8326960 ------------- PR Comment: https://git.openjdk.org/jdk/pull/18131#issuecomment-1984954480 From gcao at openjdk.org Fri Mar 8 03:00:59 2024 From: gcao at openjdk.org (Gui Cao) Date: Fri, 8 Mar 2024 03:00:59 GMT Subject: Integrated: 8327426: RISC-V: Move alignment shim into initialize_header() in C1_MacroAssembler::allocate_array In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 04:06:47 GMT, Gui Cao wrote: > Hi, I noticed that RISC-V missed this change from #11044 [1]: > > `I know @albertnetymk already touched on this but some thoughts on the unclear boundaries between the header and the data. My feeling is that the most pragmatic solution would be to have the header initialization always initialize up to the word aligned (up) header_size_in_bytes. (Similarly to how it is done for the instanceOop where the klass gap gets initialized with the header, even if it may be data.) And have the body initialization do the rest (word aligned to word aligned clear).` > > `This seems preferable than adding these extra alignment shims in-between the header and body/payload/data initialization. (I also tried moving the alignment fix into the body initialization, but it seems a little bit messier in the implementation.)` > > > After this patch, it will be more consistent with other CPU platforms like X86 and ARM64. > > [1] https://github.com/openjdk/jdk/pull/11044#pullrequestreview-1894323275 > > ### Tests > > - [x] Run tier1-3 tests on SiFive unmatched (release) This pull request has now been integrated. 
Changeset: de428daf Author: Gui Cao Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/de428daf9adef5afe7347319f7a6f6732e9b6c4b Stats: 16 lines in 1 file changed: 6 ins; 7 del; 3 mod 8327426: RISC-V: Move alignment shim into initialize_header() in C1_MacroAssembler::allocate_array Reviewed-by: fyang ------------- PR: https://git.openjdk.org/jdk/pull/18131 From gcao at openjdk.org Fri Mar 8 03:24:55 2024 From: gcao at openjdk.org (Gui Cao) Date: Fri, 8 Mar 2024 03:24:55 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v2] In-Reply-To: References: Message-ID: On Wed, 28 Feb 2024 13:26:04 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review the patch to add support for some vector intrinsics? >> Also complement various tests on riscv. >> Thanks. >> >> ## Test >> test/hotspot/jtreg/compiler/vectorapi/ >> test/hotspot/jtreg/compiler/vectorization/ > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > modify test config src/hotspot/cpu/riscv/riscv_v.ad line 3237: > 3235: // VectorCastS2X, VectorUCastS2X > 3236: > 3237: instruct vcvtStoB(vReg dst, vReg src) %{ Hi, Should use vcvtStoX instead of vcvtStoB? src/hotspot/cpu/riscv/riscv_v.ad line 3245: > 3243: match(Set dst (VectorCastS2X src)); > 3244: effect(TEMP_DEF dst); > 3245: format %{ "vcvtStoB $dst, $src" %} And Here, vcvtStoX can be used instead of vcvtStoB. test/hotspot/jtreg/compiler/vectorapi/reshape/TestVectorCastRVV.java line 37: > 35: * @modules java.base/jdk.internal.misc > 36: * @summary Test that vector cast intrinsics work as intended on riscv (rvv). > 37: * @requires os.arch == "riscv64" & vm.cpu.features ~= ".*v,.*" is it possible to match rvc here? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1517116676 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1517117063 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1517117349 From jkarthikeyan at openjdk.org Fri Mar 8 03:26:12 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 8 Mar 2024 03:26:12 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v7] In-Reply-To: References: Message-ID: > Hi all, I've created this patch which aims to convert common integer minimum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`. While this transform doesn't generally improve code generation on its own, it simplifies control flow and creates new opportunities for vectorization. > > I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: > > Baseline Patch Improvement > Benchmark Mode Cnt Score Error Units Score Error Units > IfMinMax.testReductionInt avgt 15 500.307 ± 16.687 ns/op 509.383 ± 32.645 ns/op (no change)* > IfMinMax.testReductionLong avgt 15 493.184 ± 17.596 ns/op 513.587 ± 28.339 ns/op (no change)* > IfMinMax.testSingleInt avgt 15 3.588 ± 0.540 ns/op 2.965 ± 1.380 ns/op (no change) > IfMinMax.testSingleLong avgt 15 3.673 ± 0.128 ns/op 3.506 ± 0.590 ns/op (no change) > IfMinMax.testVectorInt avgt 15 340.425 ± 13.123 ns/op 59.689 ± 7.509 ns/op + 5.7x > IfMinMax.testVectorLong avgt 15 326.420 ± 15.554 ns/op 117.190 ± 5.622 ns/op + 2.8x > > > * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? > > The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated!
Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Move logic to CMoveNode::Ideal and improve IR test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17574/files - new: https://git.openjdk.org/jdk/pull/17574/files/2adebb73..f929239a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17574&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17574&range=05-06 Stats: 67 lines in 4 files changed: 15 ins; 32 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/17574.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17574/head:pull/17574 PR: https://git.openjdk.org/jdk/pull/17574 From jkarthikeyan at openjdk.org Fri Mar 8 03:26:12 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 8 Mar 2024 03:26:12 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v6] In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 06:13:02 GMT, Jasmine Karthikeyan wrote: >> Hi all, I've created this patch which aims to convert common integer mininum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`. While this transform doesn't generally improve code generation it's own, it simplifies control flow and creates new opportunities for vectorization. >> >> I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> IfMinMax.testReductionInt avgt 15 500.307 ? 16.687 ns/op 509.383 ? 32.645 ns/op (no change)* >> IfMinMax.testReductionLong avgt 15 493.184 ? 17.596 ns/op 513.587 ? 28.339 ns/op (no change)* >> IfMinMax.testSingleInt avgt 15 3.588 ? 0.540 ns/op 2.965 ? 1.380 ns/op (no change) >> IfMinMax.testSingleLong avgt 15 3.673 ? 0.128 ns/op 3.506 ? 
0.590 ns/op (no change) >> IfMinMax.testVectorInt avgt 15 340.425 ? 13.123 ns/op 59.689 ? 7.509 ns/op + 5.7x >> IfMinMax.testVectorLong avgt 15 326.420 ? 15.554 ns/op 117.190 ? 5.622 ns/op + 2.8x >> >> >> * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? >> >> The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Change transform to work on CMoves Thanks for taking another look! I've pushed a commit that should address the points brought up in review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1984975912 From jkarthikeyan at openjdk.org Fri Mar 8 06:14:57 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 8 Mar 2024 06:14:57 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v7] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 03:26:12 GMT, Jasmine Karthikeyan wrote: >> Hi all, I've created this patch which aims to convert common integer mininum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`. While this transform doesn't generally improve code generation it's own, it simplifies control flow and creates new opportunities for vectorization. >> >> I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> IfMinMax.testReductionInt avgt 15 500.307 ? 16.687 ns/op 509.383 ? 32.645 ns/op (no change)* >> IfMinMax.testReductionLong avgt 15 493.184 ? 17.596 ns/op 513.587 ? 
28.339 ns/op (no change)* >> IfMinMax.testSingleInt avgt 15 3.588 ? 0.540 ns/op 2.965 ? 1.380 ns/op (no change) >> IfMinMax.testSingleLong avgt 15 3.673 ? 0.128 ns/op 3.506 ? 0.590 ns/op (no change) >> IfMinMax.testVectorInt avgt 15 340.425 ? 13.123 ns/op 59.689 ? 7.509 ns/op + 5.7x >> IfMinMax.testVectorLong avgt 15 326.420 ? 15.554 ns/op 117.190 ? 5.622 ns/op + 2.8x >> >> >> * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? >> >> The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Move logic to CMoveNode::Ideal and improve IR test Actually, while experimenting with Min/Max identities I found a case that the current CMove code couldn't fully transform: private static long test(long a, long b) { return Math.max(a, Math.max(a, b)); } Currently, only the second `Math.max` is transformed into a CMove- the first one is left as-is. The issue seems to be that the CMove code is trying to mistakenly use the more conservative loop-based heuristic instead of the one for straight-line code, even though there is no loop. This seems to happen in two places, first here: https://github.com/openjdk/jdk/blob/de428daf9adef5afe7347319f7a6f6732e9b6c4b/src/hotspot/share/opto/loopopts.cpp#L701-L704 This logic seems to be inverted, as it's checking if the region's enclosing loop is the root of the loop tree, or otherwise not a loop. It seems to be `true` if it's *not* in a loop, and `false` when it *is* in a loop. 
This also looks to be corroborated by the [JVM Anatomy Quarks on CMove](https://shipilev.net/jvm/anatomy-quarks/30-conditional-moves/) linked earlier, where CMove only kicks in when the branch percent is >18% and <82%, which was the logic for loop CMoves before [JDK-8319451](https://bugs.openjdk.org/browse/JDK-8319451), even though `doCall` doesn't contain loops. I think this is a pretty simple fix: just invert the boolean expression. Then, there's a second place it happens, here: https://github.com/openjdk/jdk/blob/de428daf9adef5afe7347319f7a6f6732e9b6c4b/src/hotspot/share/opto/loopopts.cpp#L764-L775 Here, it sees if any consumers of the phi are a Cmp or Encode/DecodeNarrowOop, to delay the transform to split-if. In this case, the second If's Cmp consumes the phi, so this code path is triggered. I'm less sure of what to do for this case, though. Here I would say that it's being triggered in error, but there may be other cases where there is a benefit. I think the min/max transform should still be done after the CMove transform, but it'll be a good idea to look at this separately because it could have a widespread impact.
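[Archive note: the algebraic identity behind that test case is easy to check outside the compiler. A small plain-Java sketch; `nestedMax` is an illustrative name, not a method from the patch:]

```java
public class MinMaxIdentity {
    // Mirrors the test case in the discussion: max(a, max(a, b)) must equal max(a, b),
    // which is why C2 should be able to collapse the nested form to a single Max node.
    static long nestedMax(long a, long b) {
        return Math.max(a, Math.max(a, b));
    }

    public static void main(String[] args) {
        long[][] samples = {{1, 2}, {2, 1}, {-5, 3}, {7, 7}, {Long.MIN_VALUE, Long.MAX_VALUE}};
        for (long[] s : samples) {
            // The nested form must agree with the single max for every input pair.
            if (nestedMax(s[0], s[1]) != Math.max(s[0], s[1])) {
                throw new AssertionError("identity failed for " + s[0] + ", " + s[1]);
            }
        }
        System.out.println("max(a, max(a, b)) == max(a, b) for all samples");
    }
}
```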
------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1985098406 From epeter at openjdk.org Fri Mar 8 06:18:00 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 06:18:00 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v3] In-Reply-To: References: <7T93QS_MjoovUHDvfnq9az88QJ64dRcdZPpE9HUj5sw=.1d9563bb-e010-4b32-b635-f13f88c4f683@github.com> Message-ID: <_y7_kzXP48f_sv3riZeAQ4iypp70t97a_5tRQy5VIfw=.f798f117-e84b-46ea-b894-717a6ef496bf@github.com> On Thu, 7 Mar 2024 14:15:21 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> missing string Extra -> Memory change > > src/hotspot/share/opto/vectorization.hpp line 456: > >> 454: class VLoopDependencyGraph : public StackObj { >> 455: private: >> 456: class DependencyNode; > > I'm not sure if we should declare classes in the middle of another class. Should we move the forward declaration to the top of the file as done in other places as well? I see this pattern in other places in the codebase: class ciTypeFlow : public ArenaObj { private: ciEnv* _env; ciMethod* _method; int _osr_bci; bool _has_irreducible_entry; const char* _failure_reason; public: class StateVector; class Loop; class Block; I think it makes sense to declare "internal" (private) classes at the beginning of the class. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1517208777 From epeter at openjdk.org Fri Mar 8 06:32:28 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 06:32:28 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v4] In-Reply-To: References: Message-ID: > Subtask of https://github.com/openjdk/jdk/pull/16620. > > **Goal** > > - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. > - Refactoring: replace linked-list edges with a compact array for each node. 
> - No behavioral change to vectorization. > > **Benchmark** > > I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). > All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, > ensuring that we spend a lot of time on the dependency graph compared to other components. > > Measured on `linux-x64` and turbo disabled. > > Measuring Compile time difference: > `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` > > TestGraph.java > > public class TestGraph { > static int RANGE = 100_000; > > public static void main(String[] args) { > int[] a = new int[RANGE]; > int[] b = new int[RANGE]; > for (int i = 0; i < 10_000; i++) { > test1(a, b, i % 100); > } > } > > static void test1(int[] a, int[] b, int offset) { > for (int i = 0; i < RANGE/16-200; i++) { > a[i * 16 + 0] = b[i * 16 + 0 + offset]; > a[i * 16 + 1] = b[i * 16 + 1 + offset]; > a[i * 16 + 2] = b[i * 16 + 2 + offset]; > a[i * 16 + 3] = b[i * 16 + 3 + offset]; > a[i * 16 + 4] = b[i * 16 + 4 + offset]; > a[i * 16 + 5] = b[i * 16 + 5 + offset]; > a[i * 16 + 6] = b[i * 16 + 6 + offset]; > a[i * 16 + 7] = b[i * 16 + 7 + offset]; > a[i * 16 + 8] = b[i * 16 + 8 + offset]; > a[i * 16 + 9] = b[i * 16 + 9 + offset]; > a[i * 16 + 10] = b[i * 16 + 10 + offset]; > a[i * 16 + 11] = b[i * 16 + 11 + offset]; > a[i * 16 + 12] = b[i * 16 + 12 + offset]; > a[i * 16 + 13] = b[i * 16 + 13 + offset]; > a[i * 16 + 14] = b[i * 16 + 14 + offset]; > a[i * 16 + 15] = b[i * 16 + 15 + offset]; > } > } > } > > > > Before: > > C2 Compile Time: 14.588 s > ... > IdealLoop: 13.670 s > AutoVectorize: 11.703 s``` > > After: > > C2 Compile Time: 14.468 s > ... 
> IdealLoop: 13.595 s > AutoVectorize: 11.539 s > > > Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print` > Before: `8_225_680 - 8_258_408` byt... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: add_node change for Christian ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17812/files - new: https://git.openjdk.org/jdk/pull/17812/files/31b65c6c..dd91e22e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=02-03 Stats: 6 lines in 1 file changed: 4 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/17812.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17812/head:pull/17812 PR: https://git.openjdk.org/jdk/pull/17812 From epeter at openjdk.org Fri Mar 8 06:36:22 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 06:36:22 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v5] In-Reply-To: References: Message-ID: > Subtask of https://github.com/openjdk/jdk/pull/16620. > > **Goal** > > - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. > - Refactoring: replace linked-list edges with a compact array for each node. > - No behavioral change to vectorization. > > **Benchmark** > > I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). > All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, > ensuring that we spend a lot of time on the dependency graph compared to other components. > > Measured on `linux-x64` and turbo disabled. 
> > Measuring Compile time difference: > `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` > > TestGraph.java > > public class TestGraph { > static int RANGE = 100_000; > > public static void main(String[] args) { > int[] a = new int[RANGE]; > int[] b = new int[RANGE]; > for (int i = 0; i < 10_000; i++) { > test1(a, b, i % 100); > } > } > > static void test1(int[] a, int[] b, int offset) { > for (int i = 0; i < RANGE/16-200; i++) { > a[i * 16 + 0] = b[i * 16 + 0 + offset]; > a[i * 16 + 1] = b[i * 16 + 1 + offset]; > a[i * 16 + 2] = b[i * 16 + 2 + offset]; > a[i * 16 + 3] = b[i * 16 + 3 + offset]; > a[i * 16 + 4] = b[i * 16 + 4 + offset]; > a[i * 16 + 5] = b[i * 16 + 5 + offset]; > a[i * 16 + 6] = b[i * 16 + 6 + offset]; > a[i * 16 + 7] = b[i * 16 + 7 + offset]; > a[i * 16 + 8] = b[i * 16 + 8 + offset]; > a[i * 16 + 9] = b[i * 16 + 9 + offset]; > a[i * 16 + 10] = b[i * 16 + 10 + offset]; > a[i * 16 + 11] = b[i * 16 + 11 + offset]; > a[i * 16 + 12] = b[i * 16 + 12 + offset]; > a[i * 16 + 13] = b[i * 16 + 13 + offset]; > a[i * 16 + 14] = b[i * 16 + 14 + offset]; > a[i * 16 + 15] = b[i * 16 + 15 + offset]; > } > } > } > > > > Before: > > C2 Compile Time: 14.588 s > ... > IdealLoop: 13.670 s > AutoVectorize: 11.703 s``` > > After: > > C2 Compile Time: 14.468 s > ... > IdealLoop: 13.595 s > AutoVectorize: 11.539 s > > > Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print` > Before: `8_225_680 - 8_258_408` byt... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: _depth -> _depths for Christian ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17812/files - new: https://git.openjdk.org/jdk/pull/17812/files/dd91e22e..1eeced11 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=03-04 Stats: 7 lines in 1 file changed: 0 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/17812.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17812/head:pull/17812 PR: https://git.openjdk.org/jdk/pull/17812 From epeter at openjdk.org Fri Mar 8 06:47:07 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 06:47:07 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v6] In-Reply-To: References: Message-ID: <2XwLSUCYSveLeQkqv0VynZ-UcjASyW_-jXpzOrjlGzg=.b5a6dda2-9d31-49df-a4c0-26b4f4945ef4@github.com> > Subtask of https://github.com/openjdk/jdk/pull/16620. > > **Goal** > > - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. > - Refactoring: replace linked-list edges with a compact array for each node. > - No behavioral change to vectorization. > > **Benchmark** > > I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). > All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, > ensuring that we spend a lot of time on the dependency graph compared to other components. > > Measured on `linux-x64` and turbo disabled. 
> > Measuring Compile time difference: > `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` > > TestGraph.java > > public class TestGraph { > static int RANGE = 100_000; > > public static void main(String[] args) { > int[] a = new int[RANGE]; > int[] b = new int[RANGE]; > for (int i = 0; i < 10_000; i++) { > test1(a, b, i % 100); > } > } > > static void test1(int[] a, int[] b, int offset) { > for (int i = 0; i < RANGE/16-200; i++) { > a[i * 16 + 0] = b[i * 16 + 0 + offset]; > a[i * 16 + 1] = b[i * 16 + 1 + offset]; > a[i * 16 + 2] = b[i * 16 + 2 + offset]; > a[i * 16 + 3] = b[i * 16 + 3 + offset]; > a[i * 16 + 4] = b[i * 16 + 4 + offset]; > a[i * 16 + 5] = b[i * 16 + 5 + offset]; > a[i * 16 + 6] = b[i * 16 + 6 + offset]; > a[i * 16 + 7] = b[i * 16 + 7 + offset]; > a[i * 16 + 8] = b[i * 16 + 8 + offset]; > a[i * 16 + 9] = b[i * 16 + 9 + offset]; > a[i * 16 + 10] = b[i * 16 + 10 + offset]; > a[i * 16 + 11] = b[i * 16 + 11 + offset]; > a[i * 16 + 12] = b[i * 16 + 12 + offset]; > a[i * 16 + 13] = b[i * 16 + 13 + offset]; > a[i * 16 + 14] = b[i * 16 + 14 + offset]; > a[i * 16 + 15] = b[i * 16 + 15 + offset]; > } > } > } > > > > Before: > > C2 Compile Time: 14.588 s > ... > IdealLoop: 13.670 s > AutoVectorize: 11.703 s``` > > After: > > C2 Compile Time: 14.468 s > ... > IdealLoop: 13.595 s > AutoVectorize: 11.539 s > > > Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print` > Before: `8_225_680 - 8_258_408` byt... 
Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: - Apply from Christian's suggestions Co-authored-by: Christian Hagedorn - remove body() accessor from VLoopDependencyGraph, use field directly ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17812/files - new: https://git.openjdk.org/jdk/pull/17812/files/1eeced11..c3915bd1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=04-05 Stats: 9 lines in 2 files changed: 3 ins; 3 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/17812.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17812/head:pull/17812 PR: https://git.openjdk.org/jdk/pull/17812 From epeter at openjdk.org Fri Mar 8 06:47:07 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 06:47:07 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v6] In-Reply-To: References: <7T93QS_MjoovUHDvfnq9az88QJ64dRcdZPpE9HUj5sw=.1d9563bb-e010-4b32-b635-f13f88c4f683@github.com> Message-ID: On Thu, 7 Mar 2024 14:19:23 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: >> >> - Apply from Christian's suggestions >> >> Co-authored-by: Christian Hagedorn >> - remove body() accessor from VLoopDependencyGraph, use field directly > > src/hotspot/share/opto/vectorization.hpp line 467: > >> 465: >> 466: // Node depth in DAG: bb_idx -> depth >> 467: GrowableArray _depth; > > Suggestion: > > GrowableArray _depths; done > src/hotspot/share/opto/vectorization.hpp line 469: > >> 467: GrowableArray _depth; >> 468: >> 469: protected: > > Why is this protected? Ha. I thought I needed it for access by the inner class in `VLoopDependencyGraph::PredsIterator::next`. But I can directly access the field there. Removing the accessor.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1517224320 PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1517223449 From epeter at openjdk.org Fri Mar 8 06:47:07 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 06:47:07 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v2] In-Reply-To: References: <0ngHbfu0p0-3CdGMe9393YGxCsR9w2vpuqa4WdtZc3s=.ec178db2-87de-47c0-aa8f-2bd1d2e818ef@github.com> Message-ID: On Thu, 7 Mar 2024 16:14:38 GMT, Christian Hagedorn wrote: >> src/hotspot/share/opto/vectorization.cpp line 225: >> >>> 223: DependencyNode* dn = new (_arena) DependencyNode(n, memory_pred_edges, _arena); >>> 224: _dependency_nodes.at_put_grow(_body.bb_idx(n), dn, nullptr); >>> 225: } >> >> The call to `add_node()` suggests that we add a node no matter what. I therefore suggest to either change `add_node` to something like `maybe_add_node` or do the check like that: >> >> if (memory_pred_edges.is_nonempty()) { >> add_node(n1, memory_pred_edges); >> } > > For completeness, should we also add a comment here or/and at `DependencyNode` that such a node is only created when there is no direct connection in the C2 memory graph since we would visit direct connections in `PredsIterator` anyways? Wrote this now: if (memory_pred_edges.is_nonempty()) { // Data edges are taken implicitly from the C2 graph, thus we only add // a dependency node if we have memory edges. add_node(n1, memory_pred_edges); } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1517226544 From epeter at openjdk.org Fri Mar 8 07:48:22 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 07:48:22 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v7] In-Reply-To: References: Message-ID: > Subtask of https://github.com/openjdk/jdk/pull/16620. 
> > **Goal** > > - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. > - Refactoring: replace linked-list edges with a compact array for each node. > - No behavioral change to vectorization. > > **Benchmark** > > I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). > All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, > ensuring that we spend a lot of time on the dependency graph compared to other components. > > Measured on `linux-x64` and turbo disabled. > > Measuring Compile time difference: > `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` > > TestGraph.java > > public class TestGraph { > static int RANGE = 100_000; > > public static void main(String[] args) { > int[] a = new int[RANGE]; > int[] b = new int[RANGE]; > for (int i = 0; i < 10_000; i++) { > test1(a, b, i % 100); > } > } > > static void test1(int[] a, int[] b, int offset) { > for (int i = 0; i < RANGE/16-200; i++) { > a[i * 16 + 0] = b[i * 16 + 0 + offset]; > a[i * 16 + 1] = b[i * 16 + 1 + offset]; > a[i * 16 + 2] = b[i * 16 + 2 + offset]; > a[i * 16 + 3] = b[i * 16 + 3 + offset]; > a[i * 16 + 4] = b[i * 16 + 4 + offset]; > a[i * 16 + 5] = b[i * 16 + 5 + offset]; > a[i * 16 + 6] = b[i * 16 + 6 + offset]; > a[i * 16 + 7] = b[i * 16 + 7 + offset]; > a[i * 16 + 8] = b[i * 16 + 8 + offset]; > a[i * 16 + 9] = b[i * 16 + 9 + offset]; > a[i * 16 + 10] = b[i * 16 + 10 + offset]; > a[i * 16 + 11] = b[i * 16 + 11 + offset]; > a[i * 16 + 12] = b[i * 16 + 12 + offset]; > a[i * 16 + 13] = b[i * 16 + 13 + offset]; > a[i * 16 + 14] = b[i * 16 + 14 + offset]; > a[i * 16 + 15] = b[i * 16 + 15 + offset]; > } > } > } > > > > Before: > > C2 Compile Time: 14.588 s > ... 
> IdealLoop: 13.670 s > AutoVectorize: 11.703 s > > After: > > C2 Compile Time: 14.468 s > ... > IdealLoop: 13.595 s > AutoVectorize: 11.539 s > > > Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print` > Before: `8_225_680 - 8_258_408` byt... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: rm trailing whitespaces from applied suggestion ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17812/files - new: https://git.openjdk.org/jdk/pull/17812/files/c3915bd1..cf4996b9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=05-06 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/17812.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17812/head:pull/17812 PR: https://git.openjdk.org/jdk/pull/17812 From epeter at openjdk.org Fri Mar 8 07:58:22 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 07:58:22 GMT Subject: RFR: 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence) [v6] In-Reply-To: References: Message-ID: > After `combine_pairs_to_longer_packs`, we sometimes create packs that are too long and cannot be vectorized. > There are multiple reasons for that: > > - A pack does not "match" with a use or def pack, and we need to split it. Example: split Z: > > X X X X Y Y Y Y > Z Z Z Z Z Z Z Z > > > - A pack is not implemented in the current size, but would be implemented for a smaller size. Some operations are not implemented at max vector size, and we just need to split every pack to the smaller size that is implemented. Or we have a pack of a non-power-of-2 size, and need to split it down into smaller sizes that are power-of-2. > > - Packs can have pack internal dependence. This dependence happens at a certain "distance".
If we split a pack to be smaller than that "distance", then the internal dependence disappears, and we have the desired mutual independence. Example: > https://github.com/openjdk/jdk/blob/9ee274b0480a6e8e399830fd40c34d99c5621c1b/test/hotspot/jtreg/compiler/loopopts/superword/TestSplitPacks.java#L657-L669 > > Note: Soon I will refactor the packset into a new `PackSet` class, and move the split / filter code there. > > **Further Work** > > [JDK-8309267](https://bugs.openjdk.org/browse/JDK-8309267) C2 SuperWord: some tests fail on KNL machines - fail to vectorize > The issue on KNL machines is that some operations only support vector widths that are smaller than the maximal vector length. Hence, we must split all other vectors that are uses/defs. This change here does exactly that, and so I will be able to put the `UseKNLSetting` in the IRFramework whitelist. I will do that in a separate RFE. Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 33 commits: - Merge branch 'master' into JDK-8309267 - Apply suggestions for comments by Vladimir - Update LoopArrayIndexComputeTest.java copyright year - Update src/hotspot/share/opto/superword.cpp - SplitStatus::Kind enum - SplitTask::Kind enum - manual merge - more fixes for TestSplitPacks.java - fix some IR rules in TestSplitPacks.java - fix MulAddS2I - ... 
and 23 more: https://git.openjdk.org/jdk/compare/de428daf...77e3d47a ------------- Changes: https://git.openjdk.org/jdk/pull/17848/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17848&range=05 Stats: 1268 lines in 5 files changed: 1206 ins; 23 del; 39 mod Patch: https://git.openjdk.org/jdk/pull/17848.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17848/head:pull/17848 PR: https://git.openjdk.org/jdk/pull/17848 From epeter at openjdk.org Fri Mar 8 07:59:20 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 07:59:20 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v8] In-Reply-To: References: Message-ID: > Subtask of https://github.com/openjdk/jdk/pull/16620. > > **Goal** > > - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. > - Refactoring: replace linked-list edges with a compact array for each node. > - No behavioral change to vectorization. > > **Benchmark** > > I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). > All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, > ensuring that we spend a lot of time on the dependency graph compared to other components. > > Measured on `linux-x64` and turbo disabled. 
> > Measuring Compile time difference: > `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` > > TestGraph.java > > public class TestGraph { > static int RANGE = 100_000; > > public static void main(String[] args) { > int[] a = new int[RANGE]; > int[] b = new int[RANGE]; > for (int i = 0; i < 10_000; i++) { > test1(a, b, i % 100); > } > } > > static void test1(int[] a, int[] b, int offset) { > for (int i = 0; i < RANGE/16-200; i++) { > a[i * 16 + 0] = b[i * 16 + 0 + offset]; > a[i * 16 + 1] = b[i * 16 + 1 + offset]; > a[i * 16 + 2] = b[i * 16 + 2 + offset]; > a[i * 16 + 3] = b[i * 16 + 3 + offset]; > a[i * 16 + 4] = b[i * 16 + 4 + offset]; > a[i * 16 + 5] = b[i * 16 + 5 + offset]; > a[i * 16 + 6] = b[i * 16 + 6 + offset]; > a[i * 16 + 7] = b[i * 16 + 7 + offset]; > a[i * 16 + 8] = b[i * 16 + 8 + offset]; > a[i * 16 + 9] = b[i * 16 + 9 + offset]; > a[i * 16 + 10] = b[i * 16 + 10 + offset]; > a[i * 16 + 11] = b[i * 16 + 11 + offset]; > a[i * 16 + 12] = b[i * 16 + 12 + offset]; > a[i * 16 + 13] = b[i * 16 + 13 + offset]; > a[i * 16 + 14] = b[i * 16 + 14 + offset]; > a[i * 16 + 15] = b[i * 16 + 15 + offset]; > } > } > } > > > > Before: > > C2 Compile Time: 14.588 s > ... > IdealLoop: 13.670 s > AutoVectorize: 11.703 s``` > > After: > > C2 Compile Time: 14.468 s > ... > IdealLoop: 13.595 s > AutoVectorize: 11.539 s > > > Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print` > Before: `8_225_680 - 8_258_408` byt... Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 24 additional commits since the last revision: - Merge branch 'master' into JDK-8325651 - rm trailing whitespaces from applied suggestion - Apply from Christian's suggestions Co-authored-by: Christian Hagedorn - remove body() accessor from VLoopDependencyGraph, use field directly - _depth -> _depths for Christian - add_node change for Christian - missing string Extra -> Memory change - rename extra -> memory - typo - fix depth of Phi node - ... and 14 more: https://git.openjdk.org/jdk/compare/3c412c1e...d89119e1 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17812/files - new: https://git.openjdk.org/jdk/pull/17812/files/cf4996b9..d89119e1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=06-07 Stats: 89095 lines in 1876 files changed: 10366 ins; 73809 del; 4920 mod Patch: https://git.openjdk.org/jdk/pull/17812.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17812/head:pull/17812 PR: https://git.openjdk.org/jdk/pull/17812 From duke at openjdk.org Fri Mar 8 09:56:06 2024 From: duke at openjdk.org (Oussama Louati) Date: Fri, 8 Mar 2024 09:56:06 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v6] In-Reply-To: References: Message-ID: > Completion of the first version of the migration for several tests. > > These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: > > - Generate constant pool entries filled with method handles and method types. > - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. > - Produce many invokedynamic instructions with a specific constant pool entry. 
Oussama Louati has updated the pull request incrementally with one additional commit since the last revision: Refactor byte array parameter in generateBytecodes method ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17834/files - new: https://git.openjdk.org/jdk/pull/17834/files/89292423..5fd2d743 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=04-05 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/17834.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834 PR: https://git.openjdk.org/jdk/pull/17834 From mli at openjdk.org Fri Mar 8 12:06:05 2024 From: mli at openjdk.org (Hamlin Li) Date: Fri, 8 Mar 2024 12:06:05 GMT Subject: RFR: 8327689: RISC-V: adjust test filters of zfh extension Message-ID: Hi, Can you review this simple patch? Thanks FYI: test filter `vm.cpu.features ~= ".*zfh,.*"` could be adjusted to `vm.cpu.features ~= ".*zfh.*"` according to comment at https://github.com/openjdk/jdk/pull/17698#discussion_r1517349407 ------------- Commit messages: - Initial commit Changes: https://git.openjdk.org/jdk/pull/18169/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18169&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327689 Stats: 4 lines in 4 files changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/18169.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18169/head:pull/18169 PR: https://git.openjdk.org/jdk/pull/18169 From mli at openjdk.org Fri Mar 8 12:17:15 2024 From: mli at openjdk.org (Hamlin Li) Date: Fri, 8 Mar 2024 12:17:15 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v3] In-Reply-To: References: Message-ID: > Hi, > Can you help to review the patch to add support for some vector intrinsics? > Also complement various tests on riscv. > Thanks. 
> > ## Test > test/hotspot/jtreg/compiler/vectorapi/ > test/hotspot/jtreg/compiler/vectorization/ Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: fix typo ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18040/files - new: https://git.openjdk.org/jdk/pull/18040/files/594927fb..646955f0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18040.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18040/head:pull/18040 PR: https://git.openjdk.org/jdk/pull/18040 From mli at openjdk.org Fri Mar 8 12:17:15 2024 From: mli at openjdk.org (Hamlin Li) Date: Fri, 8 Mar 2024 12:17:15 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v2] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 03:20:34 GMT, Gui Cao wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> modify test config > > src/hotspot/cpu/riscv/riscv_v.ad line 3237: > >> 3235: // VectorCastS2X, VectorUCastS2X >> 3236: >> 3237: instruct vcvtStoB(vReg dst, vReg src) %{ > > Hi, Should use vcvtStoX instead of vcvtStoB? Thanks, fixed the typo. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1517640253 From fyang at openjdk.org Fri Mar 8 13:55:52 2024 From: fyang at openjdk.org (Fei Yang) Date: Fri, 8 Mar 2024 13:55:52 GMT Subject: RFR: 8327689: RISC-V: adjust test filters of zfh extension In-Reply-To: References: Message-ID: <9aXXsh96SkVbXDfscianv13ZteF0sdJ1if-NlPRwuZI=.11b73051-af20-4732-a500-6649027b1fd8@github.com> On Fri, 8 Mar 2024 12:01:02 GMT, Hamlin Li wrote: > Hi, > Can you review this simple patch? 
> Thanks > > FYI: > test filter `vm.cpu.features ~= ".*zfh,.*"` could be adjusted to `vm.cpu.features ~= ".*zfh.*"` according to comment at https://github.com/openjdk/jdk/pull/17698#discussion_r1517349407 Thanks! ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18169#pullrequestreview-1924928132 From ddong at openjdk.org Fri Mar 8 14:13:14 2024 From: ddong at openjdk.org (Denghui Dong) Date: Fri, 8 Mar 2024 14:13:14 GMT Subject: RFR: 8327693: C1: LIRGenerator::_instruction_for_operand is only read by assertion code Message-ID: Hi, Please help review this change that moves _instruction_for_operand into ASSERT block since it is only read by assertion code in c1_LinearScan.cpp. Thanks ------------- Commit messages: - 8327693: C1: LIRGenerator::_instruction_for_operand is only read by assertion code Changes: https://git.openjdk.org/jdk/pull/18170/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18170&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327693 Stats: 25 lines in 2 files changed: 12 ins; 9 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/18170.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18170/head:pull/18170 PR: https://git.openjdk.org/jdk/pull/18170 From duke at openjdk.org Fri Mar 8 17:17:19 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Fri, 8 Mar 2024 17:17:19 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v12] In-Reply-To: References: Message-ID: > The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. > > This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) > > This PR shows up to 19x speedup on buffer sizes of 1MB.
Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 27 commits: - add missing avx_ifma in amd64.java - Merge branch 'master' of https://github.com/vamsi-parasa/jdk into jdk_poly - update asserts for vpmadd52l/hq - Update description of Poly1305 algo - add cpuinfo test for avx_ifma - fix checks for vpmadd52* - fix use_vl to true for vpmadd52* instrs - fix merge issues with avx_ifma - Merge branch 'master' of https://git.openjdk.java.net/jdk into jdk_poly - removed unused merge, faster and, redundant mov - ... and 17 more: https://git.openjdk.org/jdk/compare/5aae8030...35d39dc5 ------------- Changes: https://git.openjdk.org/jdk/pull/17881/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17881&range=11 Stats: 810 lines in 10 files changed: 800 ins; 0 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/17881.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17881/head:pull/17881 PR: https://git.openjdk.org/jdk/pull/17881 From duke at openjdk.org Fri Mar 8 17:19:58 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Fri, 8 Mar 2024 17:19:58 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v11] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 21:40:04 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. >> >> This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) >> >> This PR shows upto 19x speedup on buffer sizes of 1MB. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > update asserts for vpmadd52l/hq Planning to integrate this PR by Monday. 
Could you please let me know if there are any objections? ------------- PR Comment: https://git.openjdk.org/jdk/pull/17881#issuecomment-1986094404 From jbhateja at openjdk.org Fri Mar 8 17:58:55 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 8 Mar 2024 17:58:55 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: <_5z5emOe-VqjE7REHmk72wtJ-X_MUggxilrkXFUjdPo=.e30bafc3-0fc4-4872-a99c-f22e383301e3@github.com> References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> <_6dorzq67KAZsTBHBvbQRDi_xW70bFhJudnxbG88m6I=.33e06bd5-d5fc-4ba8-b740-437155d567cf@github.com> <_5z5emOe-VqjE7REHmk72wtJ-X_MUggxilrkXFUjdPo=.e30bafc3-0fc4-4872-a99c-f22e383301e3@github.com> Message-ID: From duke at openjdk.org Fri Mar 8 18:36:55 2024 From: duke at openjdk.org (Joshua Cao) Date: Fri, 8 Mar 2024 18:36:55 GMT Subject: RFR: 8325674: Constant fold across compares [v3] In-Reply-To: References: Message-ID: On Thu, 7 Mar 2024 08:28:19 GMT, Emanuel Peter wrote: >> Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: >> >> comments with explanations and style changes > > src/hotspot/share/opto/subnode.cpp line 1586: > >> 1584: } >> 1585: } >> 1586: } > > This looks like heavy code duplication. Can you refactor this? Maybe a helper method? I can post a version of this so we can see what it looks like. I actually did this first, but the code got quite ugly.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17853#discussion_r1518126089 From duke at openjdk.org Fri Mar 8 18:56:07 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Fri, 8 Mar 2024 18:56:07 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v13] In-Reply-To: References: Message-ID: > The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. > > This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) > > This PR shows upto 19x speedup on buffer sizes of 1MB. Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: make vpmadd52l/hq generic ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17881/files - new: https://git.openjdk.org/jdk/pull/17881/files/35d39dc5..4d3e0ebb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17881&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17881&range=11-12 Stats: 14 lines in 1 file changed: 14 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/17881.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17881/head:pull/17881 PR: https://git.openjdk.org/jdk/pull/17881 From sviswanathan at openjdk.org Fri Mar 8 23:44:52 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 8 Mar 2024 23:44:52 GMT Subject: RFR: 8327041: Incorrect lane size references in avx512 instructions. In-Reply-To: References: Message-ID: On Thu, 29 Feb 2024 11:09:09 GMT, Jatin Bhateja wrote: > - As per AVX-512 instruction format, a memory operand instruction can use compressed disp8*N encoding. 
> - For instructions which read/write an entire vector from/to memory, the scaling factor (N) computation only takes the vector length into account and is not dependent on vector lane sizes[1]. > - The patch fixes incorrect lane size references in various x86 assembler routines; this is not a functionality bug, but correcting the lane size will make the code compliant with the AVX-512 instruction format specification. > > [1] Intel SDM, Volume 2, Section 2.7.5 Table 2-35 > https://cdrdv2.intel.com/v1/dl/getContent/671200 Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18059#pullrequestreview-1926014748 From vlivanov at openjdk.org Sat Mar 9 02:33:03 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Sat, 9 Mar 2024 02:33:03 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v8] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 07:59:20 GMT, Emanuel Peter wrote: >> Subtask of https://github.com/openjdk/jdk/pull/16620. >> >> **Goal** >> >> - Make the dependency graph a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. >> - Refactoring: replace linked-list edges with a compact array for each node. >> - No behavioral change to vectorization. >> >> **Benchmark** >> >> I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). >> All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, >> ensuring that we spend a lot of time on the dependency graph compared to other components. >> >> Measured on `linux-x64` and turbo disabled.
>> >> Measuring Compile time difference: >> `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` >> >> TestGraph.java >> >> public class TestGraph { >> static int RANGE = 100_000; >> >> public static void main(String[] args) { >> int[] a = new int[RANGE]; >> int[] b = new int[RANGE]; >> for (int i = 0; i < 10_000; i++) { >> test1(a, b, i % 100); >> } >> } >> >> static void test1(int[] a, int[] b, int offset) { >> for (int i = 0; i < RANGE/16-200; i++) { >> a[i * 16 + 0] = b[i * 16 + 0 + offset]; >> a[i * 16 + 1] = b[i * 16 + 1 + offset]; >> a[i * 16 + 2] = b[i * 16 + 2 + offset]; >> a[i * 16 + 3] = b[i * 16 + 3 + offset]; >> a[i * 16 + 4] = b[i * 16 + 4 + offset]; >> a[i * 16 + 5] = b[i * 16 + 5 + offset]; >> a[i * 16 + 6] = b[i * 16 + 6 + offset]; >> a[i * 16 + 7] = b[i * 16 + 7 + offset]; >> a[i * 16 + 8] = b[i * 16 + 8 + offset]; >> a[i * 16 + 9] = b[i * 16 + 9 + offset]; >> a[i * 16 + 10] = b[i * 16 + 10 + offset]; >> a[i * 16 + 11] = b[i * 16 + 11 + offset]; >> a[i * 16 + 12] = b[i * 16 + 12 + offset]; >> a[i * 16 + 13] = b[i * 16 + 13 + offset]; >> a[i * 16 + 14] = b[i * 16 + 14 + offset]; >> a[i * 16 + 15] = b[i * 16 + 15 + offset]; >> } >> } >> } >> >> >> >> Before: >> >> C2 Compile Time: 14.588 s >> ... >> IdealLoop: 13.670 s >> AutoVectorize: 11.703 s``` >> >> After: >> >> C2 Compile Time: 14.468 s >> ... >> IdealLoop: 13.595 s >> ... > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 24 additional commits since the last revision: > > - Merge branch 'master' into JDK-8325651 > - rm trailing whitespaces from applied suggestion > - Apply from Christian's suggestions > > Co-authored-by: Christian Hagedorn > - remove body() accessor from VLoopDependencyGraph, use field directly > - _depth -> _depths for Christian > - add_node change for Christian > - missing string Extra -> Memory change > - rename extra -> memory > - typo > - fix depth of Phi node > - ... and 14 more: https://git.openjdk.org/jdk/compare/dcaa9b26...d89119e1 Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17812#pullrequestreview-1926075439 From vlivanov at openjdk.org Sat Mar 9 02:37:01 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Sat, 9 Mar 2024 02:37:01 GMT Subject: RFR: 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence) [v6] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 07:58:22 GMT, Emanuel Peter wrote: >> After `combine_pairs_to_longer_packs`, we sometimes create packs that are too long and cannot be vectorized. >> There are multiple reasons for that: >> >> - A pack does not "match" with a use or def pack, and we need to split it. Example: split Z: >> >> X X X X Y Y Y Y >> Z Z Z Z Z Z Z Z >> >> >> - A pack is not implemented in the current size, but would be implemented for a smaller size. Some operations are not implemented at max vector size, and we just need to split every pack to the smaller size that is implemented. Or we have a pack of a non-power-of-2 size, and need to split it down into smaller sizes that are power-of-2. >> >> - Packs can have pack internal dependence. This dependence happens at a certain "distance". If we split a pack to be smaller than that "distance", then the internal dependence disappears, and we have the desired mutual independence.
Example: >> https://github.com/openjdk/jdk/blob/9ee274b0480a6e8e399830fd40c34d99c5621c1b/test/hotspot/jtreg/compiler/loopopts/superword/TestSplitPacks.java#L657-L669 >> >> Note: Soon I will refactor the packset into a new `PackSet` class, and move the split / filter code there. >> >> **Further Work** >> >> [JDK-8309267](https://bugs.openjdk.org/browse/JDK-8309267) C2 SuperWord: some tests fail on KNL machines - fail to vectorize >> The issue on KNL machines is that some operations only support vector widths that are smaller than the maximal vector length. Hence, we must split all other vectors that are uses/defs. This change here does exactly that, and so I will be able to put the `UseKNLSetting` in the IRFramework whitelist. I will do that in a separate RFE. > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 33 commits: > > - Merge branch 'master' into JDK-8309267 > - Apply suggestions for comments by Vladimir > - Update LoopArrayIndexComputeTest.java copyright year > - Update src/hotspot/share/opto/superword.cpp > - SplitStatus::Kind enum > - SplitTask::Kind enum > - manual merge > - more fixes for TestSplitPacks.java > - fix some IR rules in TestSplitPacks.java > - fix MulAddS2I > - ... and 23 more: https://git.openjdk.org/jdk/compare/de428daf...77e3d47a Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17848#pullrequestreview-1926075988 From gli at openjdk.org Sat Mar 9 05:01:52 2024 From: gli at openjdk.org (Guoxiong Li) Date: Sat, 9 Mar 2024 05:01:52 GMT Subject: RFR: 8327689: RISC-V: adjust test filters of zfh extension In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 12:01:02 GMT, Hamlin Li wrote: > Hi, > Can you review this simple patch? 
> Thanks > > FYI: > test filter `vm.cpu.features ~= ".*zfh,.*"` could be adjusted to `vm.cpu.features ~= ".*zfh.*"` according to comment at https://github.com/openjdk/jdk/pull/17698#discussion_r1517349407 Looks good. ------------- Marked as reviewed by gli (Committer). PR Review: https://git.openjdk.org/jdk/pull/18169#pullrequestreview-1926092454 From jbhateja at openjdk.org Sat Mar 9 07:13:55 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 9 Mar 2024 07:13:55 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v13] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 18:56:07 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. >> >> This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) >> >> This PR shows upto 19x speedup on buffer sizes of 1MB. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > make vpmadd52l/hq generic Marked as reviewed by jbhateja (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/17881#pullrequestreview-1926118001 From jbhateja at openjdk.org Sat Mar 9 07:13:55 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 9 Mar 2024 07:13:55 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> <_6dorzq67KAZsTBHBvbQRDi_xW70bFhJudnxbG88m6I=.33e06bd5-d5fc-4ba8-b740-437155d567cf@github.com> <_5z5emOe-VqjE7REHmk72wtJ-X_MUggxilrkXFUjdPo=.e30bafc3-0fc4-4872-a99c-f22e383301e3@github.com> Message-ID: On Fri, 8 Mar 2024 17:56:30 GMT, Jatin Bhateja wrote: > > [poly1305_spr_validation.patch](https://github.com/openjdk/jdk/files/14496404/poly1305_spr_validation.patch) > > Hi @vamsi-parasa , We do not want EVEX to VEX demotions for these newly added instruction on AVX512_IFMA targets since there are no VEX equivalent versions of these instructions, please pick the relevant fixes for assembler routines from my above patch. As @sviswa7 mentioned we should make these instruction generic. Thanks @vamsi-parasa ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1518500977 From jbhateja at openjdk.org Sat Mar 9 07:14:56 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 9 Mar 2024 07:14:56 GMT Subject: Integrated: 8327041: Incorrect lane size references in avx512 instructions. In-Reply-To: References: Message-ID: On Thu, 29 Feb 2024 11:09:09 GMT, Jatin Bhateja wrote: > - As per AVX-512 instruction format, a memory operand instruction can use compressed disp8*N encoding. > - For instructions which reads/writes entire vector from/to memory, scaling factor (N) computation only takes into account vector length and is not dependent on vector lane sizes[1]. 
> - Patch fixes incorrect lane size references from various x86 assembler routines; this is not a functionality bug, but correcting the lane size will make the code compliant with the AVX-512 instruction format specification. > > [1] Intel SDM, Volume 2, Section 2.7.5 Table 2-35 > https://cdrdv2.intel.com/v1/dl/getContent/671200 This pull request has now been integrated. Changeset: 2d4c757e Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/2d4c757e2e03b753135d564e9f2761052fdcb189 Stats: 81 lines in 1 file changed: 0 ins; 0 del; 81 mod 8327041: Incorrect lane size references in avx512 instructions. Reviewed-by: sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/18059 From gli at openjdk.org Sat Mar 9 09:12:56 2024 From: gli at openjdk.org (Guoxiong Li) Date: Sat, 9 Mar 2024 09:12:56 GMT Subject: RFR: 8327693: C1: LIRGenerator::_instruction_for_operand is only read by assertion code In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 14:08:57 GMT, Denghui Dong wrote: > Hi, > > Please help review this change that moves _instruction_for_operand into ASSERT block since it is only read by assertion code in c1_LinearScan.cpp. > > Thanks Nice find. Looks good. ------------- Marked as reviewed by gli (Committer). PR Review: https://git.openjdk.org/jdk/pull/18170#pullrequestreview-1926165582 From ksakata at openjdk.org Mon Mar 11 01:39:53 2024 From: ksakata at openjdk.org (Koichi Sakata) Date: Mon, 11 Mar 2024 01:39:53 GMT Subject: RFR: 8323242: Remove vestigial DONT_USE_REGISTER_DEFINES In-Reply-To: References: Message-ID: <129mUYDenCcF6DZA7S2Ak8WkZfe8r7_VIyP2dB3VMug=.770cc17b-3de9-490f-bc71-cd77fdc973de@github.com> On Tue, 5 Mar 2024 08:07:19 GMT, Koichi Sakata wrote: > This pull request removes an unnecessary directive. > > There is no definition of DONT_USE_REGISTER_DEFINES in HotSpot or the build system, so this `#ifndef` conditional directive is always true. We can remove it. > > I built OpenJDK with Zero VM as a test. It was successful.
> > > $ ./configure --with-jvm-variants=zero --enable-debug > $ make images > $ ./build/macosx-aarch64-zero-fastdebug/jdk/bin/java -version > openjdk version "23-internal" 2024-09-17 > OpenJDK Runtime Environment (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk) > OpenJDK 64-Bit Zero VM (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk, interpreted mode) > > > It may be possible to remove the `#define noreg` as well because the CONSTANT_REGISTER_DECLARATION macro creates a variable named noreg, but I can't be sure. When I tried removing the noreg definition and building the OpenJDK, the build was successful. Could someone please review this pull request? ------------- PR Comment: https://git.openjdk.org/jdk/pull/18115#issuecomment-1987477446 From gcao at openjdk.org Mon Mar 11 02:39:03 2024 From: gcao at openjdk.org (Gui Cao) Date: Mon, 11 Mar 2024 02:39:03 GMT Subject: RFR: 8327716: RISC-V: Change type of vector_length param of several assembler functions from int to uint Message-ID: Hi, we noticed that the return type of Matcher::vector_length is uint, but the type of the vector_length param of several assembler functions is int, which is not consistent. This should not affect functionality, but we should change the type of the vector_length param of these assembler functions from int to uint to make the code cleaner.
### Tests - [x] Run tier1-3 tests on LicheePI 4A (release) - [x] Run tier1-3 tests with -XX:+UseRVV on qemu 8.1.0 (release) ------------- Commit messages: - Merge remote-tracking branch 'upstream/master' into JDK-8327716 - 8327716: RISC-V: Change type of vector_length param of several assembler functions from int to uint Changes: https://git.openjdk.org/jdk/pull/18175/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18175&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327716 Stats: 23 lines in 3 files changed: 0 ins; 0 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/18175.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18175/head:pull/18175 PR: https://git.openjdk.org/jdk/pull/18175 From fyang at openjdk.org Mon Mar 11 02:55:52 2024 From: fyang at openjdk.org (Fei Yang) Date: Mon, 11 Mar 2024 02:55:52 GMT Subject: RFR: 8327716: RISC-V: Change type of vector_length param of several assembler functions from int to uint In-Reply-To: References: Message-ID: On Sat, 9 Mar 2024 09:39:38 GMT, Gui Cao wrote: > Hi, we noticed that the return type of Matcher::vector_length is uint, but the type of the vector_length param of several assembler functions is int, which is not consistent. This should not affect functionality, but we should change the type of the vector_length param of these assembler functions from int to uint to make the code cleaner. > > ### Tests > - [x] Run tier1-3 tests on LicheePI 4A (release) > - [x] Run tier1-3 tests with -XX:+UseRVV on qemu 8.1.0 (release) Looks good. Thanks! ------------- Marked as reviewed by fyang (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/18175#pullrequestreview-1926835684 From fyang at openjdk.org Mon Mar 11 04:04:53 2024 From: fyang at openjdk.org (Fei Yang) Date: Mon, 11 Mar 2024 04:04:53 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v3] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 12:17:15 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review the patch to add support for some vector intrinsics? >> Also complement various tests on riscv. >> Thanks. >> >> ## Test >> test/hotspot/jtreg/compiler/vectorapi/ >> test/hotspot/jtreg/compiler/vectorization/ > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > fix typo Hi, I have one comment after a brief look. src/hotspot/cpu/riscv/riscv_v.ad line 3397: > 3395: predicate(Matcher::vector_element_basic_type(n) == T_FLOAT); > 3396: match(Set dst (VectorCastL2X src)); > 3397: effect(TEMP_DEF dst); I see you added `TEMP_DEF dst` for some existing instructs like this one here. Do we really need it? I don't see such a need when reading the overlap constraints on vector operands from the RVV spec [1]: A destination vector register group can overlap a source vector register group only if one of the following holds: The destination EEW equals the source EEW. The destination EEW is smaller than the source EEW and the overlap is in the lowest-numbered part of the source register group (e.g., when LMUL=1, vnsrl.wi v0, v0, 3 is legal, but a destination of v1 is not). The destination EEW is greater than the source EEW, the source EMUL is at least 1, and the overlap is in the highest-numbered part of the destination register group (e.g., when LMUL=8, vzext.vf4 v0, v6 is legal, but a source of v0, v2, or v4 is not). 
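The three overlap rules quoted from the RVV spec can be restated as a small predicate. This Java toy is not HotSpot or spec code: register groups are modeled as `[start, start + len)` ranges of vector registers and EEWs are in bits, which simplifies the spec's EMUL geometry down to the two examples it gives:

```java
// Toy predicate (not HotSpot code) for the RVV source/destination
// register-group overlap rules. A group is [start, start + len)
// vector registers; EEW values are in bits.
public class RvvOverlapSketch {
    static boolean overlaps(int dStart, int dLen, int sStart, int sLen) {
        return dStart < sStart + sLen && sStart < dStart + dLen;
    }

    static boolean overlapLegal(int dStart, int dLen, int dEew,
                                int sStart, int sLen, int sEew,
                                boolean srcEmulAtLeastOne) {
        if (!overlaps(dStart, dLen, sStart, sLen)) return true;
        if (dEew == sEew) return true;                       // rule 1
        if (dEew < sEew) {                                   // rule 2:
            // dest must sit in the lowest-numbered part of the source
            return dStart == sStart && dStart + dLen <= sStart + sLen;
        }
        // rule 3: source must sit in the highest-numbered part of dest
        return srcEmulAtLeastOne
            && sStart >= dStart
            && sStart + sLen == dStart + dLen;
    }

    public static void main(String[] args) {
        // vnsrl.wi v0, v0, 3 at LMUL=1: narrowing into the lowest part
        // of the source group v0..v1 -> legal; dest v1 would not be.
        System.out.println(overlapLegal(0, 1, 32, 0, 2, 64, true)); // true
        System.out.println(overlapLegal(1, 1, 32, 0, 2, 64, true)); // false
        // vzext.vf4 v0, v6 at LMUL=8: source v6..v7 is the highest
        // part of the dest group v0..v7 -> legal; source v4 is not.
        System.out.println(overlapLegal(0, 8, 32, 6, 2, 8, true));  // true
        System.out.println(overlapLegal(0, 8, 32, 4, 2, 8, true));  // false
    }
}
```

The four calls in `main` reproduce exactly the legal/illegal pairs cited in the quoted spec text.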
[1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#sec-vec-operands ------------- PR Review: https://git.openjdk.org/jdk/pull/18040#pullrequestreview-1926897564 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1519134205 From fyang at openjdk.org Mon Mar 11 04:11:52 2024 From: fyang at openjdk.org (Fei Yang) Date: Mon, 11 Mar 2024 04:11:52 GMT Subject: RFR: 8327689: RISC-V: adjust test filters of zfh extension In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 12:01:02 GMT, Hamlin Li wrote: > Hi, > Can you review this simple patch? > Thanks > > FYI: > test filter `vm.cpu.features ~= ".*zfh,.*"` could be adjusted to `vm.cpu.features ~= ".*zfh.*"` according to comment at https://github.com/openjdk/jdk/pull/17698#discussion_r1517349407 FYI: The GHA linux-cross-compile for linux-riscv64 is back working again. You might want to merge and retrigger the GHA. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18169#issuecomment-1987600853 From chagedorn at openjdk.org Mon Mar 11 07:09:00 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 11 Mar 2024 07:09:00 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v8] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 07:59:20 GMT, Emanuel Peter wrote: >> Subtask of https://github.com/openjdk/jdk/pull/16620. >> >> **Goal** >> >> - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. >> - Refactoring: replace linked-list edges with a compact array for each node. >> - No behavioral change to vectorization. >> >> **Benchmark** >> >> I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). >> All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, >> ensuring that we spend a lot of time on the dependency graph compared to other components. >> >> Measured on `linux-x64` and turbo disabled. 
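The `O(n^2)` claim in the benchmark description above is easy to sanity-check: with n interleaved load/store pairs in the body and every store depending on all preceding loads, the graph carries n*(n+1)/2 memory edges. A throwaway Java count (illustrative only, not how C2 builds its dependency graph):

```java
// Back-of-the-envelope count (not how C2 builds its graph): every
// store depends on all loads that precede it in program order, so
// n load/store pairs produce n*(n+1)/2 edges -- quadratic in the
// unrolled body size.
public class DepGraphEdges {
    static long edgeCount(int n) {
        long edges = 0;
        for (int store = 0; store < n; store++) {
            edges += store + 1; // loads at positions 0..store precede it
        }
        return edges;
    }

    public static void main(String[] args) {
        System.out.println(edgeCount(16)); // 136 edges for the 16-way unrolled body
    }
}
```

For the 16-way hand-unrolled loop that is already 136 edges per body, which is why a compact per-node array beats linked-list edges here.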
>> >> Measuring Compile time difference: >> `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` >> >> TestGraph.java >> >> public class TestGraph { >> static int RANGE = 100_000; >> >> public static void main(String[] args) { >> int[] a = new int[RANGE]; >> int[] b = new int[RANGE]; >> for (int i = 0; i < 10_000; i++) { >> test1(a, b, i % 100); >> } >> } >> >> static void test1(int[] a, int[] b, int offset) { >> for (int i = 0; i < RANGE/16-200; i++) { >> a[i * 16 + 0] = b[i * 16 + 0 + offset]; >> a[i * 16 + 1] = b[i * 16 + 1 + offset]; >> a[i * 16 + 2] = b[i * 16 + 2 + offset]; >> a[i * 16 + 3] = b[i * 16 + 3 + offset]; >> a[i * 16 + 4] = b[i * 16 + 4 + offset]; >> a[i * 16 + 5] = b[i * 16 + 5 + offset]; >> a[i * 16 + 6] = b[i * 16 + 6 + offset]; >> a[i * 16 + 7] = b[i * 16 + 7 + offset]; >> a[i * 16 + 8] = b[i * 16 + 8 + offset]; >> a[i * 16 + 9] = b[i * 16 + 9 + offset]; >> a[i * 16 + 10] = b[i * 16 + 10 + offset]; >> a[i * 16 + 11] = b[i * 16 + 11 + offset]; >> a[i * 16 + 12] = b[i * 16 + 12 + offset]; >> a[i * 16 + 13] = b[i * 16 + 13 + offset]; >> a[i * 16 + 14] = b[i * 16 + 14 + offset]; >> a[i * 16 + 15] = b[i * 16 + 15 + offset]; >> } >> } >> } >> >> >> >> Before: >> >> C2 Compile Time: 14.588 s >> ... >> IdealLoop: 13.670 s >> AutoVectorize: 11.703 s``` >> >> After: >> >> C2 Compile Time: 14.468 s >> ... >> IdealLoop: 13.595 s >> ... > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 24 additional commits since the last revision: > > - Merge branch 'master' into JDK-8325651 > - rm trailing whitespaces from applied suggestion > - Apply from Christian's suggestions > > Co-authored-by: Christian Hagedorn > - remove body() accessor from VLoopDependencyGraph, use field directly > - _depth -> _depths for Christian > - add_node change for Christian > - missing string Extra -> Memory change > - rename extra -> memory > - typo > - fix depth of Phi node > - ... and 14 more: https://git.openjdk.org/jdk/compare/75213358...d89119e1 Thanks for the updates! Looks good. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17812#pullrequestreview-1927050290 From epeter at openjdk.org Mon Mar 11 07:15:05 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 11 Mar 2024 07:15:05 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v8] In-Reply-To: References: Message-ID: <1wli0OuvENorCvLhLQhsba5e81CXKuDOuBLgh_2i75U=.031681ce-4e40-48f6-8771-c318c422f88e@github.com> On Mon, 11 Mar 2024 07:06:02 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 24 additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8325651 >> - rm trailing whitespaces from applied suggestion >> - Apply from Christian's suggestions >> >> Co-authored-by: Christian Hagedorn >> - remove body() accessor from VLoopDependencyGraph, use field directly >> - _depth -> _depths for Christian >> - add_node change for Christian >> - missing string Extra -> Memory change >> - rename extra -> memory >> - typo >> - fix depth of Phi node >> - ... and 14 more: https://git.openjdk.org/jdk/compare/05cf327d...d89119e1 > > Thanks for the updates! Looks good. Thanks @chhagedorn @iwanowww for the reviews! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/17812#issuecomment-1987758567 From epeter at openjdk.org Mon Mar 11 07:15:06 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 11 Mar 2024 07:15:06 GMT Subject: Integrated: 8325651: C2 SuperWord: refactor the dependency graph In-Reply-To: References: Message-ID: <-1qOCk3kq-DdEVHi91xUNrbVkxD4IRCFlLekmfUHqRM=.4d705f00-32d6-4ce7-b7fc-1d5b8caf43bf@github.com> On Mon, 12 Feb 2024 16:24:30 GMT, Emanuel Peter wrote: > Subtask of https://github.com/openjdk/jdk/pull/16620. > > **Goal** > > - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. > - Refactoring: replace linked-list edges with a compact array for each node. > - No behavioral change to vectorization. > > **Benchmark** > > I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). > All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, > ensuring that we spend a lot of time on the dependency graph compared to other components. > > Measured on `linux-x64` and turbo disabled. 
> > Measuring Compile time difference: > `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` > > TestGraph.java > > public class TestGraph { > static int RANGE = 100_000; > > public static void main(String[] args) { > int[] a = new int[RANGE]; > int[] b = new int[RANGE]; > for (int i = 0; i < 10_000; i++) { > test1(a, b, i % 100); > } > } > > static void test1(int[] a, int[] b, int offset) { > for (int i = 0; i < RANGE/16-200; i++) { > a[i * 16 + 0] = b[i * 16 + 0 + offset]; > a[i * 16 + 1] = b[i * 16 + 1 + offset]; > a[i * 16 + 2] = b[i * 16 + 2 + offset]; > a[i * 16 + 3] = b[i * 16 + 3 + offset]; > a[i * 16 + 4] = b[i * 16 + 4 + offset]; > a[i * 16 + 5] = b[i * 16 + 5 + offset]; > a[i * 16 + 6] = b[i * 16 + 6 + offset]; > a[i * 16 + 7] = b[i * 16 + 7 + offset]; > a[i * 16 + 8] = b[i * 16 + 8 + offset]; > a[i * 16 + 9] = b[i * 16 + 9 + offset]; > a[i * 16 + 10] = b[i * 16 + 10 + offset]; > a[i * 16 + 11] = b[i * 16 + 11 + offset]; > a[i * 16 + 12] = b[i * 16 + 12 + offset]; > a[i * 16 + 13] = b[i * 16 + 13 + offset]; > a[i * 16 + 14] = b[i * 16 + 14 + offset]; > a[i * 16 + 15] = b[i * 16 + 15 + offset]; > } > } > } > > > > Before: > > C2 Compile Time: 14.588 s > ... > IdealLoop: 13.670 s > AutoVectorize: 11.703 s``` > > After: > > C2 Compile Time: 14.468 s > ... > IdealLoop: 13.595 s > AutoVectorize: 11.539 s > > > Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print` > Before: `8_225_680 - 8_258_408` byt... This pull request has now been integrated. 
Changeset: ca5ca85d Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/ca5ca85d2408abfcb8a37f16476dba13c3b474d0 Stats: 713 lines in 5 files changed: 285 ins; 404 del; 24 mod 8325651: C2 SuperWord: refactor the dependency graph Reviewed-by: chagedorn, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/17812 From epeter at openjdk.org Mon Mar 11 07:36:11 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 11 Mar 2024 07:36:11 GMT Subject: RFR: 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence) [v7] In-Reply-To: References: Message-ID: > After `combine_pairs_to_longer_packs`, we sometimes create packs that are too long and cannot be vectorized. > There are multiple reason for that: > > - A pack does not "match" with a use of def pack, and we need to split it. Example: split Z: > > X X X X Y Y Y Y > Z Z Z Z Z Z Z Z > > > - A pack is not implemented in the current size, but would be implemented for a smaller size. Some operations are not implemented at max vector size, and we just need to split every pack to the smaller size that is implemented. Or we have a pack of a non-power-of-2 size, and need to split it down into smaller sizes that are power-of-2. > > - Packs can have pack internal dependence. This dependence happens at a certain "distance". If we split a pack to be smaller than that "distance", then the internal dependence disappears, and we have the desired mutual independence. Example: > https://github.com/openjdk/jdk/blob/9ee274b0480a6e8e399830fd40c34d99c5621c1b/test/hotspot/jtreg/compiler/loopopts/superword/TestSplitPacks.java#L657-L669 > > Note: Soon I will refactor the packset into a new `PackSet` class, and move the split / filter code there. 
> > **Further Work** > > [JDK-8309267](https://bugs.openjdk.org/browse/JDK-8309267) C2 SuperWord: some tests fail on KNL machines - fail to vectorize > The issue on KNL machines is that some operations only support vector widths that are smaller than the maximal vector length. Hence, we must split all other vectors that are uses/defs. This change here does exactly that, and so I will be able to put the `UseKNLSetting` in the IRFramework whitelist. I will do that in a separate RFE. Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 34 commits: - manual merge master - Merge branch 'master' into JDK-8309267 - Apply suggestions for comments by Vladimir - Update LoopArrayIndexComputeTest.java copyright year - Update src/hotspot/share/opto/superword.cpp - SplitStatus::Kind enum - SplitTask::Kind enum - manual merge - more fixes for TestSplitPacks.java - fix some IR rules in TestSplitPacks.java - ... and 24 more: https://git.openjdk.org/jdk/compare/ca5ca85d...efab8718 ------------- Changes: https://git.openjdk.org/jdk/pull/17848/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17848&range=06 Stats: 1268 lines in 5 files changed: 1206 ins; 23 del; 39 mod Patch: https://git.openjdk.org/jdk/pull/17848.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17848/head:pull/17848 PR: https://git.openjdk.org/jdk/pull/17848 From fyang at openjdk.org Mon Mar 11 07:45:58 2024 From: fyang at openjdk.org (Fei Yang) Date: Mon, 11 Mar 2024 07:45:58 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v3] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 12:17:15 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review the patch to add support for some vector intrinsics? >> Also complement various tests on riscv. >> Thanks. 
>> >> ## Test >> test/hotspot/jtreg/compiler/vectorapi/ >> test/hotspot/jtreg/compiler/vectorization/ > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > fix typo src/hotspot/cpu/riscv/riscv_v.ad line 3220: > 3218: ins_encode %{ > 3219: BasicType bt = Matcher::vector_element_basic_type(this); > 3220: if (is_floating_point_type(bt)) { Could `bt` (the vector element basic type) be floating point type for `VectorUCastB2X` node? I see our aarch64 counterpart has this assertion: `assert(bt == T_SHORT || bt == T_INT || bt == T_LONG, "must be");` [1]. Same question for `VectorUCastS2X` and `VectorUCastI2X` nodes. [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L3752 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1519263293 From chagedorn at openjdk.org Mon Mar 11 10:10:00 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 11 Mar 2024 10:10:00 GMT Subject: RFR: 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence) [v7] In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 07:36:11 GMT, Emanuel Peter wrote: >> After `combine_pairs_to_longer_packs`, we sometimes create packs that are too long and cannot be vectorized. >> There are multiple reason for that: >> >> - A pack does not "match" with a use of def pack, and we need to split it. Example: split Z: >> >> X X X X Y Y Y Y >> Z Z Z Z Z Z Z Z >> >> >> - A pack is not implemented in the current size, but would be implemented for a smaller size. Some operations are not implemented at max vector size, and we just need to split every pack to the smaller size that is implemented. Or we have a pack of a non-power-of-2 size, and need to split it down into smaller sizes that are power-of-2. >> >> - Packs can have pack internal dependence. This dependence happens at a certain "distance". 
If we split a pack to be smaller than that "distance", then the internal dependence disappears, and we have the desired mutual independence. Example: >> https://github.com/openjdk/jdk/blob/9ee274b0480a6e8e399830fd40c34d99c5621c1b/test/hotspot/jtreg/compiler/loopopts/superword/TestSplitPacks.java#L657-L669 >> >> Note: Soon I will refactor the packset into a new `PackSet` class, and move the split / filter code there. >> >> **Further Work** >> >> [JDK-8309267](https://bugs.openjdk.org/browse/JDK-8309267) C2 SuperWord: some tests fail on KNL machines - fail to vectorize >> The issue on KNL machines is that some operations only support vector widths that are smaller than the maximal vector length. Hence, we must split all other vectors that are uses/defs. This change here does exactly that, and so I will be able to put the `UseKNLSetting` in the IRFramework whitelist. I will do that in a separate RFE. > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 34 commits: > > - manual merge master > - Merge branch 'master' into JDK-8309267 > - Apply suggestions for comments by Vladimir > - Update LoopArrayIndexComputeTest.java copyright year > - Update src/hotspot/share/opto/superword.cpp > - SplitStatus::Kind enum > - SplitTask::Kind enum > - manual merge > - more fixes for TestSplitPacks.java > - fix some IR rules in TestSplitPacks.java > - ... and 24 more: https://git.openjdk.org/jdk/compare/ca5ca85d...efab8718 Apart from some minor comments, looks good to me, too! src/hotspot/share/opto/superword.cpp line 1582: > 1580: } > 1581: > 1582: // Split packs that have mutual dependence, until all packs are mutually_independent. Suggestion: // Split packs that have a mutual dependency, until all packs are mutually_independent. src/hotspot/share/opto/superword.cpp line 1590: > 1588: if (!is_marked_reduction(pack->at(0)) && > 1589: !mutually_independent(pack)) { > 1590: // Split in half. 
Maybe you could add a comment here that splitting in half is a best guess/intuitive way to continue src/hotspot/share/opto/superword.cpp line 3017: > 3015: if (!is_reduction_pack && > 3016: (!has_use_pack_superset(n0, n1) || > 3017: !has_use_pack_superset(n1, n0))) { Was first tricked by missing the inversion of the result. Maybe you can flip it and rename it to `has_no_use_pack_superset()`? src/hotspot/share/opto/superword.hpp line 339: > 337: const char* message() const { return _message; } > 338: > 339: int split_size() const { Should be `uint`: Suggestion: uint split_size() const { src/hotspot/share/opto/superword.hpp line 393: > 391: void split_packs(const char* split_name, SplitStrategy strategy); > 392: > 393: // Split packs at boundaries where left and right have different use or def packs. Just a general note, I'm not sure if you need to repeat the comments here when they are identical to the ones found at the definition in the source file. But I guess it does not hurt either. If you only want to keep one, I'd prefer to have the comments in the source file. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/17848#pullrequestreview-1927154552 PR Review Comment: https://git.openjdk.org/jdk/pull/17848#discussion_r1519309037 PR Review Comment: https://git.openjdk.org/jdk/pull/17848#discussion_r1519459587 PR Review Comment: https://git.openjdk.org/jdk/pull/17848#discussion_r1519294500 PR Review Comment: https://git.openjdk.org/jdk/pull/17848#discussion_r1519296397 PR Review Comment: https://git.openjdk.org/jdk/pull/17848#discussion_r1519306208 From epeter at openjdk.org Mon Mar 11 10:32:10 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 11 Mar 2024 10:32:10 GMT Subject: RFR: 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence) [v8] In-Reply-To: References: Message-ID: > After `combine_pairs_to_longer_packs`, we sometimes create packs that are too long and cannot be vectorized. > There are multiple reason for that: > > - A pack does not "match" with a use of def pack, and we need to split it. Example: split Z: > > X X X X Y Y Y Y > Z Z Z Z Z Z Z Z > > > - A pack is not implemented in the current size, but would be implemented for a smaller size. Some operations are not implemented at max vector size, and we just need to split every pack to the smaller size that is implemented. Or we have a pack of a non-power-of-2 size, and need to split it down into smaller sizes that are power-of-2. > > - Packs can have pack internal dependence. This dependence happens at a certain "distance". If we split a pack to be smaller than that "distance", then the internal dependence disappears, and we have the desired mutual independence. Example: > https://github.com/openjdk/jdk/blob/9ee274b0480a6e8e399830fd40c34d99c5621c1b/test/hotspot/jtreg/compiler/loopopts/superword/TestSplitPacks.java#L657-L669 > > Note: Soon I will refactor the packset into a new `PackSet` class, and move the split / filter code there. 
> > **Further Work** > > [JDK-8309267](https://bugs.openjdk.org/browse/JDK-8309267) C2 SuperWord: some tests fail on KNL machines - fail to vectorize > The issue on KNL machines is that some operations only support vector widths that are smaller than the maximal vector length. Hence, we must split all other vectors that are uses/defs. This change here does exactly that, and so I will be able to put the `UseKNLSetting` in the IRFramework whitelist. I will do that in a separate RFE. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: for Christian ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17848/files - new: https://git.openjdk.org/jdk/pull/17848/files/efab8718..747e2f03 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17848&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17848&range=06-07 Stats: 10 lines in 2 files changed: 1 ins; 4 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/17848.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17848/head:pull/17848 PR: https://git.openjdk.org/jdk/pull/17848 From rcastanedalo at openjdk.org Mon Mar 11 10:47:15 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 11 Mar 2024 10:47:15 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect Message-ID: This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`.
Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015. #### Testing - tier1-7 (linux-aarch64 and macosx-x64) with `-XX:LockingMode=2`. ------------- Commit messages: - Add additional temporary register Changes: https://git.openjdk.org/jdk/pull/18183/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18183&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8326385 Stats: 8 lines in 1 file changed: 0 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/18183.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18183/head:pull/18183 PR: https://git.openjdk.org/jdk/pull/18183 From ddong at openjdk.org Mon Mar 11 11:21:18 2024 From: ddong at openjdk.org (Denghui Dong) Date: Mon, 11 Mar 2024 11:21:18 GMT Subject: RFR: 8327661: C1: Make RBP allocatable on x64 when PreserveFramePointer is disabled Message-ID: Hi, Could I have a review of this change that makes RBP allocatable in C1 register allocation when PreserveFramePointer is not enabled? There seems to be no reason that RBP cannot be used. Although the performance of C1 JIT code is not very critical, in my opinion, this change will not add compilation overhead, so it may be acceptable. I am not very sure if I have changed all the places that should be changed. Performance: I wrote a simple JMH benchmark, included in this patch. On Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz Before this change: Benchmark Mode Cnt Score Error Units C1PreserveFramePointer.WithPreserveFramePointer.calculate avgt 16 15.270 ± 0.011 ns/op C1PreserveFramePointer.WithoutPreserveFramePointer.calculate avgt 16 14.479 ±
0.012 ns/op After this change: Benchmark Mode Cnt Score Error Units C1PreserveFramePointer.WithPreserveFramePointer.calculate avgt 16 15.264 ± 0.006 ns/op C1PreserveFramePointer.WithoutPreserveFramePointer.calculate avgt 16 14.057 ± 0.005 ns/op Testing: fastdebug tier1-4 on Linux x64 ------------- Commit messages: - add a jmh test - update comment - fix failure and update header - update comment - 8327661: C1: Make RBP allocatable on x64 when PreserveFramePointer is disabled Changes: https://git.openjdk.org/jdk/pull/18167/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18167&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327661 Stats: 134 lines in 5 files changed: 109 ins; 1 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/18167.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18167/head:pull/18167 PR: https://git.openjdk.org/jdk/pull/18167 From chagedorn at openjdk.org Mon Mar 11 11:25:55 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 11 Mar 2024 11:25:55 GMT Subject: RFR: 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence) [v8] In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 10:32:10 GMT, Emanuel Peter wrote: >> After `combine_pairs_to_longer_packs`, we sometimes create packs that are too long and cannot be vectorized. >> There are multiple reasons for that: >> >> - A pack does not "match" with a use or def pack, and we need to split it. Example: split Z: >> >> X X X X Y Y Y Y >> Z Z Z Z Z Z Z Z >> >> >> - A pack is not implemented in the current size, but would be implemented for a smaller size. Some operations are not implemented at max vector size, and we just need to split every pack to the smaller size that is implemented. Or we have a pack of a non-power-of-2 size, and need to split it down into smaller sizes that are power-of-2. >> >> - Packs can have pack internal dependence. This dependence happens at a certain "distance".
If we split a pack to be smaller than that "distance", then the internal dependence disappears, and we have the desired mutual independence. Example: >> https://github.com/openjdk/jdk/blob/9ee274b0480a6e8e399830fd40c34d99c5621c1b/test/hotspot/jtreg/compiler/loopopts/superword/TestSplitPacks.java#L657-L669 >> >> Note: Soon I will refactor the packset into a new `PackSet` class, and move the split / filter code there. >> >> **Further Work** >> >> [JDK-8309267](https://bugs.openjdk.org/browse/JDK-8309267) C2 SuperWord: some tests fail on KNL machines - fail to vectorize >> The issue on KNL machines is that some operations only support vector widths that are smaller than the maximal vector length. Hence, we must split all other vectors that are uses/defs. This change here does exactly that, and so I will be able to put the `UseKNLSetting` in the IRFramework whitelist. I will do that in a separate RFE. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > for Christian Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/17848#pullrequestreview-1927573214 From mli at openjdk.org Mon Mar 11 12:16:01 2024 From: mli at openjdk.org (Hamlin Li) Date: Mon, 11 Mar 2024 12:16:01 GMT Subject: RFR: 8327689: RISC-V: adjust test filters of zfh extension In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 04:09:03 GMT, Fei Yang wrote: > FYI: The GHA linux-cross-compile for linux-riscv64 is back working again. You might want to merge and retrigger the GHA. Thanks for reminding. Just FYI, it still failed, https://github.com/Hamlin-Li/jdk/actions/runs/8202918492/job/22510046373 Thanks @RealFYang @lgxbslgx for your reviewing. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18169#issuecomment-1988299203 PR Comment: https://git.openjdk.org/jdk/pull/18169#issuecomment-1988299670 From mli at openjdk.org Mon Mar 11 12:16:02 2024 From: mli at openjdk.org (Hamlin Li) Date: Mon, 11 Mar 2024 12:16:02 GMT Subject: Integrated: 8327689: RISC-V: adjust test filters of zfh extension In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 12:01:02 GMT, Hamlin Li wrote: > Hi, > Can you review this simple patch? > Thanks > > FYI: > test filter `vm.cpu.features ~= ".*zfh,.*"` could be adjusted to `vm.cpu.features ~= ".*zfh.*"` according to comment at https://github.com/openjdk/jdk/pull/17698#discussion_r1517349407 This pull request has now been integrated. Changeset: 680ac2ce Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/680ac2cebecf93e5924a441a5de6918cd7adf118 Stats: 4 lines in 4 files changed: 0 ins; 0 del; 4 mod 8327689: RISC-V: adjust test filters of zfh extension Reviewed-by: fyang, gli ------------- PR: https://git.openjdk.org/jdk/pull/18169 From aboldtch at openjdk.org Mon Mar 11 13:04:52 2024 From: aboldtch at openjdk.org (Axel Boldt-Christmas) Date: Mon, 11 Mar 2024 13:04:52 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 09:26:48 GMT, Roberto Castañeda Lozano wrote: > This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`.
> > Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015. > > #### Testing > > - tier1-7 (linux-aarch64 and macosx-x64) with `-XX:LockingMode=2`. lgtm. > * tier1-7 (linux-aarch64 and macosx-x64) with `-XX:LockingMode=2`. Guess it meant to say `macosx-aarch64` ------------- Marked as reviewed by aboldtch (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18183#pullrequestreview-1927766977 From dnsimon at openjdk.org Mon Mar 11 13:10:16 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 11 Mar 2024 13:10:16 GMT Subject: RFR: 8327790: Improve javadoc for ResolvedJavaType.hasFinalizableSubclass Message-ID: This PR fixes and clarifies the javadoc for `ResolvedJavaType.hasFinalizableSubclass()`. 
------------- Commit messages: - fix javadoc for return type of ResolvedJavaType.hasFinalizableSubclass Changes: https://git.openjdk.org/jdk/pull/18192/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18192&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327790 Stats: 5 lines in 1 file changed: 2 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18192.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18192/head:pull/18192 PR: https://git.openjdk.org/jdk/pull/18192 From rcastanedalo at openjdk.org Mon Mar 11 13:31:51 2024 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 11 Mar 2024 13:31:51 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 13:02:29 GMT, Axel Boldt-Christmas wrote: > lgtm. Thanks for reviewing, Axel! > > * tier1-7 (linux-aarch64 and macosx-x64) with `-XX:LockingMode=2`. > > Guess it meant to say `macosx-aarch64` Right, good catch, updated in the description. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18183#issuecomment-1988446387 From gdub at openjdk.org Mon Mar 11 13:40:55 2024 From: gdub at openjdk.org (Gilles Duboscq) Date: Mon, 11 Mar 2024 13:40:55 GMT Subject: RFR: 8327790: Improve javadoc for ResolvedJavaType.hasFinalizableSubclass In-Reply-To: References: Message-ID: <2JC64ehgjieSzyJgc42twIZ47NKvPm7WlYM08Jxl-hU=.009e98f2-9e4b-4e6f-ac99-a8562395b0d1@github.com> On Mon, 11 Mar 2024 13:02:00 GMT, Doug Simon wrote: > This PR fixes and clarifies the javadoc for `ResolvedJavaType.hasFinalizableSubclass()`. Marked as reviewed by gdub (Committer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/18192#pullrequestreview-1927851330 From rkennke at openjdk.org Mon Mar 11 14:03:52 2024 From: rkennke at openjdk.org (Roman Kennke) Date: Mon, 11 Mar 2024 14:03:52 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: References: Message-ID: <7ug7UA5M1tHuFRwxPJtKcQ4hntLKlj08Tfb8QyLKFRE=.6bddd1e2-e6fb-4e0e-84d1-5dbd7be5d226@github.com> On Mon, 11 Mar 2024 09:26:48 GMT, Roberto Castañeda Lozano wrote: > This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`. > > Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015. > > #### Testing > > - tier1-7 (linux-aarch64 and macosx-x64) with `-XX:LockingMode=2`. I've got a question. Also, what about the other arches? src/hotspot/cpu/aarch64/aarch64.ad line 16022: > 16020: %} > 16021: > 16022: instruct cmpFastLockLightweight(rFlagsReg cr, iRegP object, iRegP box, iRegPNoSp tmp, iRegPNoSp tmp2, iRegPNoSp tmp3) Do we need to specify the box register at all, if we never use it? It means that the register allocator assigns an actual register to it, right? This could be a problem in workloads that are both locking-intensive *and* with high register pressure.
You may just not see it with dacapo, etc, because aarch64 has so many registers to begin with. ------------- PR Review: https://git.openjdk.org/jdk/pull/18183#pullrequestreview-1927918533 PR Review Comment: https://git.openjdk.org/jdk/pull/18183#discussion_r1519772909 From aboldtch at openjdk.org Mon Mar 11 16:32:17 2024 From: aboldtch at openjdk.org (Axel Boldt-Christmas) Date: Mon, 11 Mar 2024 16:32:17 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: <7ug7UA5M1tHuFRwxPJtKcQ4hntLKlj08Tfb8QyLKFRE=.6bddd1e2-e6fb-4e0e-84d1-5dbd7be5d226@github.com> References: <7ug7UA5M1tHuFRwxPJtKcQ4hntLKlj08Tfb8QyLKFRE=.6bddd1e2-e6fb-4e0e-84d1-5dbd7be5d226@github.com> Message-ID: On Mon, 11 Mar 2024 14:01:42 GMT, Roman Kennke wrote: > Also, what about the other arches? RISC-V and x64/x86 both bind the box to a specific register so it can be (and is) specified as `USE_KILL`. PPC64 (and aarch64 after this PR) uses an extra register allocation, and does not kill the box. > Do we need to specify the box register at all, if we never use it? I believe that would require rewriting large parts of the C2 FastLockNode. It is modelled as a CmpNode.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18183#issuecomment-1988596188 From rcastanedalo at openjdk.org Mon Mar 11 16:32:22 2024 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 11 Mar 2024 16:32:22 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: <7ug7UA5M1tHuFRwxPJtKcQ4hntLKlj08Tfb8QyLKFRE=.6bddd1e2-e6fb-4e0e-84d1-5dbd7be5d226@github.com> References: <7ug7UA5M1tHuFRwxPJtKcQ4hntLKlj08Tfb8QyLKFRE=.6bddd1e2-e6fb-4e0e-84d1-5dbd7be5d226@github.com> Message-ID: On Mon, 11 Mar 2024 14:01:10 GMT, Roman Kennke wrote: >> This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`. >> >> Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015. >> >> #### Testing >> >> - tier1-7 (linux-aarch64 and macosx-aarch64) with `-XX:LockingMode=2`. > > src/hotspot/cpu/aarch64/aarch64.ad line 16022: > >> 16020: %} >> 16021: >> 16022: instruct cmpFastLockLightweight(rFlagsReg cr, iRegP object, iRegP box, iRegPNoSp tmp, iRegPNoSp tmp2, iRegPNoSp tmp3) > > Do we need to specify the box register at all, if we never use it? It means that the register allocator assigns an actual register to it, right? 
This could be a problem in workloads that are both locking-intensive *and* with high register pressure. You may just not see it with dacapo, etc, because aarch64 has so many registers to begin with. There is, unfortunately, no ADL construction to specify that an operand such as `box` is not used at all. Yes, the register allocator will assign a register to `box`. Avoiding this would require fairly intrusive changes in C2 (add lightweight locking-specific, single-input versions of the `FastLock` and `FastUnlock` nodes and adapt all the C2 logic that deals with them), which I think would be best addressed in a separate RFE (assuming the additional register pressure is a problem in practice). Furthermore, to my understanding, the box operand is likely to be needed again in the context of Lilliput's [OMWorld](https://bugs.openjdk.org/browse/JDK-8326750) sub-project. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18183#discussion_r1519942947 From kxu at openjdk.org Mon Mar 11 16:42:17 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Mon, 11 Mar 2024 16:42:17 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value Message-ID: This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. 
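As a quick aside (not part of the patch): the identity behind this transform is easy to sanity-check outside C2 with `Integer.compareUnsigned`. The class name below is made up for illustration:

```java
// (x & m) can only clear bits of x, so as an unsigned value it never
// exceeds m; hence ((x & m) u<= m) is always true, for any x and m.
public class MaskCompareDemo {
    public static void main(String[] args) {
        java.util.Random r = new java.util.Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int x = r.nextInt();
            int m = r.nextInt();
            // compareUnsigned(a, b) <= 0 means a u<= b
            if (Integer.compareUnsigned(x & m, m) > 0) {
                throw new AssertionError("counterexample: x=" + x + ", m=" + m);
            }
        }
        System.out.println("ok");
    }
}
```

The same holds for the `((m & x) u<= m)` pattern, since `&` is commutative.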
------------- Commit messages: - add license header - also test for correctness - exclude x86 from tests - refactor (x & m) u<= m transformation and add test Changes: https://git.openjdk.org/jdk/pull/18198/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327381 Stats: 111 lines in 2 files changed: 93 ins; 17 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18198.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18198/head:pull/18198 PR: https://git.openjdk.org/jdk/pull/18198 From epeter at openjdk.org Mon Mar 11 16:42:37 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 11 Mar 2024 16:42:37 GMT Subject: RFR: 8327423: C2 remove_main_post_loops: check if main-loop belongs to pre-loop, not just assert Message-ID: The assert was added in [JDK-8085832](https://bugs.openjdk.org/browse/JDK-8085832) (JDK9), by @rwestrel . And in [JDK-8297724](https://bugs.openjdk.org/browse/JDK-8297724) (JDK21), he made more empty loops be removed, and since then the attached regression test fails. ---------- **Problem** By the time we get to the assert, we already have had a series of Pre-Main-Post, unroll and empty-loop removal: the PURPLE main and post loops are already previously removed as empty-loops. At the time of the assert, the graph looks like this: ![image](https://github.com/openjdk/jdk/assets/32593061/cb36eda4-0684-4b79-8557-0fdd5973ab50) We are in `IdealLoopTree::remove_main_post_loops` with the PURPLE `298 CountedLoop` as the `cl` pre-loop. 
The loop-tree looks essentially like this: (rr) p _ltree_root->dump() Loop: N0/N0 has_sfpt Loop: N425/N431 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre sfpts={ 429 } Loop: N298/N301 profile_predicated predicated counted [0,int),+1 (4 iters) pre Loop: N200/N179 counted [int,100),+1 (2147483648 iters) main sfpts={ 171 } Loop: N398/N404 counted [int,100),+1 (4 iters) post sfpts={ 402 } This is basically: 415 pre orange 298 pre PURPLE 200 main orange 398 post orange From `298 pre PURPLE`, we try to find its main-loop, by looking at the `_next` info in the loop-tree. There, we find `200 main orange`, which is a main-loop that still has a pre-loop... ...but not the same pre-loop as `cl` -> the `assert` fires. It seems that we assume in the code that we can check `_next->_head`, and if: 1) it is a main-loop and 2) that main-loop still has a pre-loop then the current pre-loop "cl" must be the pre-loop `_pre_from_main(main_head)` of that found main-loop. But this is NOT generally guaranteed by "PhaseIdealLoop::build_loop_tree". The loop-tree is correct here, and this is how it was arrived at: "415 CountedLoop" (pre orange) is visited, and its body traversed. "427 If" is traversed. Now the path splits. If we first took the "428 IfFalse" path, then we would visit "200 CountedLoop" (main orange), and "398 CountedLoop" (post orange) first. But we instead take "432 IfTrue" first, and hence visit "298 CountedLoop" (pre PURPLE) first. So depending on what turn we take at this "427 If", we either get the order: 415 pre orange 298 pre PURPLE 200 main orange 398 post orange (the one we get, and assert with) OR 415 pre orange 200 main orange 398 post orange 298 pre PURPLE (assert would not trigger, since we would have "_next == nullptr" and return) -------- **Solution** We need to convert the `assert` into a condition. If the condition fails, we have no main-loop, and can just return from `remove_main_post_loops`.
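Purely as an illustration of the order-dependence described above (a toy sketch, not C2 code; all names are invented): the sibling order a DFS produces depends only on which successor of the branch point is visited first.

```java
import java.util.*;

// Toy DFS over a tiny "control flow": node "if" has two successors.
// Visiting them in different orders yields different sibling orders
// in the produced list, analogous to the two loop-tree orders above.
public class DfsOrderDemo {
    static List<String> dfs(Map<String, List<String>> succ, String start) {
        List<String> order = new ArrayList<>();
        Deque<String> stack = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            String n = stack.pop();
            if (!seen.add(n)) continue;
            order.add(n);
            List<String> s = succ.getOrDefault(n, List.of());
            // push in reverse so the first listed successor is visited first
            for (int i = s.size() - 1; i >= 0; i--) stack.push(s.get(i));
        }
        return order;
    }

    public static void main(String[] args) {
        // "if" branches to either the purple pre-loop or the orange main loop
        Map<String, List<String>> a = Map.of("if", List.of("prePurple", "mainOrange"));
        Map<String, List<String>> b = Map.of("if", List.of("mainOrange", "prePurple"));
        System.out.println(dfs(a, "if")); // [if, prePurple, mainOrange]
        System.out.println(dfs(b, "if")); // [if, mainOrange, prePurple]
    }
}
```

In the first order the purple pre-loop ends up adjacent to the orange main-loop, which is exactly the situation the assert did not anticipate.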
Note: if we have an empty pre-loop that still has a main-loop, then the loop-tree must have the pre-loop and main-loop adjacent, i.e. you can get from the pre-loop to its main-loop via `_next`. That is because the `build_loop_tree` traversal cannot take any other path. ------------- Commit messages: - 8327423 Changes: https://git.openjdk.org/jdk/pull/18200/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18200&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327423 Stats: 64 lines in 2 files changed: 61 ins; 2 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18200.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18200/head:pull/18200 PR: https://git.openjdk.org/jdk/pull/18200 From duke at openjdk.org Mon Mar 11 16:45:03 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 11 Mar 2024 16:45:03 GMT Subject: Integrated: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions In-Reply-To: References: Message-ID: On Thu, 15 Feb 2024 18:42:49 GMT, Srinivas Vamsi Parasa wrote: > The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. > > This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) > > This PR shows up to 19x speedup on buffer sizes of 1MB. This pull request has now been integrated.
Changeset: 18de9321 Author: vamsi-parasa Committer: Sandhya Viswanathan URL: https://git.openjdk.org/jdk/commit/18de9321ce8722f244594b1ed3b62cd1421a7994 Stats: 824 lines in 10 files changed: 814 ins; 0 del; 10 mod 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions Reviewed-by: sviswanathan, jbhateja ------------- PR: https://git.openjdk.org/jdk/pull/17881 From never at openjdk.org Mon Mar 11 16:48:53 2024 From: never at openjdk.org (Tom Rodriguez) Date: Mon, 11 Mar 2024 16:48:53 GMT Subject: RFR: 8327790: Improve javadoc for ResolvedJavaType.hasFinalizableSubclass In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 13:02:00 GMT, Doug Simon wrote: > This PR fixes and clarifies the javadoc for `ResolvedJavaType.hasFinalizableSubclass()`. Marked as reviewed by never (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/18192#pullrequestreview-1928410210 From kxu at openjdk.org Mon Mar 11 16:52:07 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Mon, 11 Mar 2024 16:52:07 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v2] In-Reply-To: References: Message-ID: > This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) > > Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. > > New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. 
Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: fix test by adding the missing inversion also excluding negative values for unsigned comparison ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18198/files - new: https://git.openjdk.org/jdk/pull/18198/files/aa7fafb8..17a9dc37 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=00-01 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/18198.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18198/head:pull/18198 PR: https://git.openjdk.org/jdk/pull/18198 From never at openjdk.org Mon Mar 11 16:59:52 2024 From: never at openjdk.org (Tom Rodriguez) Date: Mon, 11 Mar 2024 16:59:52 GMT Subject: RFR: 8327790: Improve javadoc for ResolvedJavaType.hasFinalizableSubclass In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 13:02:00 GMT, Doug Simon wrote: > This PR fixes and clarifies the javadoc for `ResolvedJavaType.hasFinalizableSubclass()`. Fix is a trivial improvement to JavaDoc. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18192#issuecomment-1988953716 From dnsimon at openjdk.org Mon Mar 11 17:08:13 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 11 Mar 2024 17:08:13 GMT Subject: RFR: 8327790: Improve javadoc for ResolvedJavaType.hasFinalizableSubclass [v2] In-Reply-To: References: Message-ID: > This PR fixes and clarifies the javadoc for `ResolvedJavaType.hasFinalizableSubclass()`. 
Doug Simon has updated the pull request incrementally with one additional commit since the last revision: update year ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18192/files - new: https://git.openjdk.org/jdk/pull/18192/files/0df2978b..241c9b9c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18192&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18192&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18192.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18192/head:pull/18192 PR: https://git.openjdk.org/jdk/pull/18192 From dnsimon at openjdk.org Mon Mar 11 17:08:13 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 11 Mar 2024 17:08:13 GMT Subject: RFR: 8327790: Improve javadoc for ResolvedJavaType.hasFinalizableSubclass In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 13:02:00 GMT, Doug Simon wrote: > This PR fixes and clarifies the javadoc for `ResolvedJavaType.hasFinalizableSubclass()`. Thanks for the reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18192#issuecomment-1988972833 From dnsimon at openjdk.org Mon Mar 11 17:08:13 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 11 Mar 2024 17:08:13 GMT Subject: Integrated: 8327790: Improve javadoc for ResolvedJavaType.hasFinalizableSubclass In-Reply-To: References: Message-ID: <55yo2J9eifZi68SObJT7bF2R9jt2SpSdeAVdbJeNVQI=.e18cfd13-bcbf-4272-a15e-bee3941887a3@github.com> On Mon, 11 Mar 2024 13:02:00 GMT, Doug Simon wrote: > This PR fixes and clarifies the javadoc for `ResolvedJavaType.hasFinalizableSubclass()`. This pull request has now been integrated. 
Changeset: b9bc31f7 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/b9bc31f7206bfde3d27be01adec9a658e086b86e Stats: 6 lines in 1 file changed: 2 ins; 0 del; 4 mod 8327790: Improve javadoc for ResolvedJavaType.hasFinalizableSubclass Reviewed-by: gdub, never ------------- PR: https://git.openjdk.org/jdk/pull/18192 From dlong at openjdk.org Mon Mar 11 19:22:12 2024 From: dlong at openjdk.org (Dean Long) Date: Mon, 11 Mar 2024 19:22:12 GMT Subject: RFR: 8327661: C1: Make RBP allocatable on x64 when PreserveFramePointer is disabled In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 11:12:53 GMT, Denghui Dong wrote: > Hi, > > Could I have a review of this change that makes RBP allocatable in c1 register allocation when PreserveFramePointer is not enabled. > > There seems no reason that RBP cannot be used. Although the performance of c1 jit code is not very critical, in my opinion, this change will not add overhead of compilation. So maybe it is acceptable. > > I am not very sure if I have changed all the places that should be. > > Performance: > > I wrote a simple JMH included in this patch. > > On Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz > > Before this change: > > > Benchmark Mode Cnt Score Error Units > C1PreserveFramePointer.WithPreserveFramePointer.calculate avgt 16 15.270 ? 0.011 ns/op > C1PreserveFramePointer.WithoutPreserveFramePointer.calculate avgt 16 14.479 ? 0.012 ns/op > > > After this change: > > > Benchmark Mode Cnt Score Error Units > C1PreserveFramePointer.WithPreserveFramePointer.calculate avgt 16 15.264 ? 0.006 ns/op > C1PreserveFramePointer.WithoutPreserveFramePointer.calculate avgt 16 14.057 ? 0.005 ns/op > > > > Testing: fastdebug tier1-4 on Linux x64 src/hotspot/cpu/x86/c1_Defs_x86.hpp line 47: > 45: > 46: #ifdef _LP64 > 47: #define UNALLOCATED 3 // rsp, r15, r10 This affects pd_nof_caller_save_cpu_regs_frame_map below, but RBP is callee-saved, not caller-saved. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18167#discussion_r1520305676 From kvn at openjdk.org Mon Mar 11 20:04:12 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 11 Mar 2024 20:04:12 GMT Subject: RFR: 8323242: Remove vestigial DONT_USE_REGISTER_DEFINES In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 08:07:19 GMT, Koichi Sakata wrote: > This pull request removes an unnecessary directive. > > There is no definition of DONT_USE_REGISTER_DEFINES in HotSpot or the build system, so this `#ifndef` conditional directive is always true. We can remove it. > > I built OpenJDK with Zero VM as a test. It was successful. > > > $ ./configure --with-jvm-variants=zero --enable-debug > $ make images > $ ./build/macosx-aarch64-zero-fastdebug/jdk/bin/java -version > openjdk version "23-internal" 2024-09-17 > OpenJDK Runtime Environment (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk) > OpenJDK 64-Bit Zero VM (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk, interpreted mode) > > > It may be possible to remove the `#define noreg` as well because the CONSTANT_REGISTER_DECLARATION macro creates a variable named noreg, but I can't be sure. When I tried removing the noreg definition and building the OpenJDK, the build was successful. This was from these changes [JDK-8000780](https://github.com/openjdk/jdk/commit/e184d5cc4ec66640366d2d30d8dfaba74a1003a7) Maybe @rkennke remembers why he added it. Maybe for some debugging purpose.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18115#issuecomment-1989328275 From jkarthikeyan at openjdk.org Mon Mar 11 21:16:15 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 11 Mar 2024 21:16:15 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v2] In-Reply-To: References: Message-ID: <7b1BIvQpmoLhSzWqQ7haDBTQU1NDuddEm1TK7AgWnwY=.0e5222cc-b20d-4a19-94db-9cad00c6dbff@github.com> On Mon, 11 Mar 2024 16:52:07 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. > > Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: > > fix test by adding the missing inversion > > also excluding negative values for unsigned comparison I think the cleanup looks good! I have mostly stylistic suggestions here. Also, the copyright header in `subnode.cpp` should be updated to read 2024. src/hotspot/share/opto/subnode.cpp line 1808: > 1806: // based on local information. If the input is constant, do it. > 1807: const Type* BoolNode::Value(PhaseGVN* phase) const { > 1808: Node *cmp = in(1); It's preferred to use `Type*` for pointer types, so this `Node *var` (and the others below) should be `Node* var`. 
src/hotspot/share/opto/subnode.cpp line 1809: > 1807: const Type* BoolNode::Value(PhaseGVN* phase) const { > 1808: Node *cmp = in(1); > 1809: if (cmp && cmp->is_Sub()) { Suggestion: if (cmp != nullptr && cmp->is_Sub()) { The `cmp` condition should be `cmp != nullptr`, to make it more clear what is being compared. test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java line 38: > 36: * @summary > 37: * @library /test/lib / > 38: * @run driver compiler.c2.TestBoolNodeGvn I think it'd be better to move the test to `c2.irTests`, so that it's grouped with other IR tests. Also, it would be good to add a `@bug` tag and fill out the `@summary` tag. ------------- PR Review: https://git.openjdk.org/jdk/pull/18198#pullrequestreview-1929183892 PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1520397799 PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1520394190 PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1520411065 From duke at openjdk.org Mon Mar 11 21:57:31 2024 From: duke at openjdk.org (Oussama Louati) Date: Mon, 11 Mar 2024 21:57:31 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v7] In-Reply-To: References: Message-ID: > Completion of the first version of the migration for several tests. > > These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: > > - Generate constant pool entries filled with method handles and method types. > - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. > - Produce many invokedynamic instructions with a specific constant pool entry. 
Oussama Louati has updated the pull request incrementally with one additional commit since the last revision: Converted regression Test for "MethodType leaks memory to use Classfile API ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17834/files - new: https://git.openjdk.org/jdk/pull/17834/files/5fd2d743..74c14dd4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=05-06 Stats: 20 lines in 2 files changed: 5 ins; 5 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/17834.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834 PR: https://git.openjdk.org/jdk/pull/17834 From duke at openjdk.org Mon Mar 11 22:04:24 2024 From: duke at openjdk.org (Oussama Louati) Date: Mon, 11 Mar 2024 22:04:24 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v8] In-Reply-To: References: Message-ID: <_ZWd6EXhs0Wu4aHS48eHuC0M-Hy-ZabVdRMjH-KbT0Y=.95ff4c3e-85c4-4102-96f2-dba7a52e7251@github.com> > Completion of the first version of the migration for several tests. > > These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: > > - Generate constant pool entries filled with method handles and method types. > - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. > - Produce many invokedynamic instructions with a specific constant pool entry. 
Oussama Louati has updated the pull request incrementally with one additional commit since the last revision:

  Use Java 11 version for class generation in regression test

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/17834/files
  - new: https://git.openjdk.org/jdk/pull/17834/files/74c14dd4..be3f49b6

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=07
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=06-07

Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/17834.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834

PR: https://git.openjdk.org/jdk/pull/17834

From dlong at openjdk.org Mon Mar 11 23:18:13 2024
From: dlong at openjdk.org (Dean Long)
Date: Mon, 11 Mar 2024 23:18:13 GMT
Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect
In-Reply-To: 
References: 
Message-ID: 

On Mon, 11 Mar 2024 09:26:48 GMT, Roberto Castañeda Lozano wrote:

> This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`.
>
> Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015.
>
> #### Testing
>
> - tier1-7 (linux-aarch64 and macosx-aarch64) with `-XX:LockingMode=2`.
src/hotspot/cpu/aarch64/aarch64.ad line 16026:

> 16024: predicate(LockingMode == LM_LIGHTWEIGHT);
> 16025: match(Set cr (FastLock object box));
> 16026: effect(TEMP tmp, TEMP tmp2, TEMP tmp3);

Why not use `box` as the temp instead of introducing a separate temp?
Suggestion:

effect(TEMP tmp, TEMP tmp2, TEMP box);

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18183#discussion_r1520532174

From gcao at openjdk.org Tue Mar 12 01:29:24 2024
From: gcao at openjdk.org (Gui Cao)
Date: Tue, 12 Mar 2024 01:29:24 GMT
Subject: RFR: 8327716: RISC-V: Change type of vector_length param of several assembler functions from int to uint
In-Reply-To: 
References: 
Message-ID: 

On Mon, 11 Mar 2024 02:53:12 GMT, Fei Yang wrote:

>> Hi, we noticed that the return type of Matcher::vector_length is uint, but the type of vector_length param of several assembler functions is int, which is not consistent. This should not affect functionality, but we should change the type of vector_length param of several assembler functions from int to uint to make the code clean.
>>
>> ### Tests
>> - [x] Run tier1-3 tests on LicheePI 4A (release)
>> - [x] Run tier1-3 tests with -XX:+UseRVV on qemu 8.1.0 (release)
>
> Looks good. Thanks!

@RealFYang : Thanks for your review.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18175#issuecomment-1989739973

From gcao at openjdk.org Tue Mar 12 01:32:20 2024
From: gcao at openjdk.org (Gui Cao)
Date: Tue, 12 Mar 2024 01:32:20 GMT
Subject: Integrated: 8327716: RISC-V: Change type of vector_length param of several assembler functions from int to uint
In-Reply-To: 
References: 
Message-ID: 

On Sat, 9 Mar 2024 09:39:38 GMT, Gui Cao wrote:

> Hi, we noticed that the return type of Matcher::vector_length is uint, but the type of vector_length param of several assembler functions is int, which is not consistent.
This should not affect functionality, but we should change the type of vector_length param of several assembler functions from int to uint to make the code clean.
>
> ### Tests
> - [x] Run tier1-3 tests on LicheePI 4A (release)
> - [x] Run tier1-3 tests with -XX:+UseRVV on qemu 8.1.0 (release)

This pull request has now been integrated.

Changeset: 4d6235ed
Author: Gui Cao
Committer: Fei Yang
URL: https://git.openjdk.org/jdk/commit/4d6235ed111178d31814763b0d23e372db2b3e1b
Stats: 23 lines in 3 files changed: 0 ins; 0 del; 23 mod

8327716: RISC-V: Change type of vector_length param of several assembler functions from int to uint

Reviewed-by: fyang

-------------

PR: https://git.openjdk.org/jdk/pull/18175

From ksakata at openjdk.org Tue Mar 12 01:49:37 2024
From: ksakata at openjdk.org (Koichi Sakata)
Date: Tue, 12 Mar 2024 01:49:37 GMT
Subject: RFR: 8320404: Double whitespace in SubTypeCheckNode::dump_spec output
Message-ID: 

This is a trivial change to remove an extra whitespace.

A double whitespace is printed because method->print_short_name already adds a whitespace before the name.

### Test

For testing, I modified the ProfileAtTypeCheck class to fail a test case and display the message. Specifically, I changed the number of the count element in the IR annotation below.

@Test
@IR(phase = { CompilePhase.AFTER_PARSING }, counts = { IRNode.SUBTYPE_CHECK, "1" })
@IR(phase = { CompilePhase.AFTER_MACRO_EXPANSION }, counts = { IRNode.CMP_P, "5", IRNode.LOAD_KLASS_OR_NKLASS, "2", IRNode.PARTIAL_SUBTYPE_CHECK, "1" })
public static void test15(Object o) {

This change was only for testing, so I reverted to the original code after the test.

#### Execution Result

Before the change:

$ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java"
...
Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "After Parsing":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\d+(\s){2}(SubTypeCheck.*)+(\s){2}===.*)"
           - Failed comparison: [found] 1 = 11 [given]
           - Matched node:
             * 53 SubTypeCheck === _ 44 35 [[ 58 ]] profiled at: compiler.c2.irTests.ProfileAtTypeCheck::test15:5 !jvms: ProfileAtTypeCheck::test15 @ bci:5 (line 399)

After the change:

$ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java"
...

Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "After Parsing":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\d+(\s){2}(SubTypeCheck.*)+(\s){2}===.*)"
           - Failed comparison: [found] 1 = 11 [given]
           - Matched node:
             * 53 SubTypeCheck === _ 44 35 [[ 58 ]] profiled at: compiler.c2.irTests.ProfileAtTypeCheck::test15:5 !jvms: ProfileAtTypeCheck::test15 @ bci:5 (line 399)

I was able to confirm that the issue has been corrected.
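The root cause described above (the label printed by dump_spec already ends in a space, and print_short_name prepends another one) can be reproduced with a small standalone sketch. The class and method names below are invented for illustration; this is not the actual HotSpot code:

```java
// Illustrative sketch of the double-whitespace bug: the dump label ends in a
// space and the name-printing helper prepends its own. Names are made up.
class DoubleSpaceSketch {
    // Mimics method->print_short_name, which emits a leading space itself.
    static String printShortName(String name) {
        return " " + name;
    }

    // Before the fix: label has a trailing space -> two spaces in the output.
    static String dumpSpecBefore(String name) {
        return "profiled at: " + printShortName(name);
    }

    // After the fix: the trailing space is dropped; the helper supplies it.
    static String dumpSpecAfter(String name) {
        return "profiled at:" + printShortName(name);
    }

    public static void main(String[] args) {
        System.out.println(dumpSpecBefore("ProfileAtTypeCheck::test15")); // two spaces after the colon
        System.out.println(dumpSpecAfter("ProfileAtTypeCheck::test15"));  // single space after the colon
    }
}
```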
------------- Commit messages: - Remove an extra whitespace Changes: https://git.openjdk.org/jdk/pull/18181/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18181&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8320404 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18181.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18181/head:pull/18181 PR: https://git.openjdk.org/jdk/pull/18181 From kvn at openjdk.org Tue Mar 12 02:52:12 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 12 Mar 2024 02:52:12 GMT Subject: RFR: 8327423: C2 remove_main_post_loops: check if main-loop belongs to pre-loop, not just assert In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 15:56:38 GMT, Emanuel Peter wrote: > The assert was added in [JDK-8085832](https://bugs.openjdk.org/browse/JDK-8085832) (JDK9), by @rwestrel . And in [JDK-8297724](https://bugs.openjdk.org/browse/JDK-8297724) (JDK21), he made more empty loops be removed, and since then the attached regression test fails. > > ---------- > > **Problem** > > By the time we get to the assert, we already have had a series of Pre-Main-Post, unroll and empty-loop removal: > the PURPLE main and post loops are already previously removed as empty-loops. > > At the time of the assert, the graph looks like this: > ![image](https://github.com/openjdk/jdk/assets/32593061/cb36eda4-0684-4b79-8557-0fdd5973ab50) > > We are in `IdealLoopTree::remove_main_post_loops` with the PURPLE `298 CountedLoop` as the `cl` pre-loop. 
>
> The loop-tree looks essentially like this:
>
> (rr) p _ltree_root->dump()
> Loop: N0/N0 has_sfpt
> Loop: N425/N431 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre sfpts={ 429 }
> Loop: N298/N301 profile_predicated predicated counted [0,int),+1 (4 iters) pre
> Loop: N200/N179 counted [int,100),+1 (2147483648 iters) main sfpts={ 171 }
> Loop: N398/N404 counted [int,100),+1 (4 iters) post sfpts={ 402 }
>
>
> This is basically:
>
> 415 pre orange
> 298 pre PURPLE
> 200 main orange
> 398 post orange
>
>
> From `298 pre PURPLE`, we try to find its main-loop, by looking at the `_next` info in the loop-tree.
> There, we find `200 main orange`; it is a main-loop that still has a pre-loop...
> ...but not the same pre-loop as `cl` -> the `assert` fires.
>
> It seems that we assume in the code, that we can check the `_next->_head`, and if:
> 1) it is a main-loop and
> 2) that main-loop still has a pre-loop
> then the current pre-loop "cl" must be the pre-loop of that found main-loop, `_pre_from_main(main_head)`.
> But this is NOT generally guaranteed by "PhaseIdealLoop::build_loop_tree".
>
> The loop-tree is correct here, and this is how it was arrived at:
> "415 CountedLoop" (pre orange) is visited, and its body traversed. "427 If" is traversed. Now the path splits.
> If we first took the "428 IfFalse" path, then we would visit "200 CountedLoop" (main orange), and "398 CountedLoop" (post orange) first.
> But we instead take "432 IfTrue" first, and hence visit "298 CountedLoop" (pre PURPLE) first.
>
> So depending on what turn we take at this "427 If", we either get the order:
>
>
> 415 pre orange
> 298 pre PURPLE
> 200 main orange
> 398 post orange
>
> (the one we get, and assert with)
>
> OR
>
>
> 415 pre orange
> 200 main orange
> 398 post orange
> 298 pre PURPLE
>
> (assert would not tr...

I agree with fix.

-------------

Marked as reviewed by kvn (Reviewer).
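For readers less familiar with the pre/main/post terminology used throughout this thread, the shape that C2's pre-main-post transformation gives a counted loop can be sketched in plain Java roughly as follows. This is purely an illustrative sketch, not C2 code, and the unroll factor of 4 is arbitrary:

```java
// Illustrative sketch of the pre/main/post loop structure produced by C2's
// pre-main-post transformation (not actual C2 code; names are made up).
class PreMainPost {
    // Original loop: one counted loop over the whole range.
    static void original(int[] a) {
        for (int i = 0; i < a.length; i++) {
            a[i] += 1;
        }
    }

    // After pre-main-post: a short pre-loop, an unrolled main loop,
    // and a post-loop that handles the leftover iterations.
    static void split(int[] a) {
        int i = 0;
        for (; i < Math.min(4, a.length); i++) {   // pre-loop
            a[i] += 1;
        }
        for (; i + 3 < a.length; i += 4) {         // main loop, unrolled by 4
            a[i] += 1; a[i + 1] += 1; a[i + 2] += 1; a[i + 3] += 1;
        }
        for (; i < a.length; i++) {                // post-loop
            a[i] += 1;
        }
    }

    public static void main(String[] args) {
        int[] x = new int[103];
        int[] y = new int[103];
        original(x);
        split(y);
        System.out.println(java.util.Arrays.equals(x, y)); // true
    }
}
```

When a loop body turns out to be empty, any of these three loops can be removed independently, which is how the situation discussed above (a pre-loop whose main and post loops are already gone) arises.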
PR Review: https://git.openjdk.org/jdk/pull/18200#pullrequestreview-1929992678 From galder at openjdk.org Tue Mar 12 05:35:12 2024 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 12 Mar 2024 05:35:12 GMT Subject: RFR: 8327361: Update some comments after JDK-8139457 [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 11:41:20 GMT, Roman Kennke wrote: >> A follow-up to [8139457: Relax alignment of array elements](https://bugs.openjdk.org/browse/JDK-8139457) ([PR](https://git.openjdk.org/jdk/pull/11044)), some comments need an update. > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > RISCV changes Marked as reviewed by galder (Author). ------------- PR Review: https://git.openjdk.org/jdk/pull/18120#pullrequestreview-1930137852 From galder at openjdk.org Tue Mar 12 05:35:13 2024 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 12 Mar 2024 05:35:13 GMT Subject: RFR: 8327361: Update some comments after JDK-8139457 [v2] In-Reply-To: References: Message-ID: On Thu, 7 Mar 2024 14:37:19 GMT, Roman Kennke wrote: >> I think the changes look fine, but looking closer to the original PR, src/hotspot/cpu/riscv/c1_MacroAssembler_riscv.hpp might also need adjusting. s390 and ppc are probably just fine. > > @galderz is it ok now? I assume it counts as trivial, too? @rkennke Yeah, ok now. Trivial too. 
-------------

PR Comment: https://git.openjdk.org/jdk/pull/18120#issuecomment-1990584589

From aboldtch at openjdk.org Tue Mar 12 07:28:13 2024
From: aboldtch at openjdk.org (Axel Boldt-Christmas)
Date: Tue, 12 Mar 2024 07:28:13 GMT
Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect
In-Reply-To: 
References: 
Message-ID: 

On Mon, 11 Mar 2024 23:15:50 GMT, Dean Long wrote:

>> This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`.
>>
>> Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015.
>>
>> #### Testing
>>
>> - tier1-7 (linux-aarch64 and macosx-aarch64) with `-XX:LockingMode=2`.
>
> src/hotspot/cpu/aarch64/aarch64.ad line 16026:
>
>> 16024: predicate(LockingMode == LM_LIGHTWEIGHT);
>> 16025: match(Set cr (FastLock object box));
>> 16026: effect(TEMP tmp, TEMP tmp2, TEMP tmp3);
>
> Why not use `box` as the temp instead of introducing a separate temp?
> Suggestion:
>
> effect(TEMP tmp, TEMP tmp2, TEMP box);

Making an input TEMP will crash inside C2 when doing register allocation.
assert(opcnt < numopnds) failed: Accessing non-existent operand V [libjvm.dylib+0xcc650c] MachNode::in_RegMask(unsigned int) const+0x1f0 V [libjvm.dylib+0x3d136c] PhaseChaitin::gather_lrg_masks(bool)+0x1130 V [libjvm.dylib+0x3cef9c] PhaseChaitin::Register_Allocate()+0x150 V [libjvm.dylib+0x4c3860] Compile::Code_Gen()+0x1f4 V [libjvm.dylib+0x4c17a4] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1388 V [libjvm.dylib+0x38abf0] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1e0 V [libjvm.dylib+0x4df028] CompileBroker::invoke_compiler_on_method(CompileTask*)+0x854 V [libjvm.dylib+0x4de46c] CompileBroker::compiler_thread_loop()+0x348 V [libjvm.dylib+0x8c10e4] JavaThread::thread_main_inner()+0x1dc V [libjvm.dylib+0x117f7f8] Thread::call_run()+0xf4 V [libjvm.dylib+0xe53724] thread_native_entry(Thread*)+0x138 C [libsystem_pthread.dylib+0x7034] _pthread_start+0x88 Maybe this can be resolved and support for TEMP input registers can be added. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18183#discussion_r1520969175 From epeter at openjdk.org Tue Mar 12 07:30:23 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 07:30:23 GMT Subject: RFR: 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence) [v6] In-Reply-To: <3F6Lk5fgyufLqS9TwuZrNkiIuSCRzuSHKB2BD0cf2Ws=.c4958e55-8384-451a-a505-a082c9c3ed7e@github.com> References: <3ELCyRCBgVgCAApIxalvVinXQLbSdv-UK8_aHgbWLhA=.f2908c00-cdbb-4b82-a36d-8a1a21f2647b@github.com> <3F6Lk5fgyufLqS9TwuZrNkiIuSCRzuSHKB2BD0cf2Ws=.c4958e55-8384-451a-a505-a082c9c3ed7e@github.com> Message-ID: <42HQ9nFhr5bgo6gTtOms4H9H-C9K5FhoB0qrQFA1Hzo=.e6f06954-fcaa-4ed3-8865-092dbd5fb35a@github.com> On Wed, 28 Feb 2024 18:11:28 GMT, Vladimir Kozlov wrote: >> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 33 commits:
>>
>> - Merge branch 'master' into JDK-8309267
>> - Apply suggestions for comments by Vladimir
>> - Update LoopArrayIndexComputeTest.java copyright year
>> - Update src/hotspot/share/opto/superword.cpp
>> - SplitStatus::Kind enum
>> - SplitTask::Kind enum
>> - manual merge
>> - more fixes for TestSplitPacks.java
>> - fix some IR rules in TestSplitPacks.java
>> - fix MulAddS2I
>> - ... and 23 more: https://git.openjdk.org/jdk/compare/de428daf...77e3d47a
>
> Looks good.

Thanks @vnkozlov @iwanowww @chhagedorn for the reviews!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17848#issuecomment-1990939531

From epeter at openjdk.org Tue Mar 12 07:30:24 2024
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 12 Mar 2024 07:30:24 GMT
Subject: Integrated: 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence)
In-Reply-To: 
References: 
Message-ID: 

On Wed, 14 Feb 2024 15:10:18 GMT, Emanuel Peter wrote:

> After `combine_pairs_to_longer_packs`, we sometimes create packs that are too long and cannot be vectorized.
> There are multiple reasons for that:
>
> - A pack does not "match" with a use or def pack, and we need to split it. Example: split Z:
>
> X X X X Y Y Y Y
> Z Z Z Z Z Z Z Z
>
>
> - A pack is not implemented in the current size, but would be implemented for a smaller size. Some operations are not implemented at max vector size, and we just need to split every pack to the smaller size that is implemented. Or we have a pack of a non-power-of-2 size, and need to split it down into smaller sizes that are power-of-2.
>
> - Packs can have pack internal dependence. This dependence happens at a certain "distance". If we split a pack to be smaller than that "distance", then the internal dependence disappears, and we have the desired mutual independence.
Example: > https://github.com/openjdk/jdk/blob/9ee274b0480a6e8e399830fd40c34d99c5621c1b/test/hotspot/jtreg/compiler/loopopts/superword/TestSplitPacks.java#L657-L669 > > Note: Soon I will refactor the packset into a new `PackSet` class, and move the split / filter code there. > > **Further Work** > > [JDK-8309267](https://bugs.openjdk.org/browse/JDK-8309267) C2 SuperWord: some tests fail on KNL machines - fail to vectorize > The issue on KNL machines is that some operations only support vector widths that are smaller than the maximal vector length. Hence, we must split all other vectors that are uses/defs. This change here does exactly that, and so I will be able to put the `UseKNLSetting` in the IRFramework whitelist. I will do that in a separate RFE. This pull request has now been integrated. Changeset: 251347bd Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/251347bd7e589b51354a2318bfac0c71cd71bf5f Stats: 1265 lines in 5 files changed: 1203 ins; 23 del; 39 mod 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence) Reviewed-by: kvn, vlivanov, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/17848 From chagedorn at openjdk.org Tue Mar 12 08:05:14 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 12 Mar 2024 08:05:14 GMT Subject: RFR: 8327423: C2 remove_main_post_loops: check if main-loop belongs to pre-loop, not just assert In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 15:56:38 GMT, Emanuel Peter wrote: > The assert was added in [JDK-8085832](https://bugs.openjdk.org/browse/JDK-8085832) (JDK9), by @rwestrel . And in [JDK-8297724](https://bugs.openjdk.org/browse/JDK-8297724) (JDK21), he made more empty loops be removed, and since then the attached regression test fails. 
>
> ----------
>
> **Problem**
>
> By the time we get to the assert, we already have had a series of Pre-Main-Post, unroll and empty-loop removal:
> the PURPLE main and post loops are already previously removed as empty-loops.
>
> At the time of the assert, the graph looks like this:
> ![image](https://github.com/openjdk/jdk/assets/32593061/cb36eda4-0684-4b79-8557-0fdd5973ab50)
>
> We are in `IdealLoopTree::remove_main_post_loops` with the PURPLE `298 CountedLoop` as the `cl` pre-loop.
>
> The loop-tree looks essentially like this:
>
> (rr) p _ltree_root->dump()
> Loop: N0/N0 has_sfpt
> Loop: N425/N431 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre sfpts={ 429 }
> Loop: N298/N301 profile_predicated predicated counted [0,int),+1 (4 iters) pre
> Loop: N200/N179 counted [int,100),+1 (2147483648 iters) main sfpts={ 171 }
> Loop: N398/N404 counted [int,100),+1 (4 iters) post sfpts={ 402 }
>
>
> This is basically:
>
> 415 pre orange
> 298 pre PURPLE
> 200 main orange
> 398 post orange
>
>
> From `298 pre PURPLE`, we try to find its main-loop, by looking at the `_next` info in the loop-tree.
> There, we find `200 main orange`; it is a main-loop that still has a pre-loop...
> ...but not the same pre-loop as `cl` -> the `assert` fires.
>
> It seems that we assume in the code, that we can check the `_next->_head`, and if:
> 1) it is a main-loop and
> 2) that main-loop still has a pre-loop
> then the current pre-loop "cl" must be the pre-loop of that found main-loop, `_pre_from_main(main_head)`.
> But this is NOT generally guaranteed by "PhaseIdealLoop::build_loop_tree".
>
> The loop-tree is correct here, and this is how it was arrived at:
> "415 CountedLoop" (pre orange) is visited, and its body traversed. "427 If" is traversed. Now the path splits.
> If we first took the "428 IfFalse" path, then we would visit "200 CountedLoop" (main orange), and "398 CountedLoop" (post orange) first.
> But we instead take "432 IfTrue" first, and hence visit "298 CountedLoop" (pre PURPLE) first.
>
> So depending on what turn we take at this "427 If", we either get the order:
>
>
> 415 pre orange
> 298 pre PURPLE
> 200 main orange
> 398 post orange
>
> (the one we get, and assert with)
>
> OR
>
>
> 415 pre orange
> 200 main orange
> 398 post orange
> 298 pre PURPLE
>
> (assert would not tr...

That looks good to me, too.

test/hotspot/jtreg/compiler/loopopts/TestEmptyPreLoopForDifferentMainLoop.java line 31:

> 29: * -XX:CompileCommand=compileonly,compiler.loopopts.TestEmptyPreLoopForDifferentMainLoop::test
> 30: * compiler.loopopts.TestEmptyPreLoopForDifferentMainLoop
> 31: * @run main/othervm compiler.loopopts.TestEmptyPreLoopForDifferentMainLoop

Suggestion:

 * @run main compiler.loopopts.TestEmptyPreLoopForDifferentMainLoop

-------------

Marked as reviewed by chagedorn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/18200#pullrequestreview-1930337154
PR Review Comment: https://git.openjdk.org/jdk/pull/18200#discussion_r1521008096

From chagedorn at openjdk.org Tue Mar 12 08:13:13 2024
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Tue, 12 Mar 2024 08:13:13 GMT
Subject: RFR: 8320404: Double whitespace in SubTypeCheckNode::dump_spec output
In-Reply-To: 
References: 
Message-ID: 

On Mon, 11 Mar 2024 07:41:29 GMT, Koichi Sakata wrote:

> This is a trivial change to remove an extra whitespace.
>
> A double whitespace is printed because method->print_short_name already adds a whitespace before the name.
>
> ### Test
>
> For testing, I modified the ProfileAtTypeCheck class to fail a test case and display the message. Specifically, I changed the number of the count element in the IR annotation below.
> > > @Test > @IR(phase = { CompilePhase.AFTER_PARSING }, counts = { IRNode.SUBTYPE_CHECK, "1" }) > @IR(phase = { CompilePhase.AFTER_MACRO_EXPANSION }, counts = { IRNode.CMP_P, "5", IRNode.LOAD_KLASS_OR_NKLASS, "2", IRNode.PARTIAL_SUBTYPE_CHECK, "1" }) > public static void test15(Object o) { > > > This change was only for testing, so I reverted back to the original code after the test. > > #### Execution Result > > Before the change: > > $ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java" > ... > Failed IR Rules (1) of Methods (1) > ---------------------------------- > 1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, ap > plyIfAnd={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\d+(\s){2}(SubTypeCheck.*)+(\s){2}===.*)" > - Failed comparison: [found] 1 = 11 [given] > - Matched node: > * 53 SubTypeCheck === _ 44 35 [[ 58 ]] profiled at: compiler.c2.irTests.ProfileAtTypeCheck::test15:5 !jvms: ProfileAtTypeCheck::test15 @ bci:5 (line 399) > > > After the change: > > $ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java" > ... 
> Failed IR Rules (1) of Methods (1) > ---------------------------------- > 1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, ap > plyIfAnd={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Cons... Looks good and trivial. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18181#pullrequestreview-1930353634 From galder at openjdk.org Tue Mar 12 08:22:14 2024 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 12 Mar 2024 08:22:14 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v6] In-Reply-To: <7-xJ8ujbaK_K90zgAFMgdWGkpnN6u8o088Wdt-YCh88=.230e0470-32e0-4da6-a185-65682d4713bb@github.com> References: <7-xJ8ujbaK_K90zgAFMgdWGkpnN6u8o088Wdt-YCh88=.230e0470-32e0-4da6-a185-65682d4713bb@github.com> Message-ID: On Fri, 8 Mar 2024 01:42:03 GMT, Dean Long wrote: > Your front-end changes require back-end changes, which are only implemented for x86 and aarch64. So you need a way to disable this for other platforms, or port the fix to all platforms. Minimizing the amount of platform-specific code required would also help. I'm struggling to understand what it is you think is missing in the PR. I have added the following 2 sections in such a way that they only trigger in x86 and aarch64. 
See [here](https://github.com/openjdk/jdk/pull/17667/files#diff-737789206706361d06d1f120e10272b62bcfdb556e8e73693f94ec87f2a6b369R238) and [here](https://github.com/openjdk/jdk/pull/17667/files#diff-e6f3ae4492965efd0d73c3f31073ec8b77e020740b009f92312658bac1e5f978R356), and as far as I understand it, that's enough to address your concerns. Please let me know if there is something I might have missed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17667#issuecomment-1991014592 From epeter at openjdk.org Tue Mar 12 08:24:29 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 08:24:29 GMT Subject: RFR: 8327423: C2 remove_main_post_loops: check if main-loop belongs to pre-loop, not just assert [v2] In-Reply-To: References: Message-ID: > The assert was added in [JDK-8085832](https://bugs.openjdk.org/browse/JDK-8085832) (JDK9), by @rwestrel . And in [JDK-8297724](https://bugs.openjdk.org/browse/JDK-8297724) (JDK21), he made more empty loops be removed, and since then the attached regression test fails. > > ---------- > > **Problem** > > By the time we get to the assert, we already have had a series of Pre-Main-Post, unroll and empty-loop removal: > the PURPLE main and post loops are already previously removed as empty-loops. > > At the time of the assert, the graph looks like this: > ![image](https://github.com/openjdk/jdk/assets/32593061/cb36eda4-0684-4b79-8557-0fdd5973ab50) > > We are in `IdealLoopTree::remove_main_post_loops` with the PURPLE `298 CountedLoop` as the `cl` pre-loop. 
>
> The loop-tree looks essentially like this:
>
> (rr) p _ltree_root->dump()
> Loop: N0/N0 has_sfpt
> Loop: N425/N431 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre sfpts={ 429 }
> Loop: N298/N301 profile_predicated predicated counted [0,int),+1 (4 iters) pre
> Loop: N200/N179 counted [int,100),+1 (2147483648 iters) main sfpts={ 171 }
> Loop: N398/N404 counted [int,100),+1 (4 iters) post sfpts={ 402 }
>
>
> This is basically:
>
> 415 pre orange
> 298 pre PURPLE
> 200 main orange
> 398 post orange
>
>
> From `298 pre PURPLE`, we try to find its main-loop, by looking at the `_next` info in the loop-tree.
> There, we find `200 main orange`; it is a main-loop that still has a pre-loop...
> ...but not the same pre-loop as `cl` -> the `assert` fires.
>
> It seems that we assume in the code, that we can check the `_next->_head`, and if:
> 1) it is a main-loop and
> 2) that main-loop still has a pre-loop
> then the current pre-loop "cl" must be the pre-loop of that found main-loop, `_pre_from_main(main_head)`.
> But this is NOT generally guaranteed by "PhaseIdealLoop::build_loop_tree".
>
> The loop-tree is correct here, and this is how it was arrived at:
> "415 CountedLoop" (pre orange) is visited, and its body traversed. "427 If" is traversed. Now the path splits.
> If we first took the "428 IfFalse" path, then we would visit "200 CountedLoop" (main orange), and "398 CountedLoop" (post orange) first.
> But we instead take "432 IfTrue" first, and hence visit "298 CountedLoop" (pre PURPLE) first.
>
> So depending on what turn we take at this "427 If", we either get the order:
>
>
> 415 pre orange
> 298 pre PURPLE
> 200 main orange
> 398 post orange
>
> (the one we get, and assert with)
>
> OR
>
>
> 415 pre orange
> 200 main orange
> 398 post orange
> 298 pre PURPLE
>
> (assert would not tr...
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/loopopts/TestEmptyPreLoopForDifferentMainLoop.java Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18200/files - new: https://git.openjdk.org/jdk/pull/18200/files/96116022..81c84dda Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18200&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18200&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18200.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18200/head:pull/18200 PR: https://git.openjdk.org/jdk/pull/18200 From roland at openjdk.org Tue Mar 12 08:37:12 2024 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 12 Mar 2024 08:37:12 GMT Subject: RFR: 8327423: C2 remove_main_post_loops: check if main-loop belongs to pre-loop, not just assert [v2] In-Reply-To: References: Message-ID: <-XVbBb-rqIblr2ytrCprcBv7Kg_TW4lpgeE8ZACVIuw=.e77a67bd-d038-4a08-9969-4ce6d3e27309@github.com> On Tue, 12 Mar 2024 08:24:29 GMT, Emanuel Peter wrote: >> The assert was added in [JDK-8085832](https://bugs.openjdk.org/browse/JDK-8085832) (JDK9), by @rwestrel . And in [JDK-8297724](https://bugs.openjdk.org/browse/JDK-8297724) (JDK21), he made more empty loops be removed, and since then the attached regression test fails. >> >> ---------- >> >> **Problem** >> >> By the time we get to the assert, we already have had a series of Pre-Main-Post, unroll and empty-loop removal: >> the PURPLE main and post loops are already previously removed as empty-loops. >> >> At the time of the assert, the graph looks like this: >> ![image](https://github.com/openjdk/jdk/assets/32593061/cb36eda4-0684-4b79-8557-0fdd5973ab50) >> >> We are in `IdealLoopTree::remove_main_post_loops` with the PURPLE `298 CountedLoop` as the `cl` pre-loop. 
>>
>> The loop-tree looks essentially like this:
>>
>> (rr) p _ltree_root->dump()
>> Loop: N0/N0 has_sfpt
>> Loop: N425/N431 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre sfpts={ 429 }
>> Loop: N298/N301 profile_predicated predicated counted [0,int),+1 (4 iters) pre
>> Loop: N200/N179 counted [int,100),+1 (2147483648 iters) main sfpts={ 171 }
>> Loop: N398/N404 counted [int,100),+1 (4 iters) post sfpts={ 402 }
>>
>>
>> This is basically:
>>
>> 415 pre orange
>> 298 pre PURPLE
>> 200 main orange
>> 398 post orange
>>
>>
>> From `298 pre PURPLE`, we try to find its main-loop, by looking at the `_next` info in the loop-tree.
>> There, we find `200 main orange`; it is a main-loop that still has a pre-loop...
>> ...but not the same pre-loop as `cl` -> the `assert` fires.
>>
>> It seems that we assume in the code, that we can check the `_next->_head`, and if:
>> 1) it is a main-loop and
>> 2) that main-loop still has a pre-loop
>> then the current pre-loop "cl" must be the pre-loop of that found main-loop, `_pre_from_main(main_head)`.
>> But this is NOT generally guaranteed by "PhaseIdealLoop::build_loop_tree".
>>
>> The loop-tree is correct here, and this is how it was arrived at:
>> "415 CountedLoop" (pre orange) is visited, and its body traversed. "427 If" is traversed. Now the path splits.
>> If we first took the "428 IfFalse" path, then we would visit "200 CountedLoop" (main orange), and "398 CountedLoop" (post orange) first.
>> But we instead take "432 IfTrue" first, and hence visit "298 CountedLoop" (pre PURPLE) first.
>>
>> So depending on what turn we take at this "427 If", we either get the order:
>>
>>
>> 415 pre orange
>> 298 pre PURPLE
>> 200 main orange
>> 398 post orange
>>
>> (the one w...
> > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/compiler/loopopts/TestEmptyPreLoopForDifferentMainLoop.java > > Co-authored-by: Christian Hagedorn Looks good to me. ------------- Marked as reviewed by roland (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18200#pullrequestreview-1930405522 From yzheng at openjdk.org Tue Mar 12 10:49:20 2024 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 12 Mar 2024 10:49:20 GMT Subject: RFR: 8327964: Simplify BigInteger.implMultiplyToLen intrinsic Message-ID: Moving array construction within BigInteger.implMultiplyToLen intrinsic candidate to its caller simplifies the intrinsic implementation in JIT compiler. ------------- Commit messages: - Simplify BigInteger.implMultiplyToLen intrinsic Changes: https://git.openjdk.org/jdk/pull/18226/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18226&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327964 Stats: 53 lines in 2 files changed: 4 ins; 49 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18226.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18226/head:pull/18226 PR: https://git.openjdk.org/jdk/pull/18226 From shade at openjdk.org Tue Mar 12 12:02:15 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 12 Mar 2024 12:02:15 GMT Subject: RFR: 8327361: Update some comments after JDK-8139457 [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 11:41:20 GMT, Roman Kennke wrote: >> A follow-up to [8139457: Relax alignment of array elements](https://bugs.openjdk.org/browse/JDK-8139457) ([PR](https://git.openjdk.org/jdk/pull/11044)), some comments need an update. > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > RISCV changes Marked as reviewed by shade (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/18120#pullrequestreview-1930857425 From rkennke at openjdk.org Tue Mar 12 12:10:21 2024 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 12 Mar 2024 12:10:21 GMT Subject: RFR: 8327361: Update some comments after JDK-8139457 [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 11:41:20 GMT, Roman Kennke wrote: >> A follow-up to [8139457: Relax alignment of array elements](https://bugs.openjdk.org/browse/JDK-8139457) ([PR](https://git.openjdk.org/jdk/pull/11044)), some comments need an update. > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > RISCV changes Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/18120#issuecomment-1991500115 From rkennke at openjdk.org Tue Mar 12 12:10:22 2024 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 12 Mar 2024 12:10:22 GMT Subject: Integrated: 8327361: Update some comments after JDK-8139457 In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 11:06:48 GMT, Roman Kennke wrote: > A follow-up to [8139457: Relax alignment of array elements](https://bugs.openjdk.org/browse/JDK-8139457) ([PR](https://git.openjdk.org/jdk/pull/11044)), some comments need an update. This pull request has now been integrated. 
Changeset: 5056902e Author: Roman Kennke URL: https://git.openjdk.org/jdk/commit/5056902e767d7f8485f9ff54f26df725f437fb0b Stats: 18 lines in 3 files changed: 0 ins; 0 del; 18 mod 8327361: Update some comments after JDK-8139457 Reviewed-by: galder, shade ------------- PR: https://git.openjdk.org/jdk/pull/18120 From bkilambi at openjdk.org Tue Mar 12 14:49:15 2024 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 12 Mar 2024 14:49:15 GMT Subject: RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 08:16:11 GMT, Bhavana Kilambi wrote: >> Floating-point addition is non-associative, that is, adding floating-point elements in arbitrary order may produce a different value. Specifically, the Vector API intentionally does not define the order of reduction, which allows platforms to generate more efficient code [1]. Hence C2 needs a node to represent non strictly-ordered add-reduction for floating-point types. >> >> To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value. >> >> With this patch, the Vector API would always generate non strictly-ordered `AddReductionVF/D` on SVE machines with vector length <= 16B, as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. >> >> [AArch64] >> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. >> >> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. >> >> No effects on other platforms.
>> >> [Performance] >> FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit). >> >> ADDLanes >> >> Benchmark Before After Unit >> FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms >> >> >> Final code is as below: >> >> Before: >> ` fadda z17.s, p7/m, z17.s, z16.s >> ` >> After: >> >> faddp v17.4s, v21.4s, v21.4s >> faddp s18, v17.2s >> fadd s18, s18, s19 >> >> >> >> >> [Test] >> Full jtreg passed on AArch64 and x86. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 >> [2] https://bugs.openjdk.org/browse/JDK-8275275 >> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments for changes in backend rules and code style Can I please ask for more reviews for this PR? Thank you in advance! ------------- PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-1991824113 From duke at openjdk.org Tue Mar 12 15:41:31 2024 From: duke at openjdk.org (Oussama Louati) Date: Tue, 12 Mar 2024 15:41:31 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v9] In-Reply-To: References: Message-ID: > Completion of the first version of the migration for several tests. > > These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: > > - Generate constant pool entries filled with method handles and method types. > - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. > - Produce many invokedynamic instructions with a specific constant pool entry. 
Oussama Louati has updated the pull request incrementally with one additional commit since the last revision: Use ClassFile to get AccessFlags ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17834/files - new: https://git.openjdk.org/jdk/pull/17834/files/be3f49b6..7056d444 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=07-08 Stats: 34 lines in 12 files changed: 15 ins; 0 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/17834.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834 PR: https://git.openjdk.org/jdk/pull/17834 From dchuyko at openjdk.org Tue Mar 12 15:53:39 2024 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Tue, 12 Mar 2024 15:53:39 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v29] In-Reply-To: References: Message-ID: > Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack. > > A matching directive will be applied at method compilation time when such compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug, issues such a directive but this does not affect the application behavior. 
In such a case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. > > It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypass inlined methods). > > The natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that we can try to re-compile the method, letting the compile broker perform it taking the new directives stack into account. Re-compilation helps to prevent hot methods from executing in the interpreter. > > A new flag `-r` has been introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and puts methods that have any active non-default matching compiler directives to re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. > > In addition, a new diagnostic command `Compiler.replace_directives` has been added for ... Dmitry Chuyko has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 47 commits: - Resolved master conflicts - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - ... and 37 more: https://git.openjdk.org/jdk/compare/782206bc...ff39ac12 ------------- Changes: https://git.openjdk.org/jdk/pull/14111/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14111&range=28 Stats: 381 lines in 15 files changed: 348 ins; 3 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/14111.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14111/head:pull/14111 PR: https://git.openjdk.org/jdk/pull/14111 From duke at openjdk.org Tue Mar 12 16:02:23 2024 From: duke at openjdk.org (Tom Shull) Date: Tue, 12 Mar 2024 16:02:23 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v13] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 18:56:07 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. >> >> This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) >> >> This PR shows up to 19x speedup on buffer sizes of 1MB.
> > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > make vpmadd52l/hq generic src/hotspot/cpu/x86/vm_version_x86.cpp line 312: > 310: __ lea(rsi, Address(rbp, in_bytes(VM_Version::sef_cpuid7_ecx1_offset()))); > 311: __ movl(Address(rsi, 0), rax); > 312: __ movl(Address(rsi, 4), rbx); Hi @vamsi-parasa. I believe this code has a bug in it. Here you are copying back all four registers; however, within https://github.com/openjdk/jdk/blob/782206bc97dc6ae953b0c3ce01f8b6edab4ad30b/src/hotspot/cpu/x86/vm_version_x86.hpp#L468 you only created one field. Can you please open up a JBS issue to fix this? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1521736242 From epeter at openjdk.org Tue Mar 12 16:09:14 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 16:09:14 GMT Subject: RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 08:16:11 GMT, Bhavana Kilambi wrote: >> Floating-point addition is non-associative, that is, adding floating-point elements in arbitrary order may produce a different value. Specifically, the Vector API intentionally does not define the order of reduction, which allows platforms to generate more efficient code [1]. Hence C2 needs a node to represent non strictly-ordered add-reduction for floating-point types. >> >> To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value.
>> >> With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D` on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. >> >> [AArch64] >> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. >> >> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. >> >> No effects on other platforms. >> >> [Performance] >> FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit). >> >> ADDLanes >> >> Benchmark Before After Unit >> FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms >> >> >> Final code is as below: >> >> Before: >> ` fadda z17.s, p7/m, z17.s, z16.s >> ` >> After: >> >> faddp v17.4s, v21.4s, v21.4s >> faddp s18, v17.2s >> fadd s18, s18, s19 >> >> >> >> >> [Test] >> Full jtreg passed on AArch64 and x86. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 >> [2] https://bugs.openjdk.org/browse/JDK-8275275 >> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments for changes in backend rules and code style Looks interesting! Will have a look at it.
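The non-associativity described above is easy to see in plain Java, without the Vector API or C2. The following standalone sketch (class and method names are invented for illustration) sums the same float elements in two orders: in a strict left-to-right sum the small terms are rounded away one by one, while a reassociated sum accumulates them first.

```java
public class FpReductionOrder {
    // Left-to-right sum: the order a scalar loop (and a strictly-ordered
    // add-reduction) must preserve.
    static float sumInOrder(float[] a) {
        float s = 0.0f;
        for (float v : a) {
            s += v;
        }
        return s;
    }

    // Reassociated sum: add the small tail elements first, then the large
    // head, as a non strictly-ordered reduction is free to do.
    static float sumTailFirst(float[] a) {
        float s = 0.0f;
        for (int i = 1; i < a.length; i++) {
            s += a[i];
        }
        return s + a[0];
    }

    public static void main(String[] args) {
        float[] a = new float[9];
        a[0] = 1e8f;                  // one large element
        for (int i = 1; i < a.length; i++) {
            a[i] = 1.0f;              // eight small ones
        }
        float inOrder = sumInOrder(a);     // 1e8f: each +1.0f rounds back down
        float tailFirst = sumTailFirst(a); // 1e8f + 8.0f: the 1.0f terms add up first
        System.out.println(inOrder == tailFirst); // prints false
    }
}
```

Both orders are correct floating-point sums of the same elements, yet they are different values, which is exactly why `AddReductionVF/D` needs the strict/non-strict distinction.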
------------- PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-1992017586 From roland at openjdk.org Tue Mar 12 16:13:14 2024 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 12 Mar 2024 16:13:14 GMT Subject: RFR: 8323972: C2 compilation fails with assert(!x->as_Loop()->is_loop_nest_inner_loop()) failed: loop was transformed In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 14:50:11 GMT, Christian Hagedorn wrote: >> Thanks for reviewing this. >> >>> The fix idea looks reasonable to me. I have two questions: >>> >>> * Do we really need to pin the `CastII` here? We have not pinned the `ConvL2I` before. And here I think we just want to ensure that the type is not lost. >> >> I think it's good practice to set the control of a cast node. It probably doesn't make much of a difference here but we had so many issues with cast nodes that not setting control on cast makes me nervous now. >> >>> * Related to the first question, could we just use a normal dependency instead? >> >> The problem with a normal dependency is that initially the cast and its non transformed input have the same types. So, there is a chance the cast is processed by igvn before its input changes and if that happens, the cast would then be removed. >> >>> I was also wondering if we should try to improve the type of `ConvL2I` and of `Add/Sub` (and possibly also `Mul`) nodes in general? For `ConvL2I`, we could set a better type if we know that `(int)lo <= (int)hi` and `abs(hi - lo) <= 2^32`. We still have a problem to set a better type if we have a narrow range of inputs that includes `min` and `max` (e.g. `min+1, min, max, max-1`). In this case, `ConvL2I` just uses `int` as type. Then we could go a step further and do the same type optimization for `Add/Sub` nodes by directly looking through a convert/cast node at the input type. 
The resulting `Add/Sub` range could maybe be represented by something better than `int`: >>> >>> Example: input type to `ConvL2I`: `[2147483647L, 2147483648L]` -> type of `ConvL2I` is `int` since we cannot represent "`[max_int, min_int]`" with two intervals otherwise. `AddI` = `ConvL2I` + 2 -> type could be improved to `[min_int+1,min_int+2]`. >>> >>> But that might exceed the scope of this fix. Going with `CastII` for now seems to be the least risky. >> >> I thought about that too (I didn't go as far as you did though) and my conclusion is that the change I propose should be more robust (what if the improved type computation still misses some cases that we later find are required) and less risky. > >> I think it's good practice to set the control of a cast node. It probably doesn't make much of a difference here but we had so many issues with cast nodes that not setting control on cast makes me nervous now. > > That is indeed a general problem. The situation certainly got better by removing the code that optimized cast nodes that were pinned at If Projections (https://github.com/openjdk/jdk/commit/7766785098816cfcdae3479540cdc866c1ed18ad). By pinning the casts now, you probably want to prevent the cast nodes from being pushed through nodes such that they float "too high" and cause unforeseeable data graph folding while control is not? > >> The problem with a normal dependency is that initially the cast and its non transformed input have the same types. So, there is a chance the cast is processed by igvn before its input changes and if that happens, the cast would then be removed. > > I see, thanks for the explanation. Then it makes sense to keep the cast node no matter what. >> I thought about that too (I didn't go as far as you did though) and my conclusion is that the change I propose should be more robust (what if the improved type computation still misses some cases that we later find are required) and less risky. > > I agree, this fix should use casts.
Would be interesting to follow this idea in a separate RFE. > That is indeed a general problem. The situation certainly got better by removing the code that optimized cast nodes that were pinned at If Projections ([7766785](https://github.com/openjdk/jdk/commit/7766785098816cfcdae3479540cdc866c1ed18ad)). By pinning the casts now, you probably want to prevent the cast nodes from being pushed through nodes such that they float "too high" and cause unforeseeable data graph folding while control is not? Something like that. I don't see how things could go wrong in this particular case so, quite possibly, the control input is useless. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17965#discussion_r1521752781 From epeter at openjdk.org Tue Mar 12 16:25:15 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 16:25:15 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v2] In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 16:52:07 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently, the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` are in the `BoolNode::Ideal` function, with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that is better suited to the `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. > > Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: > > fix test by adding the missing inversion > > also excluding negative values for unsigned comparison Looks like a reasonable idea. Running tests now. Will review afterwards.
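The invariant behind this fold can be spot-checked in plain Java. The standalone sketch below (class and method names are invented for illustration; this is not the new jtreg test) relies on a simple fact: `x & m` can only clear bits that are set in `m`, so interpreted as an unsigned number the result can never exceed `m` — for every `x` and every `m`, including negative `m`.

```java
import java.util.Random;

public class MaskedUnsignedLe {
    // The invariant behind the fold: (x & m) u<= m holds for all ints,
    // because masking with m can only remove bits of m.
    static boolean holds(int x, int m) {
        return Integer.compareUnsigned(x & m, m) <= 0;
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int x = r.nextInt();
            int m = r.nextInt();
            if (!holds(x, m)) throw new AssertionError(x + " " + m);
        }
        // Edge cases, including negative masks (large unsigned values).
        int[] edges = { 0, 1, -1, Integer.MIN_VALUE, Integer.MAX_VALUE };
        for (int x : edges) {
            for (int m : edges) {
                if (!holds(x, m)) throw new AssertionError(x + " " + m);
            }
        }
        System.out.println("invariant holds");
    }
}
```

Since the comparison is constantly `true` regardless of the inputs, the result is naturally a statement about the node's value range — which is the PR's argument for moving the fold from `Ideal` to `Value`.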
src/hotspot/share/opto/subnode.cpp line 1812: > 1810: int cop = cmp->Opcode(); > 1811: Node *cmp1 = cmp->in(1); > 1812: Node *cmp2 = cmp->in(2); Suggestion: Node* cmp1 = cmp->in(1); Node* cmp2 = cmp->in(2); ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18198#pullrequestreview-1931549333 PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1521763748 From epeter at openjdk.org Tue Mar 12 16:25:16 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 16:25:16 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v2] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 16:18:04 GMT, Emanuel Peter wrote: >> Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: >> >> fix test by adding the missing inversion >> >> also excluding negative values for unsigned comparison > > src/hotspot/share/opto/subnode.cpp line 1812: > >> 1810: int cop = cmp->Opcode(); >> 1811: Node *cmp1 = cmp->in(1); >> 1812: Node *cmp2 = cmp->in(2); > > Suggestion: > > Node* cmp1 = cmp->in(1); > Node* cmp2 = cmp->in(2); Ah, just like @jaskarth already said ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1521764243 From epeter at openjdk.org Tue Mar 12 16:56:19 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 16:56:19 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v3] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 16:51:27 GMT, Emanuel Peter wrote: >> Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains five additional commits since the last revision: >> >> - Merge branch 'master' into round-v-exhaustive-tests >> - fix issue >> - mv tests >> - use IR framework to construct the random tests >> - Initial commit > > test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 124: > >> 122: bits = bits | (1 << 63); >> 123: input[ei_idx*2+1] = Double.longBitsToDouble(bits); >> 124: } > > Why do all this complicated stuff, and not just pick a random `long`, and convert it to double with `Double.longBitsToDouble`? Does this ever generate things like `+0, -0, infty, NaN` etc? > test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 134: > >> 132: for (int sign = 0; sign < 2; sign++) { >> 133: int idx = ei_idx * 2 + sign; >> 134: if (res[idx] != Math.round(input[idx])) { > > Is it ok to use `Math.round` here? What if we compile it, and its computation is wrong in the compilation? This direct comparison tells me that you are not testing `NaN`s... ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521817370 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521817764 From epeter at openjdk.org Tue Mar 12 16:56:19 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 16:56:19 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v3] In-Reply-To: References: Message-ID: On Tue, 27 Feb 2024 20:59:14 GMT, Hamlin Li wrote: >> Hi, >> Can you have a look at this patch adding some tests for Math.round intrinsics? >> Thanks! >> >> ### FYI: >> During the development of RoundVF/RoundF, we faced the issues which were only spotted by running tests exhaustively against the 32/64 bits range of int/long. >> It's helpful to add these exhaustive tests in jdk for future possible usage, rather than build it every time when needed.
>> Of course, we need to put it in `manual` mode, so it's not run when the `-automatic` jtreg option is specified, which I guess is the mode the CI uses; please correct me if I assume incorrectly. > > Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into round-v-exhaustive-tests > - fix issue > - mv tests > - use IR framework to construct the random tests > - Initial commit Thanks for changing to randomness. Thanks very much for your work! I have a few more requests/suggestions/questions :) test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 32: > 30: * @requires vm.compiler2.enabled > 31: * @requires (vm.cpu.features ~= ".*avx512dq.*" & os.simpleArch == "x64") | > 32: * os.simpleArch == "aarch64" We should be able to run the tests on all platforms, with any compiler. But you can add platform restrictions to the IR rules, with `applyIf...`. test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 35: > 33: * > 34: * @library /test/lib / > 35: * @run driver compiler.vectorization.TestRoundVectorDoubleRandom Suggestion: * @run main compiler.vectorization.TestRoundVectorDoubleRandom Driver setting apparently does not allow passing flags from the outside. test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 91: > 89: test_round(res, input); > 90: // skip test/verify when warming up > 91: if (runInfo.isWarmUp()) { Hmm. This means that if there is an OSR compilation during warmup, we would not verify. Are we ok with that?
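To illustrate the input-generation idea raised in the review comments above (a hypothetical standalone sketch, not the test under review): reinterpreting a uniformly random `long` bit pattern with `Double.longBitsToDouble` can produce every possible double — normals, subnormals, `+0.0`/`-0.0`, infinities and NaNs — and `Math.round` has a specified result for all of them, so such inputs are safe to feed through a rounding test.

```java
import java.util.Random;

public class RoundInputSketch {
    public static void main(String[] args) {
        Random r = new Random(0);
        // Every 64-bit pattern is a valid double, so this sampling covers
        // the full encoding space, special values included.
        for (int i = 0; i < 10; i++) {
            double d = Double.longBitsToDouble(r.nextLong());
            System.out.println(d + " -> " + Math.round(d));
        }
        // Math.round is total; the special values have specified results.
        System.out.println(Math.round(Double.NaN));  // prints 0
        System.out.println(Math.round(-0.0));        // prints 0
        System.out.println(Math.round(Double.POSITIVE_INFINITY) == Long.MAX_VALUE); // prints true
        System.out.println(Math.round(Double.NEGATIVE_INFINITY) == Long.MIN_VALUE); // prints true
    }
}
```

Because `Math.round(double)` returns a `long`, comparing results with `!=` is an exact integer comparison even when the input was a NaN, so the only question the generator has to answer is whether those special bit patterns are produced at all.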
test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 101: > 99: final int f_width = e_shift; > 100: final long f_bound = 1 << f_width; > 101: final int f_num = 256; Code style: you are generally not supposed to use under_score for variables, but camelCase, I think. test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 111: > 109: fis[fidx++] = 0; > 110: for (; fidx < f_num; fidx++) { > 111: fis[fidx] = ThreadLocalRandom.current().nextLong(f_bound); Why are you not using `rand`? test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 124: > 122: bits = bits | (1 << 63); > 123: input[ei_idx*2+1] = Double.longBitsToDouble(bits); > 124: } Why do all this complicated stuff, and not just pick a random `long`, and convert it to double with `Double.longBitsToDouble`? test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 134: > 132: for (int sign = 0; sign < 2; sign++) { > 133: int idx = ei_idx * 2 + sign; > 134: if (res[idx] != Math.round(input[idx])) { Is it ok to use `Math.round` here? What if we compile it, and its computation is wrong in the compilation? ------------- Changes requested by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/17753#pullrequestreview-1931612345 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521800236 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521801642 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521807957 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521809457 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521813203 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521815343 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521816420 From mli at openjdk.org Tue Mar 12 17:15:39 2024 From: mli at openjdk.org (Hamlin Li) Date: Tue, 12 Mar 2024 17:15:39 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v4] In-Reply-To: References: Message-ID: > Hi, > Can you help to review the patch to add support for some vector intrinsics? > Also complement various tests on riscv. > Thanks. > > ## Test > test/hotspot/jtreg/compiler/vectorapi/ > test/hotspot/jtreg/compiler/vectorization/ Hamlin Li has updated the pull request incrementally with two additional commits since the last revision: - remove ucast from i/s/b to float - revert some chnage; remove effect(TEMP_DEF dst) for non-extending intrinsics ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18040/files - new: https://git.openjdk.org/jdk/pull/18040/files/646955f0..cc43650b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=02-03 Stats: 93 lines in 1 file changed: 23 ins; 50 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/18040.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18040/head:pull/18040 PR: https://git.openjdk.org/jdk/pull/18040 From mli at openjdk.org Tue Mar 12 17:15:40 2024 From: mli at openjdk.org (Hamlin Li) Date: Tue, 12 Mar 2024 17:15:40 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X 
[v3] In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 07:42:07 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> fix typo > > src/hotspot/cpu/riscv/riscv_v.ad line 3220: > >> 3218: ins_encode %{ >> 3219: BasicType bt = Matcher::vector_element_basic_type(this); >> 3220: if (is_floating_point_type(bt)) { > > Could `bt` (the vector element basic type) be a floating point type for the `VectorUCastB2X` node? I see our aarch64 counterpart has this assertion: `assert(bt == T_SHORT || bt == T_INT || bt == T_LONG, "must be");` [1]. Same question for `VectorUCastS2X` and `VectorUCastI2X` nodes. > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L3752 Yeah, it seems it cannot, and the Vector API does not have these operations either. Fixed, thanks for catching this. > src/hotspot/cpu/riscv/riscv_v.ad line 3397: > >> 3395: predicate(Matcher::vector_element_basic_type(n) == T_FLOAT); >> 3396: match(Set dst (VectorCastL2X src)); >> 3397: effect(TEMP_DEF dst); > > I see you added `TEMP_DEF dst` for some existing instructs like this one here. Do we really need it? > I don't see such a need when reading the overlap constraints on vector operands from the RVV spec [1]: > > > A destination vector register group can overlap a source vector register group only if one of the following holds: > > The destination EEW equals the source EEW. > > The destination EEW is smaller than the source EEW and the overlap is in the lowest-numbered part of the source register group (e.g., when LMUL=1, vnsrl.wi v0, v0, 3 is legal, but a destination of v1 is not). > > The destination EEW is greater than the source EEW, the source EMUL is at least 1, and the overlap is in the highest-numbered part of the destination register group (e.g., when LMUL=8, vzext.vf4 v0, v6 is legal, but a source of v0, v2, or v4 is not).
> > > [1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#sec-vec-operands You're right, thanks for sharing the information. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1521844532 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1521843212 From mli at openjdk.org Tue Mar 12 17:21:24 2024 From: mli at openjdk.org (Hamlin Li) Date: Tue, 12 Mar 2024 17:21:24 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v5] In-Reply-To: References: Message-ID: <8eua4Xmcp4X6a8a8mAithQ4UOyKYV7IgE3KWlkUOHXs=.50937123-aed8-4e57-9f1c-d6927c88eb87@github.com> > Hi, > Can you help to review the patch to add support for some vector intrinsics? > Also complement various tests on riscv. > Thanks. > > ## Test > test/hotspot/jtreg/compiler/vectorapi/ > test/hotspot/jtreg/compiler/vectorization/ Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: - merge master - remove ucast from i/s/b to float - revert some chnage; remove effect(TEMP_DEF dst) for non-extending intrinsics - fix typo - modify test config - clean code - add more tests - rearrange tests layout - merge master - Initial commit ------------- Changes: https://git.openjdk.org/jdk/pull/18040/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=04 Stats: 665 lines in 6 files changed: 636 ins; 11 del; 18 mod Patch: https://git.openjdk.org/jdk/pull/18040.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18040/head:pull/18040 PR: https://git.openjdk.org/jdk/pull/18040 From duke at openjdk.org Tue Mar 12 17:34:21 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 12 Mar 2024 17:34:21 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v13] In-Reply-To: References: Message-ID: <8Awj08UkB3CpLNbWQbQOgOHUPh0PSARYgOK83JDEt0I=.0f634cfa-ec4b-4d49-a849-726fbfb64703@github.com> On Tue, 12 Mar 2024 15:59:59 GMT, Tom Shull 
wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> make vpmadd52l/hq generic > > src/hotspot/cpu/x86/vm_version_x86.cpp line 312: > >> 310: __ lea(rsi, Address(rbp, in_bytes(VM_Version::sef_cpuid7_ecx1_offset()))); >> 311: __ movl(Address(rsi, 0), rax); >> 312: __ movl(Address(rsi, 4), rbx); > > Hi @vamsi-parasa. I believe this code has a bug in it. Here you are copying back all four registers; however, within https://github.com/openjdk/jdk/blob/782206bc97dc6ae953b0c3ce01f8b6edab4ad30b/src/hotspot/cpu/x86/vm_version_x86.hpp#L468 you only created one field. > > Can you please open up a JBS issue to fix this? Hi Tom (@teshull), Thank you for identifying the issue. Please see the JBS issue filed at https://bugs.openjdk.org/browse/JDK-8327999. Will float a new PR to fix this issue soon. Thanks, Vamsi ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1521869268 From chagedorn at openjdk.org Tue Mar 12 17:36:14 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 12 Mar 2024 17:36:14 GMT Subject: RFR: 8327693: C1: LIRGenerator::_instruction_for_operand is only read by assertion code In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 14:08:57 GMT, Denghui Dong wrote: > Hi, > > Please help review this change that moves _instruction_for_operand into ASSERT block since it is only read by assertion code in c1_LinearScan.cpp. > > Thanks Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/18170#pullrequestreview-1931804552 From epeter at openjdk.org Tue Mar 12 17:45:14 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 17:45:14 GMT Subject: RFR: 8323972: C2 compilation fails with assert(!x->as_Loop()->is_loop_nest_inner_loop()) failed: loop was transformed In-Reply-To: References: Message-ID: <_Qm0japYCaj72QGczOLYyKgmkaAA4P5AhG6QPfmd3Ys=.d0e57b13-4cb2-4f05-b902-e655d7f2a123@github.com> On Thu, 22 Feb 2024 14:36:52 GMT, Roland Westrelin wrote: > Long counted loop are transformed into a loop nest of 2 "regular" > loops and in a subsequent loop opts round, the inner loop is > transformed into a counted loop. The limit for the inner loop is set, > when the loop nest is created, so it's expected there's no need for a > loop limit check when the counted loop is created. The assert fires > because, when the counted loop is created, it is found that it needs a > loop limit check. The reason for that is that the limit is > transformed, between nest creation and counted loop creation, in a way > that the range of values of the inner loop's limit becomes > unknown. The limit when the nest is created is: > > > 111 ConL === 0 [[ 112 ]] #long:-9223372034707292158 > 106 Phi === 105 20 94 [[ 112 ]] #long:9223372034707292160..9223372034707292164:www !orig=72 !jvms: TestInaccurateInnerLoopLimit::test @ bci:12 (line 40) > 112 AddL === _ 106 111 [[ 122 ]] !orig=[110] > 122 ConvL2I === _ 112 [[ ]] #int > > > The type of 122 is `2..6` but it is then transformed to: > > > 106 Phi === 105 20 154 [[ 191 130 137 ]] #long:9223372034707292160..9223372034707292164:www !orig=[72] !jvms: TestInaccurateInnerLoopLimit::test @ bci:12 (line 40) > 191 ConvL2I === _ 106 [[ 196 ]] #int > 195 ConI === 0 [[ 196 ]] #int:max-1 > 196 SubI === _ 195 191 [[ 201 127 ]] !orig=[123] > > > That is the `(ConvL2I (AddL ...))` is transformed into a `(SubI > (ConvL2I ))`. 
`ConvL2I` for an input that's out of the int range of > values returns TypeInt::INT and the bounds of the limit are lost. I > propose adding a `CastII` after the `ConvL2I` so the range of values > of the limit doesn't get lost. Looks reasonable, but these ad-hoc CastII also make me nervous. What worries me with adding such "Ad-Hoc" CastII nodes is that elsewhere a very similar computation may not have the same tight type. And then you have a tight type somewhere, and a loose type elsewhere. This is how we get the data-flow collapsing and the cfg not folding. @rwestrel please wait for our testing to complete, I just launched it. test/hotspot/jtreg/compiler/longcountedloops/TestInaccurateInnerLoopLimit.java line 40: > 38: > 39: public static void test() { > 40: for (long i = 9223372034707292164L; i > 9223372034707292158L; i += -2L) { } I'm always amazed at how such simple tests can fail. Is there any way we can improve the test coverage for Long loops? ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17965#pullrequestreview-1931732253 PR Comment: https://git.openjdk.org/jdk/pull/17965#issuecomment-1992220126 PR Review Comment: https://git.openjdk.org/jdk/pull/17965#discussion_r1521846724 From epeter at openjdk.org Tue Mar 12 17:54:15 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 17:54:15 GMT Subject: RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 08:16:11 GMT, Bhavana Kilambi wrote: >> Floating-point addition is non-associative, that is adding floating-point elements in arbitrary order may get different value. Specially, Vector API does not define the order of reduction intentionally, which allows platforms to generate more efficient codes [1]. So that needs a node to represent non strictly-ordered add-reduction for floating-point type in C2. 
>> >> To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value. >> >> With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. >> >> [AArch64] >> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. >> >> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. >> >> No effects on other platforms. >> >> [Performance] >> FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit). >> >> ADDLanes >> >> Benchmark Before After Unit >> FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms >> >> >> Final code is as below: >> >> Before: >> ` fadda z17.s, p7/m, z17.s, z16.s >> ` >> After: >> >> faddp v17.4s, v21.4s, v21.4s >> faddp s18, v17.2s >> fadd s18, s18, s19 >> >> >> >> >> [Test] >> Full jtreg passed on AArch64 and x86. 
>> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 >> [2] https://bugs.openjdk.org/browse/JDK-8275275 >> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments for changes in backend rules and code style A suggestion about naming: We now have a few synonyms: Unordered reduction non-strict order reduction associative reduction I think I introduced the "unordered" one. Not proud of it any more. I think we should probably use (non) associative everywhere. That is the technical/mathematical term. We can use synonyms in the comments to make the explanation more clear though. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-1992235715 From epeter at openjdk.org Tue Mar 12 18:02:14 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 18:02:14 GMT Subject: RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 08:16:11 GMT, Bhavana Kilambi wrote: >> Floating-point addition is non-associative, that is adding floating-point elements in arbitrary order may get different value. Specially, Vector API does not define the order of reduction intentionally, which allows platforms to generate more efficient codes [1]. So that needs a node to represent non strictly-ordered add-reduction for floating-point type in C2. >> >> To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. 
Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value. >> >> With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. >> >> [AArch64] >> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. >> >> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. >> >> No effects on other platforms. >> >> [Performance] >> FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit). >> >> ADDLanes >> >> Benchmark Before After Unit >> FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms >> >> >> Final code is as below: >> >> Before: >> ` fadda z17.s, p7/m, z17.s, z16.s >> ` >> After: >> >> faddp v17.4s, v21.4s, v21.4s >> faddp s18, v17.2s >> fadd s18, s18, s19 >> >> >> >> >> [Test] >> Full jtreg passed on AArch64 and x86. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 >> [2] https://bugs.openjdk.org/browse/JDK-8275275 >> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments for changes in backend rules and code style On a more visionary note: We should make sure that the actual `ReductionNode` gets moved out of the loop, when possible. [JDK-8309647](https://bugs.openjdk.org/browse/JDK-8309647) [Vector API] Move Reduction outside loop when possible We have an RFE for that, I have not yet have time or priority for it. 
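(For reference, the non-associativity that all of this ordering machinery hinges on is easy to demonstrate in plain Java — a standalone sketch, not code from this patch; the three addends are picked only for illustration:)

```java
public class FpAssoc {
    public static void main(String[] args) {
        // The same three addends, reduced in two different orders:
        double left  = (0.1 + 0.2) + 0.3;
        double right = 0.1 + (0.2 + 0.3);
        System.out.println(left);          // 0.6000000000000001
        System.out.println(right);         // 0.6
        System.out.println(left == right); // false
    }
}
```

A strictly-ordered reduction has to commit to one of these orders; an associative (non strictly-ordered) one is free to pick whichever order the hardware does fastest, which is exactly why the flag changes which instructions we may emit.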
Of course the user can already move it out of the loop themselves. If the `ReductionNode` is out of the loop, then you usually just have a very cheap accumulation inside the loop, a `MulVF` for example. That would certainly be cheap enough to allow vectorization. So in that case, your optimization here should not just affect SVE, but also NEON and x86. Why does your patch not do anything for x86? I guess the x86 AD-files have no float/double reduction for the associative case, only the non-associative (strict order) one. But I think it would be easy to implement, just take the code used for int/long etc. reductions. What do you think about that? I'm not saying you have to do it all, or even in this RFE. I'd just like to hear what the bigger plan is, and why you restrict things so much to SVE. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-1992247959 From dlong at openjdk.org Tue Mar 12 18:19:16 2024 From: dlong at openjdk.org (Dean Long) Date: Tue, 12 Mar 2024 18:19:16 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v6] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:12:12 GMT, Galder Zamarreño wrote: >> Adding C1 intrinsic for primitive array clone invocations for aarch64 and x86 architectures. >> >> The intrinsic includes a change to avoid zeroing the newly allocated array because its contents are copied over within the same intrinsic with arraycopy. This means that the performance of primitive array clone exceeds that of primitive array copy. As an example, here are the microbenchmark results on darwin/aarch64: >> >> >> $ make test TEST="micro:java.lang.ArrayClone" MICRO="JAVA_OPTIONS=-XX:TieredStopAtLevel=1"
>> Benchmark                (size) Mode Cnt    Score    Error Units
>> ArrayClone.byteArraycopy      0 avgt  15    3.476 ±  0.018 ns/op
>> ArrayClone.byteArraycopy     10 avgt  15    3.740 ±  0.017 ns/op
>> ArrayClone.byteArraycopy    100 avgt  15    7.124 ±  0.010 ns/op
>> ArrayClone.byteArraycopy   1000 avgt  15   39.301 ±  0.106 ns/op
>> ArrayClone.byteClone          0 avgt  15    3.478 ±  0.008 ns/op
>> ArrayClone.byteClone         10 avgt  15    3.562 ±  0.007 ns/op
>> ArrayClone.byteClone        100 avgt  15    5.888 ±  0.206 ns/op
>> ArrayClone.byteClone       1000 avgt  15   25.762 ±  0.203 ns/op
>> ArrayClone.intArraycopy       0 avgt  15    3.199 ±  0.016 ns/op
>> ArrayClone.intArraycopy      10 avgt  15    4.521 ±  0.008 ns/op
>> ArrayClone.intArraycopy     100 avgt  15   17.429 ±  0.039 ns/op
>> ArrayClone.intArraycopy    1000 avgt  15  178.432 ±  0.777 ns/op
>> ArrayClone.intClone           0 avgt  15    3.406 ±  0.016 ns/op
>> ArrayClone.intClone          10 avgt  15    4.272 ±  0.006 ns/op
>> ArrayClone.intClone         100 avgt  15   13.110 ±  0.122 ns/op
>> ArrayClone.intClone        1000 avgt  15  113.196 ± 13.400 ns/op
>> >> >> It also includes an optimization to avoid instantiating the array copy stub in scenarios like this. >> >> I ran the hotspot compiler tests successfully, limiting them to C1 compilation on darwin/aarch64, linux/x86_64 and linux/686. E.g. >> >> >> $ make test TEST="hotspot_compiler" JTREG="JAVA_OPTIONS=-XX:TieredStopAtLevel=1"
>> ...
>> TEST                                      TOTAL PASS FAIL ERROR
>> jtreg:test/hotspot/jtreg:hotspot_compiler  1234 1234    0     0
>> >> >> One question I had is what to do about non-primitive object arrays, see my [question](https://bugs.openjdk.org/browse/JDK-8302850?focusedId=14634879&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14634879) on the issue. @cl4es any thoughts? >>... > Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits: > > - Merge branch 'master' into topic.0131.c1-array-clone > - Reserve necessary frame map space for clone use cases > - 8302850: C1 primitive array clone intrinsic in graph > > * Combine array length, new type array and arraycopy for clone in c1 graph. > * Add OmitCheckFlags to skip arraycopy checks. > * Instantiate ArrayCopyStub only if necessary.
> * Avoid zeroing newly created arrays for clone. > * Add array null after c1 clone compilation test. > * Pass force reexecute to intrinsic via value stack. > This is needed to be able to deoptimize correctly this intrinsic. > * When new type array or array copy are used for the clone intrinsic, > their state needs to be based on the state before for deoptimization > to work as expected. > - Revert "8302850: Primitive array copy C1 intrinsic for aarch64 and x86" > > This reverts commit fe5d916724614391a685bbef58ea939c84197d07. > - 8302850: Link code emit infos for null check and alloc array > - 8302850: Null check array before getting its length > > * Added a jtreg test to verify the null check works. > Without the fix this test fails with a SEGV crash. > - 8302850: Force reexecuting clone in case of a deoptimization > > * Copy state including locals for clone > so that reexecution works as expected. > - 8302850: Avoid instantiating array copy stub for clone use cases > - 8302850: Primitive array copy C1 intrinsic for aarch64 and x86 > > * Clone calls that involve Phi nodes are not supported. > * Add unimplemented stubs for other platforms. IR expansion in append_alloc_array_copy() looks unconditional. What's going to happen on platforms with no back-end support? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/17667#issuecomment-1992277510 From kxu at openjdk.org Tue Mar 12 18:43:47 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Tue, 12 Mar 2024 18:43:47 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v3] In-Reply-To: References: Message-ID: <6mb_BOei2bIRzPvulo4SkaWGa9EXjiBIFfKTIAAWdCU=.86b2b6f0-7e06-4b4d-9881-593577b43184@github.com> > This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) > > Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. > > New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. 
Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: modification per code review suggestions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18198/files - new: https://git.openjdk.org/jdk/pull/18198/files/17a9dc37..06b7da36 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=01-02 Stats: 8 lines in 2 files changed: 1 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/18198.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18198/head:pull/18198 PR: https://git.openjdk.org/jdk/pull/18198 From kxu at openjdk.org Tue Mar 12 18:43:47 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Tue, 12 Mar 2024 18:43:47 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v2] In-Reply-To: <7b1BIvQpmoLhSzWqQ7haDBTQU1NDuddEm1TK7AgWnwY=.0e5222cc-b20d-4a19-94db-9cad00c6dbff@github.com> References: <7b1BIvQpmoLhSzWqQ7haDBTQU1NDuddEm1TK7AgWnwY=.0e5222cc-b20d-4a19-94db-9cad00c6dbff@github.com> Message-ID: On Mon, 11 Mar 2024 21:13:21 GMT, Jasmine Karthikeyan wrote: >> Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: >> >> fix test by adding the missing inversion >> >> also excluding negative values for unsigned comparison > > I think the cleanup looks good! I have mostly stylistic suggestions here. Also, the copyright header in `subnode.cpp` should be updated to read 2024. Thanks @jaskarth and @eme64 for the review. 
I've pushed a new commit to address the following: - Updated license header year to 2024 - Explicit `nullptr` comparison - `Node* var` for pointer types - Test moved to `c2.irTests`, added `@bug` and `@summary` tags ------------- PR Comment: https://git.openjdk.org/jdk/pull/18198#issuecomment-1992315599 From duke at openjdk.org Tue Mar 12 19:15:21 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 12 Mar 2024 19:15:21 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v13] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 15:59:59 GMT, Tom Shull wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> make vpmadd52l/hq generic > > src/hotspot/cpu/x86/vm_version_x86.cpp line 312: > >> 310: __ lea(rsi, Address(rbp, in_bytes(VM_Version::sef_cpuid7_ecx1_offset()))); >> 311: __ movl(Address(rsi, 0), rax); >> 312: __ movl(Address(rsi, 4), rbx); > > Hi @vamsi-parasa. I believe this code has a bug in it. Here you are copying back all four registers; however, within https://github.com/openjdk/jdk/blob/782206bc97dc6ae953b0c3ce01f8b6edab4ad30b/src/hotspot/cpu/x86/vm_version_x86.hpp#L468 you only created one field. > > Can you please open up a JBS issue to fix this? Hi Tom (@teshull), pls see the PR to fix this issue: https://github.com/openjdk/jdk/pull/18248 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1521996254 From shade at openjdk.org Tue Mar 12 19:38:22 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 12 Mar 2024 19:38:22 GMT Subject: RFR: 8325613: CTW: Stale method cleanup requires GC after Sweeper removal Message-ID: See more details in the bug. 
There is a double-whammy from two issues: a) the Sweeper was removed, and now the cleanup work is done during GC, which does not really happen as CTW barely allocates anything; b) CTW calls for explicit deoptimization often, at which point CTW threads get mostly busy spin-wait-yielding for the deopt epoch to move (that is why you see lots of `sys%`). (a) leads to stale method buildup, which makes (b) progressively worse. This PR adds explicit GC calls to the CTW runner. Since CTW allocates and retains little, those GCs are quite fast. I chose the threshold by running some CTW tests on my machines. I think we are pretty flat in the 25..100 region, so I chose the higher threshold for additional safety. This patch improves both CPU and wall times for CTW testing dramatically, as you can see from the logs below. It still does not recuperate completely to JDK 17 levels, but at least it is not regressing as badly.

--- x86_64 EC2, applications/ctw/modules CTW

jdk17u-dev:             4511.54s user  169.43s system 1209% cpu  6:27.07 total
current mainline:      11678.13s user 8687.06s system 2299% cpu 14:45.62 total
GC every 25 methods:    5050.83s user  670.38s system 1629% cpu  5:51.04 total
GC every 50 methods:    4965.41s user  709.64s system 1670% cpu  5:39.77 total
GC every 100 methods:   4997.34s user  782.12s system 1680% cpu  5:43.99 total
GC every 200 methods:   5237.76s user  943.51s system 1788% cpu  5:45.59 total
GC every 400 methods:   5851.24s user 1443.16s system 1914% cpu  6:20.99 total
GC every 800 methods:   7010.06s user 2649.35s system 2079% cpu  7:44.48 total
GC every 1600 methods:  9361.12s user 5616.84s system 2409% cpu 10:21.68 total

--- Mac M1, applications/ctw/modules/java.base CTW

jdk17u-dev:             171.93s user  25.33s system 157% cpu 2:05.34 total
current mainline:      1128.69s user 349.46s system 249% cpu 9:52.51 total
GC every 25 methods:    252.31s user  29.98s system 172% cpu 2:43.68 total
GC every 50 methods:    232.53s user  28.49s system 170% cpu 2:32.69 total
GC every 100 methods:   237.38s user  34.53s system 169% cpu 2:40.54 total
GC every 200 methods:   251.70s user  39.60s system 172% cpu 2:48.40 total
GC every 400 methods:   271.50s user  42.55s system 185% cpu 2:49.66 total
GC every 800 methods:   389.51s user  69.41s system 204% cpu 3:44.01 total
GC every 1600 methods:  660.98s user 169.97s system 229% cpu 6:01.78 total

------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/18249/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18249&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8325613 Stats: 26 lines in 2 files changed: 24 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18249.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18249/head:pull/18249 PR: https://git.openjdk.org/jdk/pull/18249 From dlong at openjdk.org Tue Mar 12 19:53:13 2024 From: dlong at openjdk.org (Dean Long) Date: Tue, 12 Mar 2024 19:53:13 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 09:26:48 GMT, Roberto Castañeda Lozano wrote: > This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`. > > Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015. > > #### Testing > > - tier1-7 (linux-aarch64 and macosx-aarch64) with `-XX:LockingMode=2`.
Marked as reviewed by dlong (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/18183#pullrequestreview-1932260072 From mli at openjdk.org Tue Mar 12 20:19:17 2024 From: mli at openjdk.org (Hamlin Li) Date: Tue, 12 Mar 2024 20:19:17 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v3] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 16:46:17 GMT, Emanuel Peter wrote: >> Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: >> >> - Merge branch 'master' into round-v-exhaustive-tests >> - fix issue >> - mv tests >> - use IR framework to construct the random tests >> - Initial commit > > test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 91: > >> 89: test_round(res, input); >> 90: // skip test/verify when warming up >> 91: if (runInfo.isWarmUp()) { > > Hmm. This means that if there is an OSR compilation during warmup, we would not verify. Are we ok with that? I'm not sure if it's necessary to verify that situation. But if we verify the result during the warmup, it will take rather longer to finish the test. Please let me know if we need to verify during warmup.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1522060608 From mli at openjdk.org Tue Mar 12 20:19:19 2024 From: mli at openjdk.org (Hamlin Li) Date: Tue, 12 Mar 2024 20:19:19 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v3] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 16:52:40 GMT, Emanuel Peter wrote: >> test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 124: >> >>> 122: bits = bits | (1 << 63); >>> 123: input[ei_idx*2+1] = Double.longBitsToDouble(bits); >>> 124: } >> >> Why do all this complicated stuff, and not just pick a random `long`, and convert it to double with `Double.longBitsToDouble`? > > Does this ever generate things like `+0, -0, infty, NaN` etc? It's testing the following cases: 1. all the `e` range, e.g. for double it's 11 bits, for float it's 8 bits 2. for `f` I add a special value `0` explicitly `fis[fidx++] = 0;` 3. for sign, both `+` and `-` are tested. So, yes, it will test cases like `+/- 0, infty, NaN`. >> test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 134: >> >>> 132: for (int sign = 0; sign < 2; sign++) { >>> 133: int idx = ei_idx * 2 + sign; >>> 134: if (res[idx] != Math.round(input[idx])) { >> >> Is it ok to use `Math.round` here? What if we compile it, and its computation is wrong in the compilation? > > This direct comparison tells me that you are not testing `NaN`s... > Is it ok to use Math.round here? What if we compile it, and its computation is wrong in the compilation? It's a bug, will fix. > This direct comparison tells me that you are not testing NaNs... this comparison is between long values; for NaN, Math.round(NaN) == 0. Or maybe I misunderstood your question?
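For the record, the special values behave like this in plain Java — a quick standalone check of what the `Math.round(double)` javadoc specifies (NaN maps to 0, infinities clamp to the long range), which the interpreter, the compiled code, and the vector intrinsics all have to agree on:

```java
public class RoundSpecials {
    public static void main(String[] args) {
        // NaN rounds to 0, infinities clamp to Long.MIN/MAX_VALUE,
        // and -0.0 rounds to (long) 0:
        System.out.println(Math.round(Double.NaN));               // 0
        System.out.println(Math.round(Double.POSITIVE_INFINITY)); // 9223372036854775807
        System.out.println(Math.round(Double.NEGATIVE_INFINITY)); // -9223372036854775808
        System.out.println(Math.round(-0.0d));                    // 0
    }
}
```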
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1522059691 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1522059293 From mli at openjdk.org Tue Mar 12 20:26:25 2024 From: mli at openjdk.org (Hamlin Li) Date: Tue, 12 Mar 2024 20:26:25 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v4] In-Reply-To: References: Message-ID: > Hi, > Can you have a look at this patch adding some tests for the Math.round intrinsics? > Thanks! > > ### FYI: > During the development of RoundVF/RoundF, we faced issues that were only spotted by running the test exhaustively against the 32/64-bit ranges of int/long. > It's helpful to add these exhaustive tests in the jdk for future use, rather than building them every time they are needed. > Of course, we need to put them in `manual` mode, so they are not run when the `-automatic` jtreg option is specified, which I guess is the mode the CI uses; please correct me if I assume incorrectly. Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: refine code; fix bug ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17753/files - new: https://git.openjdk.org/jdk/pull/17753/files/7eeb3141..e1127c76 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17753&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17753&range=02-03 Stats: 51 lines in 2 files changed: 8 ins; 5 del; 38 mod Patch: https://git.openjdk.org/jdk/pull/17753.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17753/head:pull/17753 PR: https://git.openjdk.org/jdk/pull/17753 From mli at openjdk.org Tue Mar 12 20:29:13 2024 From: mli at openjdk.org (Hamlin Li) Date: Tue, 12 Mar 2024 20:29:13 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v3] In-Reply-To: References: Message-ID: <54MAbAxe9ilnn_NtGTt04-k1IOqhPlegUk4XlMjQDGc=.6762ca45-b6e6-44ab-aeb9-9e0159223df3@github.com> On Tue, 12 Mar 2024 16:54:03 GMT, Emanuel
Peter wrote: > Thanks for changing to randomness. Thanks very much for your work! > > I have a few more requests/suggestions/questions :) Thanks for detailed reviewing and suggestion! :) I resolved some comments, and tried to answer some of your questions, please have a look again. Also have a question: currently I'm generating golden value in following way: @DontCompile long golden_round(double d) { return Math.round(d); } Will it make sure Math.round invocation here are the interpreter version? Or maybe it can be calling the intrinsic version? If that's the case, I think one way to resolve this issue is to copy the piece of library code of Math.round here, I see some existing test cases also use this way to get the golden value. How do you think about it? ------------- PR Comment: https://git.openjdk.org/jdk/pull/17753#issuecomment-1992519401 From jkarthikeyan at openjdk.org Tue Mar 12 22:13:14 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 12 Mar 2024 22:13:14 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v3] In-Reply-To: <6mb_BOei2bIRzPvulo4SkaWGa9EXjiBIFfKTIAAWdCU=.86b2b6f0-7e06-4b4d-9881-593577b43184@github.com> References: <6mb_BOei2bIRzPvulo4SkaWGa9EXjiBIFfKTIAAWdCU=.86b2b6f0-7e06-4b4d-9881-593577b43184@github.com> Message-ID: On Tue, 12 Mar 2024 18:43:47 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. 
> > Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: > > modification per code review suggestions Thanks for the update! Just one more thing from me. test/hotspot/jtreg/compiler/c2/irTests/TestBoolNodeGvn.java line 39: > 37: * @summary Refactor boolean node tautology transformations > 38: * @library /test/lib / > 39: * @run driver compiler.c2.TestBoolNodeGvn Suggestion: * @run driver compiler.c2.irTests.TestBoolNodeGvn Since the test's package changed, this'll need to be changed as well. ------------- PR Review: https://git.openjdk.org/jdk/pull/18198#pullrequestreview-1932781890 PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1522167995 From ddong at openjdk.org Wed Mar 13 00:03:17 2024 From: ddong at openjdk.org (Denghui Dong) Date: Wed, 13 Mar 2024 00:03:17 GMT Subject: RFR: 8327693: C1: LIRGenerator::_instruction_for_operand is only read by assertion code In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 14:08:57 GMT, Denghui Dong wrote: > Hi, > > Please help review this change that moves _instruction_for_operand into ASSERT block since it is only read by assertion code in c1_LinearScan.cpp. > > Thanks Thanks for the review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18170#issuecomment-1992757081 From ddong at openjdk.org Wed Mar 13 00:03:18 2024 From: ddong at openjdk.org (Denghui Dong) Date: Wed, 13 Mar 2024 00:03:18 GMT Subject: Integrated: 8327693: C1: LIRGenerator::_instruction_for_operand is only read by assertion code In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 14:08:57 GMT, Denghui Dong wrote: > Hi, > > Please help review this change that moves _instruction_for_operand into ASSERT block since it is only read by assertion code in c1_LinearScan.cpp. > > Thanks This pull request has now been integrated. 
Changeset: 5d4bfad1 Author: Denghui Dong URL: https://git.openjdk.org/jdk/commit/5d4bfad12b650b9f7c512a071830c58b8f1d020b Stats: 25 lines in 2 files changed: 12 ins; 9 del; 4 mod 8327693: C1: LIRGenerator::_instruction_for_operand is only read by assertion code Reviewed-by: gli, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/18170 From kxu at openjdk.org Wed Mar 13 02:05:39 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Wed, 13 Mar 2024 02:05:39 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v4] In-Reply-To: References: Message-ID: > This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) > > Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. > > New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. 
Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: update the package name for tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18198/files - new: https://git.openjdk.org/jdk/pull/18198/files/06b7da36..e2eb8bf9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=02-03 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18198.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18198/head:pull/18198 PR: https://git.openjdk.org/jdk/pull/18198 From kxu at openjdk.org Wed Mar 13 02:05:40 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Wed, 13 Mar 2024 02:05:40 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v3] In-Reply-To: <6mb_BOei2bIRzPvulo4SkaWGa9EXjiBIFfKTIAAWdCU=.86b2b6f0-7e06-4b4d-9881-593577b43184@github.com> References: <6mb_BOei2bIRzPvulo4SkaWGa9EXjiBIFfKTIAAWdCU=.86b2b6f0-7e06-4b4d-9881-593577b43184@github.com> Message-ID: On Tue, 12 Mar 2024 18:43:47 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. > > Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: > > modification per code review suggestions Oops. Package name updated. Sorry for such a rookie mistake! 
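As a side note, the identity behind the transformation discussed in this thread — `((x & m) u<= m)` and `((m & x) u<= m)` always evaluating to true — can be sanity-checked in plain Java. This is an illustrative standalone sketch, not part of the actual jtreg IR test:

```java
public class BoolIdentityCheck {
    public static void main(String[] args) {
        java.util.Random rnd = new java.util.Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int x = rnd.nextInt();
            int m = rnd.nextInt();
            // x & m can only clear bits of m, never set new ones, so the
            // unsigned comparison (x & m) <= m holds for every x and m.
            // That is exactly why the compiler may fold it to constant true.
            if (Integer.compareUnsigned(x & m, m) > 0
                    || Integer.compareUnsigned(m & x, m) > 0) {
                throw new AssertionError("counterexample: x=" + x + ", m=" + m);
            }
        }
        System.out.println("ok");
    }
}
```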
------------- PR Comment: https://git.openjdk.org/jdk/pull/18198#issuecomment-1993109007 From ddong at openjdk.org Wed Mar 13 02:24:40 2024 From: ddong at openjdk.org (Denghui Dong) Date: Wed, 13 Mar 2024 02:24:40 GMT Subject: RFR: 8327661: C1: Make RBP allocatable on x64 when PreserveFramePointer is disabled [v2] In-Reply-To: References: Message-ID: > Hi, > > Could I have a review of this change that makes RBP allocatable in c1 register allocation when PreserveFramePointer is not enabled. > > There seems no reason that RBP cannot be used. Although the performance of c1 jit code is not very critical, in my opinion, this change will not add overhead of compilation. So maybe it is acceptable. > > I am not very sure if I have changed all the places that should be. > > Performance: > > I wrote a simple JMH included in this patch. > > On Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz > > Before this change: > > > Benchmark Mode Cnt Score Error Units > C1PreserveFramePointer.WithPreserveFramePointer.calculate avgt 16 15.270 ± 0.011 ns/op > C1PreserveFramePointer.WithoutPreserveFramePointer.calculate avgt 16 14.479 ± 0.012 ns/op > > > After this change: > > > Benchmark Mode Cnt Score Error Units > C1PreserveFramePointer.WithPreserveFramePointer.calculate avgt 16 15.264 ± 0.006 ns/op > C1PreserveFramePointer.WithoutPreserveFramePointer.calculate avgt 16 14.057 ±
0.005 ns/op > > > > Testing: fastdebug tier1-4 on Linux x64 Denghui Dong has updated the pull request incrementally with one additional commit since the last revision: fix: rbp should be callee saved ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18167/files - new: https://git.openjdk.org/jdk/pull/18167/files/6e8020fb..a6270736 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18167&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18167&range=00-01 Stats: 18 lines in 5 files changed: 8 ins; 3 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/18167.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18167/head:pull/18167 PR: https://git.openjdk.org/jdk/pull/18167 From ddong at openjdk.org Wed Mar 13 02:33:12 2024 From: ddong at openjdk.org (Denghui Dong) Date: Wed, 13 Mar 2024 02:33:12 GMT Subject: RFR: 8327661: C1: Make RBP allocatable on x64 when PreserveFramePointer is disabled [v2] In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 19:19:32 GMT, Dean Long wrote: >> Denghui Dong has updated the pull request incrementally with one additional commit since the last revision: >> >> fix: rbp should be callee saved > > src/hotspot/cpu/x86/c1_Defs_x86.hpp line 47: > >> 45: >> 46: #ifdef _LP64 >> 47: #define UNALLOCATED 3 // rsp, r15, r10 > > This affects pd_nof_caller_save_cpu_regs_frame_map below, but RBP is callee-saved, not caller-saved. Yes. I updated the patch. I want to confirm, if we treat RBP as caller saved, is there any correctness problem? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18167#discussion_r1522399378 From jkarthikeyan at openjdk.org Wed Mar 13 04:08:13 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 13 Mar 2024 04:08:13 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v4] In-Reply-To: References: Message-ID: <92BvqrZ-rwtf2tU1yJuKAVvgqVaZd8Q7Gfi4PNZBBk8=.ce7d0e76-ea10-47fc-b5c0-78ab7692b482@github.com> On Wed, 13 Mar 2024 02:05:39 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. > > Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: > > update the package name for tests No worries, looks good to me now :) ------------- Marked as reviewed by jkarthikeyan (Author). PR Review: https://git.openjdk.org/jdk/pull/18198#pullrequestreview-1933123222 From ddong at openjdk.org Wed Mar 13 06:49:30 2024 From: ddong at openjdk.org (Denghui Dong) Date: Wed, 13 Mar 2024 06:49:30 GMT Subject: RFR: 8327661: C1: Make RBP allocatable on x64 when PreserveFramePointer is disabled [v3] In-Reply-To: References: Message-ID: > Hi, > > Could I have a review of this change that makes RBP allocatable in c1 register allocation when PreserveFramePointer is not enabled. > > There seems no reason that RBP cannot be used. 
Although the performance of c1 jit code is not very critical, in my opinion this change will not add compilation overhead. So maybe it is acceptable. > > I am not very sure if I have changed all the places that should be. > > Testing: fastdebug tier1-4 on Linux x64 Denghui Dong has updated the pull request incrementally with one additional commit since the last revision: delete jmh ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18167/files - new: https://git.openjdk.org/jdk/pull/18167/files/a6270736..972b12ee Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18167&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18167&range=01-02 Stats: 74 lines in 1 file changed: 0 ins; 74 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18167.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18167/head:pull/18167 PR: https://git.openjdk.org/jdk/pull/18167 From epeter at openjdk.org Wed Mar 13 06:59:21 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 13 Mar 2024 06:59:21 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v4] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 20:26:25 GMT, Hamlin Li wrote: >> Hi, >> Can you have a look at this patch adding some tests for Math.round intrinsics? >> Thanks! >> >> ### FYI: >> During the development of RoundVF/RoundF, we faced issues that were only spotted by running tests exhaustively against the full 32/64-bit range of int/long. >> It's helpful to add these exhaustive tests to the JDK for possible future use, rather than rebuilding them every time they are needed. >> Of course, we need to put it in `manual` mode, so it's not run when the `-automatic` jtreg option is specified, which I guess is the mode the CI uses; please correct me if I'm assuming incorrectly. > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > refine code; fix bug Thanks for adjusting the IR rules!
I still have trouble reviewing your input value generation, and I have a few other comments. Thanks for the work you are putting in, I really appreciate it. test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 30: > 28: * @summary Test vector intrinsic for Math.round(double) in full 64 bits range. > 29: * > 30: * @requires vm.compiler2.enabled Do we really require C2? We should also run this for C1, and any other potential compiler. test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 110: > 108: fis[fidx] = 1 << fidx; > 109: } > 110: fis[fidx++] = 0; The zero is now always in the same spot. What if vectorization messes up only in a specific slot, and then never encounters that zero? We would maybe never see a zero in that bad spot. ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17753#pullrequestreview-1933271647 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1522615886 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1522626600 From epeter at openjdk.org Wed Mar 13 06:59:22 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 13 Mar 2024 06:59:22 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v3] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 20:16:02 GMT, Hamlin Li wrote: >> Does this ever generate things like `+0, -0, infty, NaN` etc? > > It's testing the following cases: > 1. all the `e` range, e.g. for double it's 11 bits, for float it's 8 bits > 2. for `f` I add a special value `0` explicitly `fis[fidx++] = 0;` > 3. for sign, both `+` and `-` are tested. > > So, yes, it will test cases like `+/- 0, infty, NaN`. Can you refactor or at least comment the code a little better, or use more expressive variable names? I'd have to spend a bit of time to understand your generation method here, and to judge whether it is exhaustive and covers the special cases with enough frequency.
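For reference, the kind of generation scheme being discussed can be sketched along these lines. The names here are illustrative, not the ones from the patch; the randomly chosen zero-mantissa slot is one possible answer to the fixed-slot concern raised above:

```java
import java.util.Random;

public class RoundInputGen {
    // Sketch: cover every 11-bit double exponent, pair each with a random
    // 52-bit mantissa, and emit both signs. One randomly chosen exponent
    // slot gets a zero mantissa, so the special value is not always in the
    // same lane when the array is later fed to a vectorized loop.
    static double[] generate(Random rnd) {
        final int expCount = 1 << 11; // 2048 exponents, including 0x000 and 0x7ff
        final int zeroSlot = rnd.nextInt(expCount);
        double[] out = new double[expCount * 2];
        int i = 0;
        for (long e = 0; e < expCount; e++) {
            long mantissa = (e == zeroSlot) ? 0L
                                            : rnd.nextLong() & ((1L << 52) - 1);
            long bits = (e << 52) | mantissa;
            out[i++] = Double.longBitsToDouble(bits);              // + sign
            out[i++] = Double.longBitsToDouble(bits | (1L << 63)); // - sign
        }
        return out;
    }
}
```

Because exponent 0x7ff is included, the set necessarily contains infinities or NaNs, and the zero-mantissa slot plus both signs covers values like +0.0/-0.0 when it lands on exponent 0.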
>> This direct comparison tells me that you are not testing `NaN`s... > >> Is it ok to use Math.round here? What if we compile it, and its computation is wrong in the compilation? > > It's a bug, will fix. > >> This direct comparison tells me that you are not testing NaNs... > > this comparison is between long values; for NaN, Math.round(NaN) == 0. Or maybe I misunderstood your question? Yes, you are right. I somehow thought that `Math.round` returns a float/double. But it is int/long. So exact comparison is good. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1522626925 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1522620080 From stuefe at openjdk.org Wed Mar 13 07:25:20 2024 From: stuefe at openjdk.org (Thomas Stuefe) Date: Wed, 13 Mar 2024 07:25:20 GMT Subject: RFR: JDK-8327986: ASAN reports use-after-free in DirectivesParserTest.empty_object_vm Message-ID: ASAN reports a use-after-free, because we feed the string we got from `setlocale` back to `setlocale`, but the libc owns this string, and the libc decided to free it in the meantime. According to POSIX, it should be valid to pass the output of setlocale back into setlocale. However, glibc seems to delete the old string when calling setlocale again: https://codebrowser.dev/glibc/glibc/locale/setlocale.c.html#198 Best to make a copy, and pass in the copy to setlocale.
------------- Commit messages: - JDK-8327986-ASAN-reports-use-after-free-in-DirectivesParserTest-empty_object_vm Changes: https://git.openjdk.org/jdk/pull/18235/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18235&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327986 Stats: 3 lines in 1 file changed: 1 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18235.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18235/head:pull/18235 PR: https://git.openjdk.org/jdk/pull/18235 From sspitsyn at openjdk.org Wed Mar 13 07:46:18 2024 From: sspitsyn at openjdk.org (Serguei Spitsyn) Date: Wed, 13 Mar 2024 07:46:18 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v29] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 15:53:39 GMT, Dmitry Chuyko wrote: >> Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack. >> >> A matching directive will be applied at method compilation time when such compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug, issues such a directive but this does not affect the application behavior. In such case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. 
>> >> It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypass inlined methods). >> >> The natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that, we can try to re-compile the method, letting the compile broker perform it while taking the new directives stack into account. Re-compilation helps to prevent hot methods from executing in the interpreter. >> >> A new flag `-r` has been introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and submits methods that have any active non-default matching compiler directives for re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction as to which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. >> >> In addition, a new diagnostic command `Compiler.replace_directives... > > Dmitry Chuyko has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 47 commits: > > - Resolved master conflicts > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - ... and 37 more: https://git.openjdk.org/jdk/compare/782206bc...ff39ac12 src/hotspot/share/ci/ciEnv.cpp line 1144: > 1142: > 1143: if (entry_bci == InvocationEntryBci) { > 1144: if (TieredCompilation) { Just a naive question: why has this check been removed? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14111#discussion_r1522682325 From sspitsyn at openjdk.org Wed Mar 13 07:51:18 2024 From: sspitsyn at openjdk.org (Serguei Spitsyn) Date: Wed, 13 Mar 2024 07:51:18 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v29] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 15:53:39 GMT, Dmitry Chuyko wrote: >> Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack. >> >> A matching directive will be applied at method compilation time when such compilation is started.
If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug, issues such a directive but this does not affect the application behavior. In such case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. >> >> It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypass inlined methods). >> >> The natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that, we can try to re-compile the method, letting the compile broker perform it while taking the new directives stack into account. Re-compilation helps to prevent hot methods from executing in the interpreter. >> >> A new flag `-r` has been introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and submits methods that have any active non-default matching compiler directives for re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction as to which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them.
>> >> In addition, a new diagnostic command `Compiler.replace_directives... > > Dmitry Chuyko has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 47 commits: > > - Resolved master conflicts > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - ... and 37 more: https://git.openjdk.org/jdk/compare/782206bc...ff39ac12 src/hotspot/share/services/diagnosticCommand.cpp line 928: > 926: DCmdWithParser(output, heap), > 927: _filename("filename", "Name of the directives file", "STRING", true), > 928: _refresh("-r", "Refresh affected methods.", "BOOLEAN", false, "false") { Nit: The dot is not needed at the end, I think. The same applies to lines: 945, 970 and 987. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14111#discussion_r1522688369 From roland at openjdk.org Wed Mar 13 08:01:12 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 13 Mar 2024 08:01:12 GMT Subject: RFR: 8325613: CTW: Stale method cleanup requires GC after Sweeper removal In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 19:11:42 GMT, Aleksey Shipilev wrote: > See more details in the bug. 
There is a double-whammy from two issues: a) Sweeper was removed, and now the cleanup work is done during GC, which does not really happen as CTW barely allocates anything; b) CTW calls for explicit deoptimization often, at which point CTW threads get mostly busy at spin-waiting-yielding for deopt epoch to move (that is why you see lots of `sys%`). (a) leads to stale method buildup, which makes (b) progressively worse. > > This PR adds explicit GC calls to the CTW runner. Since CTW allocates and retains a little, those GCs are quite fast. I chose the threshold by running some CTW tests on my machines. I think we are pretty flat in the 25..100 region, so I chose the higher threshold for additional safety. > > This patch improves both CPU and wall times for CTW testing dramatically, as you can see from the logs below. It still does not recuperate completely to JDK 17 levels, but at least it is not regressing as badly. > > > --- x86_64 EC2, applications/ctw/modules CTW > > jdk17u-dev: 4511.54s user 169.43s system 1209% cpu 6:27.07 total > current mainline: 11678.13s user 8687.06s system 2299% cpu 14:45.62 total > > GC every 25 methods: 5050.83s user 670.38s system 1629% cpu 5:51.04 total > GC every 50 methods: 4965.41s user 709.64s system 1670% cpu 5:39.77 total > GC every 100 methods: 4997.34s user 782.12s system 1680% cpu 5:43.99 total > GC every 200 methods: 5237.76s user 943.51s system 1788% cpu 5:45.59 total > GC every 400 methods: 5851.24s user 1443.16s system 1914% cpu 6:20.99 total > GC every 800 methods: 7010.06s user 2649.35s system 2079% cpu 7:44.48 total > GC every 1600 methods: 9361.12s user 5616.84s system 2409% cpu 10:21.68 total > > --- Mac M1, applications/ctw/modules/java.base CTW > > jdk17u-dev: 171.93s user 25.33s system 157% cpu 2:05.34 total > current mainline: 1128.69s user 349.46s system 249% cpu 9:52.51 total > > GC every 25 methods: 252.31s user 29.98s system 172% cpu 2:43.68 total > GC every 50 methods: 232.53s user 28.49s system 170% cpu 2:32.69
total > GC every 100 methods: 237.38s user 34.53s system 169% cpu 2:40.54 total > GC every 200 methods: 251.70s user 39.60s system 172% cpu 2:48.40 total > GC every 400 methods: 271.50s user 42.55s system 185% cpu 2:49.66 total > GC every 800 methods: 389.51s user 69.41s system 204% cpu 3:44.01 total > GC every 1600 methods: 660.98s user 169.97s system 229% cpu 6:01.78 total Looks reasonable to me. ------------- Marked as reviewed by roland (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18249#pullrequestreview-1933379637 From djelinski at openjdk.org Wed Mar 13 08:06:13 2024 From: djelinski at openjdk.org (Daniel =?UTF-8?B?SmVsacWEc2tp?=) Date: Wed, 13 Mar 2024 08:06:13 GMT Subject: RFR: JDK-8327986: ASAN reports use-after-free in DirectivesParserTest.empty_object_vm In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 13:57:53 GMT, Thomas Stuefe wrote: > ASAN reports a use-after-free, because we feed the string we got from `setlocale` back to `setlocale`, but the libc owns this string, and the libc decided to free it in the meantime. > > According to POSIX, it should be valid to pass into setlocale output from setlocale. > > However, glibc seems to delete the old string when calling setlocale again: > > https://codebrowser.dev/glibc/glibc/locale/setlocale.c.html#198 > > Best to make a copy, and pass in the copy to setlocale. test/hotspot/gtest/compiler/test_directivesParser.cpp line 39: > 37: // These tests require the "C" locale to correctly parse decimal values > 38: DirectivesParserTest() : _locale(os::strdup(setlocale(LC_NUMERIC, nullptr), mtTest)) { > 39: setlocale(LC_NUMERIC, "C"); Would it fix the issue if we did this instead? Suggestion: DirectivesParserTest() : _locale(setlocale(LC_NUMERIC, "C")) { seems to me that the string returned by setlocale is only valid until the next setlocale call, and currently we call setlocale twice in the constructor, and save the result of the first call. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18235#discussion_r1522707838 From rcastanedalo at openjdk.org Wed Mar 13 08:17:17 2024 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 13 Mar 2024 08:17:17 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 09:26:48 GMT, Roberto Castañeda Lozano wrote: > This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`. > > Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015. > > #### Testing > > - tier1-7 (linux-aarch64 and macosx-aarch64) with `-XX:LockingMode=2`. Thanks for reviewing, Axel and Dean! And thanks Axel for trying out Dean's suggestion!
------------- PR Comment: https://git.openjdk.org/jdk/pull/18183#issuecomment-1993783549 From rcastanedalo at openjdk.org Wed Mar 13 08:17:18 2024 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 13 Mar 2024 08:17:18 GMT Subject: Integrated: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 09:26:48 GMT, Roberto Castañeda Lozano wrote: > This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`. > > Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015. > > #### Testing > > - tier1-7 (linux-aarch64 and macosx-aarch64) with `-XX:LockingMode=2`. This pull request has now been integrated.
Changeset: 07acc0bb Author: Roberto Castañeda Lozano URL: https://git.openjdk.org/jdk/commit/07acc0bbad2cd5b37013d17785ca466429966a0d Stats: 8 lines in 1 file changed: 0 ins; 0 del; 8 mod 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect Reviewed-by: aboldtch, dlong ------------- PR: https://git.openjdk.org/jdk/pull/18183 From stuefe at openjdk.org Wed Mar 13 08:30:13 2024 From: stuefe at openjdk.org (Thomas Stuefe) Date: Wed, 13 Mar 2024 08:30:13 GMT Subject: RFR: JDK-8327986: ASAN reports use-after-free in DirectivesParserTest.empty_object_vm In-Reply-To: References: Message-ID: <-IZ9AM14NCXAh8wTczqvc9WM77wOt2D9JD7ACF0SGxg=.80d254b7-5b92-451f-9897-5edfccb389df@github.com> On Wed, 13 Mar 2024 08:03:49 GMT, Daniel Jeliński wrote: >> ASAN reports a use-after-free, because we feed the string we got from `setlocale` back to `setlocale`, but the libc owns this string, and the libc decided to free it in the meantime. >> >> According to POSIX, it should be valid to pass into setlocale output from setlocale. >> >> However, glibc seems to delete the old string when calling setlocale again: >> >> https://codebrowser.dev/glibc/glibc/locale/setlocale.c.html#198 >> >> Best to make a copy, and pass in the copy to setlocale. > > test/hotspot/gtest/compiler/test_directivesParser.cpp line 39: > >> 37: // These tests require the "C" locale to correctly parse decimal values >> 38: DirectivesParserTest() : _locale(os::strdup(setlocale(LC_NUMERIC, nullptr), mtTest)) { >> 39: setlocale(LC_NUMERIC, "C"); > > Would it fix the issue if we did this instead? > > Suggestion: > > DirectivesParserTest() : _locale(setlocale(LC_NUMERIC, "C")) { > > > seems to me that the string returned by setlocale is only valid until the next setlocale call, and currently we call setlocale twice in the constructor, and save the result of the first call. No. The first setlocale call returns the pointer to the last locale, which becomes invalid.
Changing the input string on the first setlocale call won't change that. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18235#discussion_r1522743065 From djelinski at openjdk.org Wed Mar 13 09:19:14 2024 From: djelinski at openjdk.org (Daniel =?UTF-8?B?SmVsacWEc2tp?=) Date: Wed, 13 Mar 2024 09:19:14 GMT Subject: RFR: JDK-8327986: ASAN reports use-after-free in DirectivesParserTest.empty_object_vm In-Reply-To: <-IZ9AM14NCXAh8wTczqvc9WM77wOt2D9JD7ACF0SGxg=.80d254b7-5b92-451f-9897-5edfccb389df@github.com> References: <-IZ9AM14NCXAh8wTczqvc9WM77wOt2D9JD7ACF0SGxg=.80d254b7-5b92-451f-9897-5edfccb389df@github.com> Message-ID: <9RM-tJQcA0tEL5iOy7UOE6XJAqIphmYfJyy6Ydgkmm4=.e96b3e85-bd3c-426d-aeb7-3e868294fbb3@github.com> On Wed, 13 Mar 2024 08:27:27 GMT, Thomas Stuefe wrote: >> test/hotspot/gtest/compiler/test_directivesParser.cpp line 39: >> >>> 37: // These tests require the "C" locale to correctly parse decimal values >>> 38: DirectivesParserTest() : _locale(os::strdup(setlocale(LC_NUMERIC, nullptr), mtTest)) { >>> 39: setlocale(LC_NUMERIC, "C"); >> >> Would it fix the issue if we did this instead? >> >> Suggestion: >> >> DirectivesParserTest() : _locale(setlocale(LC_NUMERIC, "C")) { >> >> >> seems to me that the string returned by setlocale is only valid until the next setlocale call, and currently we call setlocale twice in the constructor, and save the result of the first call. > > No. The first setlocale call returns the pointer to the last locale, which becomes invalid. Changing the input string on the first setlocale call won't change that. Ah. I was misled by the `setlocale` docs: > The string returned is such that a subsequent call with that string and its associated category will restore that part of the process's locale. apparently it doesn't restore them _to the previous value_, as I incorrectly assumed.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18235#discussion_r1522820269 From thartmann at openjdk.org Wed Mar 13 10:21:14 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 13 Mar 2024 10:21:14 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v5] In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 14:22:11 GMT, Christian Hagedorn wrote: >> In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. >> >> #### Redo refactoring of `create_bool_from_template_assertion_predicate()` >> On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). >> >> #### Share data graph cloning code - start from existing code >> This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: >> >> 1. Collect data nodes to clone by using a node filter >> 2. Clone the collected nodes (their data and control inputs still point to the old nodes) >> 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. 
In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. >> >> #### Shared data graph cloning class >> Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: >> >> 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] >> 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to ... > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > format That looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18080#pullrequestreview-1933708125 From duke at openjdk.org Wed Mar 13 10:34:42 2024 From: duke at openjdk.org (Oussama Louati) Date: Wed, 13 Mar 2024 10:34:42 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v10] In-Reply-To: References: Message-ID: > Completion of the first version of the migration for several tests. > > These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: > > - Generate constant pool entries filled with method handles and method types. > - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. 
> - Produce many invokedynamic instructions with a specific constant pool entry. Oussama Louati has updated the pull request incrementally with two additional commits since the last revision: - halfway through this migration, had to switch to other test group and out these aside - Use ClassFile to get AccessFlags ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17834/files - new: https://git.openjdk.org/jdk/pull/17834/files/7056d444..527384d3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=08-09 Stats: 40 lines in 8 files changed: 20 ins; 10 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/17834.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834 PR: https://git.openjdk.org/jdk/pull/17834 From fyang at openjdk.org Wed Mar 13 13:38:15 2024 From: fyang at openjdk.org (Fei Yang) Date: Wed, 13 Mar 2024 13:38:15 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v5] In-Reply-To: <8eua4Xmcp4X6a8a8mAithQ4UOyKYV7IgE3KWlkUOHXs=.50937123-aed8-4e57-9f1c-d6927c88eb87@github.com> References: <8eua4Xmcp4X6a8a8mAithQ4UOyKYV7IgE3KWlkUOHXs=.50937123-aed8-4e57-9f1c-d6927c88eb87@github.com> Message-ID: On Tue, 12 Mar 2024 17:21:24 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review the patch to add support for some vector intrinsics? >> Also complement various tests on riscv. >> Thanks. >> >> ## Test >> test/hotspot/jtreg/compiler/vectorapi/ >> test/hotspot/jtreg/compiler/vectorization/ > > Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: > > - merge master > - remove ucast from i/s/b to float > - revert some chnage; remove effect(TEMP_DEF dst) for non-extending intrinsics > - fix typo > - modify test config > - clean code > - add more tests > - rearrange tests layout > - merge master > - Initial commit Thanks for the quick update. Two minor comments remain. 
Looks good otherwise. src/hotspot/cpu/riscv/assembler_riscv.hpp line 1284: > 1282: INSN(vfwcvt_f_f_v, 0b1010111, 0b001, 0b01100, 0b010010); > 1283: INSN(vfwcvt_rtz_x_f_v, 0b1010111, 0b001, 0b01111, 0b010010); > 1284: INSN(vfwcvt_rtz_xu_f_v, 0b1010111, 0b001, 0b01110, 0b010010); I see no use of these newly added assembler functions. So test coverage would be an issue. Maybe add them in the future when they are really needed? src/hotspot/cpu/riscv/riscv_v.ad line 3215: > 3213: %} > 3214: > 3215: instruct vcvtUBtoX_extend(vReg dst, vReg src) %{ Personally, I don't like the `_extend` suffix in the instruct name. I prefer names like `vzeroExtBtoX` which make it explicit that this will zero-extend the vector elements. Or simply `vcvtUBtoX`. ------------- PR Review: https://git.openjdk.org/jdk/pull/18040#pullrequestreview-1934187709 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1523264264 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1523273365 From chagedorn at openjdk.org Wed Mar 13 14:01:24 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 13 Mar 2024 14:01:24 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v5] In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 14:22:11 GMT, Christian Hagedorn wrote: >> In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. >> >> #### Redo refactoring of `create_bool_from_template_assertion_predicate()` >> On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. 
We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). >> >> #### Share data graph cloning code - start from existing code >> This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: >> >> 1. Collect data nodes to clone by using a node filter >> 2. Clone the collected nodes (their data and control inputs still point to the old nodes) >> 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. >> >> #### Shared data graph cloning class >> Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: >> >> 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] >> 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to ... 
> > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > format Thanks for your review Tobias! ------------- PR Comment: https://git.openjdk.org/jdk/pull/18080#issuecomment-1994470269 From chagedorn at openjdk.org Wed Mar 13 14:01:25 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 13 Mar 2024 14:01:25 GMT Subject: Integrated: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 13:27:38 GMT, Christian Hagedorn wrote: > In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. > > #### Redo refactoring of `create_bool_from_template_assertion_predicate()` > On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). > > #### Share data graph cloning code - start from existing code > This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: > > 1. Collect data nodes to clone by using a node filter > 2. Clone the collected nodes (their data and control inputs still point to the old nodes) > 3. 
Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. > > #### Shared data graph cloning class > Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: > > 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] > 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] > > `... This pull request has now been integrated. Changeset: 7d8561d5 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/7d8561d56bf064e388417530b9b71755e4ac3f76 Stats: 137 lines in 5 files changed: 72 ins; 34 del; 31 mod 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class Reviewed-by: epeter, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/18080 From mli at openjdk.org Wed Mar 13 16:32:51 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 13 Mar 2024 16:32:51 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v5] In-Reply-To: References: Message-ID: > Hi, > Can you have a look at this patch adding some tests for Math.round intrinsics? > Thanks!
> > ### FYI: > During the development of RoundVF/RoundF, we faced issues which were only spotted by running tests exhaustively against the 32/64-bit range of int/long. > It's helpful to add these exhaustive tests in the jdk for future possible usage, rather than build them every time when needed. > Of course, we need to put them in `manual` mode, so they are not run when the `-automatic` jtreg option is specified, which I guess is the mode CI uses; please correct me if I assume incorrectly. Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: add comments; refine code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17753/files - new: https://git.openjdk.org/jdk/pull/17753/files/e1127c76..2afa8160 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17753&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17753&range=03-04 Stats: 36 lines in 2 files changed: 24 ins; 6 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/17753.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17753/head:pull/17753 PR: https://git.openjdk.org/jdk/pull/17753 From mli at openjdk.org Wed Mar 13 16:32:51 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 13 Mar 2024 16:32:51 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v3] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 06:55:55 GMT, Emanuel Peter wrote: >> It's testing the following cases: >> 1. all the `e` range, e.g. for double it's 11 bits, for float it's 8 bits >> 2. for `f` I add a special value `0` explicitly `fis[fidx++] = 0;` >> 3. for sign, both `+` and `-` are tested. >> >> So, yes, it will test cases like `+/- 0, infty, NaN`. > > Can you refactor or at least comment the code a little better, or use more expressive variable names? > I'd have to spend a bit of time to understand your generation method here, and if I think that it is exhaustive and covers the special cases with enough frequency.
Sure, I will add some comments to illustrate how it works. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1523575330 From mli at openjdk.org Wed Mar 13 16:32:51 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 13 Mar 2024 16:32:51 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v4] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 06:55:38 GMT, Emanuel Peter wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> refine code; fix bug > > test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 110: > >> 108: fis[fidx] = 1 << fidx; >> 109: } >> 110: fis[fidx++] = 0; > > The zero is now always in the same spot. What if vectorization messes up only in a specific slot, and then never encounters that zero? We would maybe never see a zero in that bad spot. Good point, will make it random, hope this will resolve the issue. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1523577070 From mli at openjdk.org Wed Mar 13 16:39:15 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 13 Mar 2024 16:39:15 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v4] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 06:56:58 GMT, Emanuel Peter wrote: > Thanks for adjusting the IR rules! > > I still have trouble reviewing your input value generation, and a few other comments. > > Thanks for the work you are putting in, I really appreciate it :)
------------- PR Comment: https://git.openjdk.org/jdk/pull/17753#issuecomment-1994917255 From mli at openjdk.org Wed Mar 13 16:45:19 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 13 Mar 2024 16:45:19 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v5] In-Reply-To: References: <8eua4Xmcp4X6a8a8mAithQ4UOyKYV7IgE3KWlkUOHXs=.50937123-aed8-4e57-9f1c-d6927c88eb87@github.com> Message-ID: On Wed, 13 Mar 2024 13:29:32 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: >> >> - merge master >> - remove ucast from i/s/b to float >> - revert some chnage; remove effect(TEMP_DEF dst) for non-extending intrinsics >> - fix typo >> - modify test config >> - clean code >> - add more tests >> - rearrange tests layout >> - merge master >> - Initial commit > > src/hotspot/cpu/riscv/assembler_riscv.hpp line 1284: > >> 1282: INSN(vfwcvt_f_f_v, 0b1010111, 0b001, 0b01100, 0b010010); >> 1283: INSN(vfwcvt_rtz_x_f_v, 0b1010111, 0b001, 0b01111, 0b010010); >> 1284: INSN(vfwcvt_rtz_xu_f_v, 0b1010111, 0b001, 0b01110, 0b010010); > > I see no use of these newly added assembler functions. So test coverage would be an issue. Maybe add them in the future when they are really needed? Sure, will fix. > src/hotspot/cpu/riscv/riscv_v.ad line 3215: > >> 3213: %} >> 3214: >> 3215: instruct vcvtUBtoX_extend(vReg dst, vReg src) %{ > > Personally, I don't like the `_extend` suffix in the instruct name. I prefer names like `vzeroExtBtoX` which make it explicit that this will zero-extend the vector elements. Or simply `vcvtUBtoX`. 
Agree ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1523595888 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1523599044 From dchuyko at openjdk.org Wed Mar 13 16:58:23 2024 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Wed, 13 Mar 2024 16:58:23 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v29] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 07:43:35 GMT, Serguei Spitsyn wrote: >> Dmitry Chuyko has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 47 commits: >> >> - Resolved master conflicts >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - ... and 37 more: https://git.openjdk.org/jdk/compare/782206bc...ff39ac12 > > src/hotspot/share/ci/ciEnv.cpp line 1144: > >> 1142: >> 1143: if (entry_bci == InvocationEntryBci) { >> 1144: if (TieredCompilation) { > > Just a naive question. Why this check has been removed? 
We want to allow replacement of a C2 method version by another C2 version of the same method in both tiered and non-tiered mode, which was not allowed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14111#discussion_r1523618332 From mli at openjdk.org Wed Mar 13 17:05:41 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 13 Mar 2024 17:05:41 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v6] In-Reply-To: References: Message-ID: > Hi, > Can you help to review the patch to add support for some vector intrinsics? > Also complement various tests on riscv. > Thanks. > > ## Test > test/hotspot/jtreg/compiler/vectorapi/ > test/hotspot/jtreg/compiler/vectorization/ Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: remove unused instructions; rename instructions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18040/files - new: https://git.openjdk.org/jdk/pull/18040/files/179046b3..3fb61768 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=04-05 Stats: 12 lines in 2 files changed: 0 ins; 4 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/18040.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18040/head:pull/18040 PR: https://git.openjdk.org/jdk/pull/18040 From dchuyko at openjdk.org Wed Mar 13 17:14:28 2024 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Wed, 13 Mar 2024 17:14:28 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v30] In-Reply-To: References: Message-ID: > Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command.
It is also possible to clear all directives or remove the top from the stack. > > A matching directive will be applied at method compilation time when such compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug, issues such a directive but this does not affect the application behavior. In such a case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. > > It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypass inlined methods). > > The natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that, we can try to re-compile the method, letting the compile broker perform it while taking the new directives stack into account. Re-compilation helps to prevent hot methods from executing in the interpreter. > > A new flag `-r` has been introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and puts methods that have any active non-default matching compiler directives to re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed.
On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. > > In addition, a new diagnostic command `Compiler.replace_directives` has been added for ... Dmitry Chuyko has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 48 commits: - Merge branch 'openjdk:master' into compiler-directives-force-update - Resolved master conflicts - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - ...
and 38 more: https://git.openjdk.org/jdk/compare/5cae7d20...22b42347 ------------- Changes: https://git.openjdk.org/jdk/pull/14111/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14111&range=29 Stats: 381 lines in 15 files changed: 348 ins; 3 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/14111.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14111/head:pull/14111 PR: https://git.openjdk.org/jdk/pull/14111 From bkilambi at openjdk.org Wed Mar 13 17:20:16 2024 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 13 Mar 2024 17:20:16 GMT Subject: RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 17:59:17 GMT, Emanuel Peter wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments for changes in backend rules and code style > > On a more visionary note: > > We should make sure that the actual `ReductionNode` gets moved out of the loop, when possible. > [JDK-8309647](https://bugs.openjdk.org/browse/JDK-8309647) [Vector API] Move Reduction outside loop when possible > We have an RFE for that, I have not yet had time or priority for it. > Of course the user can already move it out of the loop themselves. > > If the `ReductionNode` is out of the loop, then you usually just have a very cheap accumulation inside the loop, a `MulVF` for example. That would certainly be cheap enough to allow vectorization. > > So in that case, your optimization here should not just affect SVE, but also NEON and x86. > > Why does your patch not do anything for x86? I guess x86 AD-files have no float/double reduction for the associative case, only the non-associative (strict order). But I think it would be easy to implement, just take the code used for int/long etc reductions. > > What do you think about that? > I'm not saying you have to do it all, or even in this RFE.
I'd just like to hear what is the bigger plan, and why you restrict things so much to SVE. @eme64 Thank you so much for your review and feedback comments. Here are my responses to your questions - **> So in that case, your optimization here should not just affect SVE, but also NEON and x86.** From what I understand, even when the reduction nodes are hoisted out of the loop, it would still generate AddReductionVF/VD nodes (fewer as we accumulate inside the loop now) and based on the choice of order the corresponding backend match rules (as included in this patch) should generate the expected instruction sequence. I don't think we would need any code changes for Neon/SVE after hoisting the reductions out of the loop. Please let me know if my understanding is incorrect. **> Why does your patch not do anything for x86? I guess x86 AD-files have no float/double reduction for the associative case, only the non-associative (strict order). But I think it would be easy to implement, just take the code used for int/long etc reductions.** Well, what I meant was that the changes in this patch (specifically the mid-end part) do not break/change anything in x86 (or any other platform). Yes, the current *.ad files might have rules only for the strict order case and more rules can be added for the non-associative case if that benefits x86 (same for other platforms). So for aarch64, we have different instruction(s) for floating-point strict order/non-strict order and we know which ones are beneficial to be generated on which aarch64 machines. However, I am not well versed with x86 ISA and would request anyone from Intel or someone who has the expertise with x86 ISA to make x86 related changes please (if required). **> What do you think about that? I'm not saying you have to do it all, or even in this RFE.
I'd just like to hear what is the bigger plan, and why you restrict things to much to SVE.** To give a background : The motivation for this patch was a significant performance degradation with SVE instructions compared to Neon for this testcase - FloatMaxVector.ADDLanes on a 128-bit SVE machine. It generates the SVE "fadda" instruction which is a strictly-ordered floating-point add reduction instruction. As it has a higher cost compared to the Neon implementation for FP add reduction, the performance with "fadda" was ~66% worse compared to Neon. As VectorAPI does not impose any rules on FP ordering, it could have generated the faster non-strict Neon instructions instead (on a 128-bit SVE machine). The reason we included a flag "requires_strict_order" to mark a reduction node as strictly-ordered or non-strictly ordered and generate the corresponding backend instructions. On aarch64, this patch only affects the 128-bit SVE machines. On SVE machines >128bits, the "fadda" instruction is generated as it was before this patch. There's no change on Neon as well - the non-strict Ne on instructions are generated with VectorAPI and no auto-vectorization is allowed for FP reduction nodes. Although this change was done keeping SVE in mind, this patch can help generate strictly ordered or non-strictly ordered code on other platforms as well (if they have different implementations for both) and also simplifies the IdealGraph a bit by removing the UnorderedReductionNode. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-1995054812 From jbhateja at openjdk.org Wed Mar 13 17:25:23 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 13 Mar 2024 17:25:23 GMT Subject: RFR: 8319889: Vector API tests trigger VM crashes with -XX:+StressIncrementalInlining Message-ID: <25mULcFaZsaLw0z88ywNaq3virkrimZa-rpgWO8d3Dc=.04909c6c-42f4-4981-895d-2d7f07b027d5@github.com> This bug fix patch fixes the following two issues: 1) Removing memory operand based masked shift instruction selection patterns. As per Java specification section 15.19, the shift count is rounded to fit within the valid shift range by performing a bitwise AND with shift_mask; this results in the creation of an AndV IR node after loading the original mask into a vector. Existing patterns are not able to match this graph shape. Extending the pattern to cover the AndV IR would associate the memory operand with the And operation, and we would need to emit an additional vector AND instruction before the shift instruction; the existing memory operand pattern for AndV already handles such a graph shape. 2) A crash occurs due to the combined effect of bi-morphic inlining, exception handling, and randomized incremental inlining. In this case the top-level slice API is invoked using a concrete 256-bit vector. Due to randomized IncrementalInlining, some of the intermediate APIs within sliceTemplate are marked for lazy inlining; these APIs return an abstract vector which, when used for virtual dispatch of subsequent APIs, results in bi-morphic inlining on account of multiple profile-based receiver types. Consider the following code snippet.
ByteVector sliceTemplate(int origin, Vector v1) {
    ByteVector that = (ByteVector) v1;
    that.check(this);
    Objects.checkIndex(origin, length() + 1);
    VectorShuffle iota = iotaShuffle();
    VectorMask blendMask = iota.toVector().compare(VectorOperators.LT, (broadcast((byte)(length() - origin)))); [A]
    iota = iotaShuffle(origin, 1, true); [B]
    return that.rearrange(iota).blend(this.rearrange(iota), blendMask); [C]
}

The receiver for sliceTemplate is a 256-bit vector; the parser defers inlining of the toVector() API (see code at line A) and generates a Call IR returning an abstract vector. This abstract vector then virtually dispatches the compare API. The compiler observes multiple profile-based receiver types (128- and 256-bit byte vectors) for the compare API, and the parser generates a chain of PredictedCallGenerators for bi-morphically inlining it.

PredictedCallGenerators (Vector.compare)
PredictedCallGenerators (Byte256Vector.compare)
ParseGenerator (Byte256Vector.compare) [D]
UncommonTrap (receiver other than Byte256Vector)
PredictedCallGenerators (Byte128Vector.compare)
ParseGenerator (Byte128Vector.compare) [E]
UncommonTrap (receiver other than Byte128Vector) [F]
PredictedCallGenerators (UncommonTrap)
[converged state] = Merge JVM State originating from C and E [G]

Since the top-level receiver of sliceTemplate is Byte256Vector, while executing the call generator for Byte128Vector.compare (see code at line E) the compiler observes a mismatch between incoming argument species, i.e. one argument is a 256-bit vector while the other is a 128-bit vector, and throws an exception. At the state convergence point (see code at line G), since one of the control paths resulted in an exception, the compiler propagates the JVM state of the other control path, comprising a Byte256Mask, to the downstream graph after bookkeeping the pending exception state. Similar to the toVector API, iotaShuffle (see code at line B) is also lazily inlined and returns an abstract vector, which results in bi-morphic inlining of rearrange.
State convergence due to bi-morphic inlining of rearrange results in the generation of an abstract ByteVector (Phi Byte128Vector Byte256Vector), which further causes bi-morphic inlining of the blend API due to multiple profile-based receiver types. The Byte128Vector.blend [Java implementation](https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java#L412) explicitly casts the incoming mask (Byte256Mask) to the Byte128Mask type, and this leads to the creation of a [null value](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/graphKit.cpp#L1417), which causes a crash while unboxing the mask during inline expansion of blend. To be safe here, we relax the null-checking constraint during unboxing to disable intrinsification. All existing Vector API JTREG tests are passing with -XX:+StressIncrementalInlining at various AVX levels. Please review and share your feedback. Best Regards, Jatin ------------- Commit messages: - 8319889: Vector API tests trigger VM crashes with -XX:+StressIncrementalInlining Changes: https://git.openjdk.org/jdk/pull/18282/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18282&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8319889 Stats: 51 lines in 2 files changed: 3 ins; 46 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18282.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18282/head:pull/18282 PR: https://git.openjdk.org/jdk/pull/18282 From dchuyko at openjdk.org Wed Mar 13 17:31:32 2024 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Wed, 13 Mar 2024 17:31:32 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v31] In-Reply-To: References: Message-ID: > Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2).
The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack. > > A matching directive will be applied at method compilation time when such compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug and issues such a directive, but this does not affect the application behavior. In such a case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. > > It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypass inlined methods). > > The natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that, we can try to re-compile the method, letting the compile broker perform it taking the new directives stack into account. Re-compilation helps to prevent hot methods from executing in the interpreter. > > A new flag `-r` has been introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and puts methods that have any active non-default matching compiler directives up for re-compilation if possible, otherwise marks them for deoptimization.
There is currently no distinction as to which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. > > In addition, a new diagnostic command `Compiler.replace_directives` has been added for ... Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: No dots in -r descriptions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14111/files - new: https://git.openjdk.org/jdk/pull/14111/files/22b42347..36c30367 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14111&range=30 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14111&range=29-30 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/14111.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14111/head:pull/14111 PR: https://git.openjdk.org/jdk/pull/14111 From dchuyko at openjdk.org Wed Mar 13 17:34:24 2024 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Wed, 13 Mar 2024 17:34:24 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v29] In-Reply-To: References: Message-ID: <1AM7yClR1fxPAHevEwcSaNI8hP-KM2oVTwYT41pyEo0=.06098a55-7e89-4214-bcbe-faef2965f4df@github.com> On Wed, 13 Mar 2024 07:48:35 GMT, Serguei Spitsyn wrote: >> Dmitry Chuyko has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 47 commits: >> >> - Resolved master conflicts >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - ... and 37 more: https://git.openjdk.org/jdk/compare/782206bc...ff39ac12 > > src/hotspot/share/services/diagnosticCommand.cpp line 928: > >> 926: DCmdWithParser(output, heap), >> 927: _filename("filename", "Name of the directives file", "STRING", true), >> 928: _refresh("-r", "Refresh affected methods.", "BOOLEAN", false, "false") { > > Nit: The dot is not needed at the end, I think. The same applies to lines: 945, 970 and 987. Thanks, the dots were removed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14111#discussion_r1523664980 From vlivanov at openjdk.org Wed Mar 13 19:45:14 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 13 Mar 2024 19:45:14 GMT Subject: RFR: 8319889: Vector API tests trigger VM crashes with -XX:+StressIncrementalInlining In-Reply-To: <25mULcFaZsaLw0z88ywNaq3virkrimZa-rpgWO8d3Dc=.04909c6c-42f4-4981-895d-2d7f07b027d5@github.com> References: <25mULcFaZsaLw0z88ywNaq3virkrimZa-rpgWO8d3Dc=.04909c6c-42f4-4981-895d-2d7f07b027d5@github.com> Message-ID: On Wed, 13 Mar 2024 17:19:40 GMT, Jatin Bhateja wrote: > This bug fix patch fixes the following two issues: > > 1) Removing memory operand based masked shift instruction selection patterns.
As per Java specification section 15.19, the shift count is rounded to fit within the valid shift range by performing a bitwise AND with shift_mask; this results in the creation of an AndV IR node after loading the original mask into a vector. Existing patterns are not able to match this graph shape. Extending the pattern to cover the AndV IR would associate the memory operand with the And operation, and we would need to emit an additional vector AND instruction before the shift instruction; the existing memory operand pattern for AndV already handles such a graph shape. > > 2) A crash occurs due to the combined effect of bi-morphic inlining, exception handling, and randomized incremental inlining. In this case the top-level slice API is invoked using a concrete 256-bit vector. Due to randomized IncrementalInlining, some of the intermediate APIs within sliceTemplate are marked for lazy inlining; these APIs return an abstract vector which, when used for virtual dispatch of subsequent APIs, results in bi-morphic inlining on account of multiple profile-based receiver types. Consider the following code snippet.
>
>
> ByteVector sliceTemplate(int origin, Vector v1) {
>     ByteVector that = (ByteVector) v1;
>     that.check(this);
>     Objects.checkIndex(origin, length() + 1);
>     VectorShuffle iota = iotaShuffle();
>     VectorMask blendMask = iota.toVector().compare(VectorOperators.LT, (broadcast((byte)(length() - origin)))); [A]
>     iota = iotaShuffle(origin, 1, true); [B]
>     return that.rearrange(iota).blend(this.rearrange(iota), blendMask); [C]
> }
>
>
> The receiver for sliceTemplate is a 256-bit vector; the parser defers inlining of the toVector() API (see code at line A) and generates a Call IR returning an abstract vector. This abstract vector then virtually dispatches the compare API. The compiler observes multiple profile-based receiver types (128- and 256-bit byte vectors) for the compare API, and the parser generates a chain of PredictedCallGenerators for bi-morphically inlining it.
> > PredictedCallGenerators (Vector.compare)
> PredictedCallGenerators (Byte256Vector.compare)
> ParseGenerator (Byte256Vector.compare) [D]
> UncommonTrap (receiver other than Byte256Vector)
> PredictedCallGenerators (Byte128Vector.compare)
> ParseGenerator (Byte128Vector.compare) [E...

Overall, both fixes look good. I suggest handling the bugs separately (as two bug fixes). src/hotspot/share/opto/vectorIntrinsics.cpp line 164: > 162: Node* GraphKit::unbox_vector(Node* v, const TypeInstPtr* vbox_type, BasicType elem_bt, int num_elem, bool shuffle_to_vector) { > 163: assert(EnableVectorSupport, ""); > 164: const TypePtr* vbox_type_v = gvn().type(v)->isa_ptr(); You can use `isa_instptr()` and check for `nullptr` instead.

const TypeInstPtr* vbox_type_v = gvn().type(v)->isa_instptr();
if (vbox_type_v == nullptr || vbox_type->instance_klass() != vbox_type_v->instance_klass()) {
    return nullptr; // arguments don't agree on vector shapes
}

------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18282#pullrequestreview-1935065932 PR Review Comment: https://git.openjdk.org/jdk/pull/18282#discussion_r1523825589 From sspitsyn at openjdk.org Wed Mar 13 20:45:47 2024 From: sspitsyn at openjdk.org (Serguei Spitsyn) Date: Wed, 13 Mar 2024 20:45:47 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v31] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 16:55:28 GMT, Dmitry Chuyko wrote: >> src/hotspot/share/ci/ciEnv.cpp line 1144: >> >>> 1142: >>> 1143: if (entry_bci == InvocationEntryBci) { >>> 1144: if (TieredCompilation) { >> >> Just a naive question. Why has this check been removed? > > We want to allow replacement of a C2 method version by another C2 version of the same method in both tiered and non-tiered modes, which was not allowed Okay, thanks.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14111#discussion_r1523889164 From sspitsyn at openjdk.org Wed Mar 13 20:56:45 2024 From: sspitsyn at openjdk.org (Serguei Spitsyn) Date: Wed, 13 Mar 2024 20:56:45 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v31] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 17:31:32 GMT, Dmitry Chuyko wrote: >> Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack. >> >> A matching directive will be applied at method compilation time when such compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug and issues such a directive, but this does not affect the application behavior. In such a case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. >> >> It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypass inlined methods). >> >> The natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization.
Prior to that, we can try to re-compile the method, letting the compile broker perform it taking the new directives stack into account. Re-compilation helps to prevent hot methods from executing in the interpreter. >> >> A new flag `-r` has been introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and puts methods that have any active non-default matching compiler directives up for re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction as to which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. >> >> In addition, a new diagnostic command `Compiler.replace_directives... > Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: > No dots in -r descriptions The fix looks good. But I do not have expertise in the compiler-specific part. So, a review from the Compiler team is still required. ------------- Marked as reviewed by sspitsyn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/14111#pullrequestreview-1935188358 From dholmes at openjdk.org Wed Mar 13 22:56:41 2024 From: dholmes at openjdk.org (David Holmes) Date: Wed, 13 Mar 2024 22:56:41 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v5] In-Reply-To: References: Message-ID: <8vGIupwyAcYvKCUiWiQJpBIWNsHL3b0kjo4miCNiM4g=.ddc38eec-05ff-42d4-8007-37dca0a7169f@github.com> On Thu, 7 Mar 2024 14:05:18 GMT, Oussama Louati wrote: >> Oussama Louati has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix typo in error message in GenManyIndyIncorrectBootstrap.java > > I ran the JTreg test on this PR Head after full conversion of these tests, and nothing unusual happened, those aren't explicitly related to something else. @OssamaLouati thanks for the work you have put into doing this upgrade of the tests. That said, I do have a few concerns about this change, but let me start by asking you what testing you have performed using the Oracle CI infrastructure. We need to see a full tier 1 - 8 test run on all platforms to ensure this switch is not introducing new timeout failures or OOM conditions, due to the use of this new API. Our `-Xcomp` runs in particular may be adversely affected depending on the number of classes involved compared to ASM. This is difficult to review because we lack Hotspot engineers who know the new ClassFile API. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17834#issuecomment-1996042042 From ddong at openjdk.org Thu Mar 14 05:16:39 2024 From: ddong at openjdk.org (Denghui Dong) Date: Thu, 14 Mar 2024 05:16:39 GMT Subject: RFR: 8327661: C1: Make RBP allocatable on x64 when PreserveFramePointer is disabled [v3] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 06:49:30 GMT, Denghui Dong wrote: >> Hi, >> >> Could I have a review of this change that makes RBP allocatable in c1 register allocation when PreserveFramePointer is not enabled.
>> >> There seems no reason that RBP cannot be used. Although the performance of c1 jit code is not very critical, in my opinion, this change will not add overhead of compilation. So maybe it is acceptable. >> >> I am not very sure if I have changed all the places that should be. >> >> Testing: fastdebug tier1-4 on Linux x64 > > Denghui Dong has updated the pull request incrementally with one additional commit since the last revision: > > delete jmh src/hotspot/share/c1/c1_LinearScan.cpp line 5755: > 5753: bool LinearScanWalker::no_allocation_possible(Interval* cur) { > 5754: #ifdef X86 > 5755: #ifndef _LP64 rbp is callee-saved, so the following logic doesn't work. That'll slow down the allocation. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18167#discussion_r1524251207 From ksakata at openjdk.org Thu Mar 14 06:05:37 2024 From: ksakata at openjdk.org (Koichi Sakata) Date: Thu, 14 Mar 2024 06:05:37 GMT Subject: RFR: 8320404: Double whitespace in SubTypeCheckNode::dump_spec output In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 07:41:29 GMT, Koichi Sakata wrote: > This is a trivial change to remove an extra whitespace. > > A double whitespace is printed because method->print_short_name already adds a whitespace before the name. > > ### Test > > For testing, I modified the ProfileAtTypeCheck class to fail a test case and display the message. Specifically, I changed the number of the count element in the IR annotation below. > > > @Test > @IR(phase = { CompilePhase.AFTER_PARSING }, counts = { IRNode.SUBTYPE_CHECK, "1" }) > @IR(phase = { CompilePhase.AFTER_MACRO_EXPANSION }, counts = { IRNode.CMP_P, "5", IRNode.LOAD_KLASS_OR_NKLASS, "2", IRNode.PARTIAL_SUBTYPE_CHECK, "1" }) > public static void test15(Object o) { > > > This change was only for testing, so I reverted back to the original code after the test. 
> > #### Execution Result > > Before the change: > > $ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java" > ... > Failed IR Rules (1) of Methods (1) > ---------------------------------- > 1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, ap > plyIfAnd={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\d+(\s){2}(SubTypeCheck.*)+(\s){2}===.*)" > - Failed comparison: [found] 1 = 11 [given] > - Matched node: > * 53 SubTypeCheck === _ 44 35 [[ 58 ]] profiled at: compiler.c2.irTests.ProfileAtTypeCheck::test15:5 !jvms: ProfileAtTypeCheck::test15 @ bci:5 (line 399) > > > After the change: > > $ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java" > ... > Failed IR Rules (1) of Methods (1) > ---------------------------------- > 1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, ap > plyIfAnd={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Cons... Could someone please review this pull request? I'd like to have another reviewer. 
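The underlying pattern is easy to reproduce outside of C2: when a print helper already emits its own separator, a caller that also emits one produces a doubled space. A small sketch of the idea (the names here are illustrative, not the actual HotSpot code):

```java
public class DoubleSpaceDemo {
    // Mimics a helper like print_short_name that already prepends a space.
    static void printShortName(StringBuilder out, String name) {
        out.append(' ').append(name);
    }

    public static void main(String[] args) {
        StringBuilder buggy = new StringBuilder("profiled at:");
        buggy.append(' ');                  // caller adds its own separator...
        printShortName(buggy, "Foo::bar"); // ...and the helper adds another

        StringBuilder fixed = new StringBuilder("profiled at:");
        printShortName(fixed, "Foo::bar"); // rely on the helper's separator

        System.out.println(buggy); // profiled at:  Foo::bar (two spaces)
        System.out.println(fixed); // profiled at: Foo::bar
    }
}
```

The fix in the PR follows the second variant: drop the caller's extra separator and rely on the one print_short_name already emits.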
------------- PR Comment: https://git.openjdk.org/jdk/pull/18181#issuecomment-1996588423 From fyang at openjdk.org Thu Mar 14 06:55:38 2024 From: fyang at openjdk.org (Fei Yang) Date: Thu, 14 Mar 2024 06:55:38 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v6] In-Reply-To: References: Message-ID: <9165g_CT_MsS8iXGsBQyyqaHcfcLZB38Ivziz4Ix3TI=.3887b0ad-eec7-4456-9bb7-fb4a3e8802b1@github.com> On Wed, 13 Mar 2024 17:05:41 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review the patch to add support for some vector intrinsics? >> Also complement various tests on riscv. >> Thanks. >> >> ## Test >> test/hotspot/jtreg/compiler/vectorapi/ >> test/hotspot/jtreg/compiler/vectorization/ > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > remove unused instructions; rename instructions test/hotspot/jtreg/compiler/vectorapi/reshape/TestVectorCastRVV.java line 32: > 30: /* > 31: * @test > 32: * @bug 8259610 You might want to change this bug id. test/hotspot/jtreg/compiler/vectorapi/reshape/utils/TestCastMethods.java line 373: > 371: // to X 64 > 372: makePair(FSPEC64, ISPEC64), > 373: makePair(FSPEC64, ISPEC64, true), Does it make sense to specify `unsignedCast` to true when one of the operands is of type VectorSpecies? I don't see test items like this for other targets like aarch64 neon/sve. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1524314068 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1524313821 From chagedorn at openjdk.org Thu Mar 14 07:14:02 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 14 Mar 2024 07:14:02 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If Message-ID: This is a follow-up to the previous refactoring done in https://github.com/openjdk/jdk/pull/18080.
The patch starts to replace the usages of `create_bool_from_template_assertion_predicate()` by providing a refactored and fixed cloning algorithm.

#### How `create_bool_from_template_assertion_predicate()` Works

Currently, the algorithm in `create_bool_from_template_assertion_predicate()` uses an iterative DFS walk to find all nodes of a Template Assertion Predicate Expression in order to clone them. We do the following:

1. Follow all inputs if they could be a node that's part of a Template Assertion Predicate (compares opcodes): https://github.com/openjdk/jdk/blob/326c91e1a28ec70822ef927ee9ab17f79aa6d35c/src/hotspot/share/opto/loopTransform.cpp#L1513
2. Once we find an `OpaqueLoopInit` or `OpaqueLoopStride` node, we start backtracking in the DFS. While doing so, we start to clone all nodes on the path from the `OpaqueLoop*Nodes` node to the start node and already update the graph.

This logic is quite complex and difficult to understand since we do everything simultaneously. This was one of the reasons I originally tried to refactor this method in https://github.com/openjdk/jdk/pull/16877, because I needed to extend it for the full fix of Assertion Predicates in JDK-8288981.

#### Missing Visited Set

The current implementation of `create_bool_from_template_assertion_predicate()` does not use a visited set. This means that whenever we find a diamond shape, we could visit a node twice and re-discover all paths above this diamond again:

     ...
      |
      E
      |
      D
     / \
    B   C
     \ /
      A

DFS walk: A -> B -> D -> E -> ... -> C -> D -> E -> ...

With each diamond, the number of revisits of each node above doubles.

#### Endless DFS in Edge-Cases

In most cases, we would normally just stop quite quickly once we follow a data node that is not part of a Template Assertion Predicate Expression because the node opcode is different. However, in the test cases, we create a long chain of data nodes with many diamonds that could all be part of a Template Assertion Predicate Expression (i.e.
`is_part_of_template_assertion_predicate_bool()` would return true to follow the inputs in a DFS walk). As a result, the DFS revisits a lot of nodes, especially higher up in the graph, exponentially many times, and compilation is stuck for a long time (running the test cases results in a test timeout because background compilation is disabled).

#### New DFS Implementation

The new algorithm again uses an iterative DFS walk but uses a visited set to avoid this problem. The implementation is found in the new class `DataNodesOnPathToTargets`. It is written in a generic way such that it could potentially be reused at some point (i.e. using "source" and "target" instead of "opaque4" and "opaque loop nodes").

#### New Template Assertion Predicate Expression Cloning Algorithm

There is now a new class `TemplateAssertionPredicateExpression` that does the cloning of the Template Assertion Predicate Expression in the following way:

1. Collect nodes to clone with `DataNodesOnPathToTargets`.
2. Clone the collected nodes by reusing and extending `DataNodeGraph`.

#### Only Replacing Usages in Loop Unswitching and Split If

This patch only replaces the usages of `create_bool_from_template_assertion_predicate()` in Loop Unswitching and Split If, which need an identical copy of Template Assertion Predicate Expressions. In JDK-8327111, I will replace the remaining usages which require a transformation of the `OpaqueLoop*Nodes` by adding additional strategies which implement the `TransformStrategyForOpaqueLoopNodes` interface.
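To sketch why a visited set removes the exponential blow-up: once a node's "does it reach a target?" answer is memoized, the second path of a diamond reuses the answer instead of re-walking everything above it. This is not the HotSpot implementation (which walks C2 nodes iteratively); it is a self-contained illustration with an invented integer-ID graph encoding, where `inputs` maps a node to its input nodes and the walk is recursive for brevity:

```java
import java.util.*;

public class PathNodeCollector {
    // Collects all nodes lying on some path from 'source' (following input
    // edges upward) to any node in 'targets'. Thanks to the visited set,
    // each node is expanded at most once; its memoized result is final
    // because the graphs walked here are DAGs.
    static Set<Integer> collect(Map<Integer, List<Integer>> inputs,
                                int source, Set<Integer> targets) {
        Set<Integer> visited = new HashSet<>();
        Set<Integer> onPath = new HashSet<>();
        reaches(inputs, source, targets, visited, onPath);
        return onPath;
    }

    private static boolean reaches(Map<Integer, List<Integer>> inputs, int node,
                                   Set<Integer> targets, Set<Integer> visited,
                                   Set<Integer> onPath) {
        if (targets.contains(node)) {
            onPath.add(node);
            return true;
        }
        if (!visited.add(node)) {
            return onPath.contains(node); // memoized answer, no re-walk
        }
        boolean found = false;
        for (int in : inputs.getOrDefault(node, List.of())) {
            found |= reaches(inputs, in, targets, visited, onPath);
        }
        if (found) {
            onPath.add(node);
        }
        return found;
    }

    public static void main(String[] args) {
        // Diamond: node 4 has inputs 2 and 3, both reach target 1;
        // node 2 also has the off-path input 5 (with input 6).
        Map<Integer, List<Integer>> inputs = Map.of(
                4, List.of(2, 3),
                2, List.of(1, 5),
                3, List.of(1),
                5, List.of(6));
        System.out.println(new TreeSet<>(collect(inputs, 4, Set.of(1))));
        // prints [1, 2, 3, 4]
    }
}
```

Each node is expanded exactly once, so the chain-of-diamonds shapes from the test cases are handled in time linear in the graph size; `DataNodesOnPathToTargets` applies the same visited-set idea to C2's node graph.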
#### Other Work Left for https://github.com/openjdk/jdk/pull/16877 - Clean up `Split If` code to clone down Template Assertion Predicate Expressions - Removes `is_part_of_template_assertion_predicate_bool()` and `subgraph_has_opaque()` - More renaming and small refactoring Thanks, Christian ------------- Commit messages: - 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix pure cloning cases used for Loop Unswitching and Split If Changes: https://git.openjdk.org/jdk/pull/18293/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18293&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327110 Stats: 418 lines in 9 files changed: 407 ins; 0 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/18293.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18293/head:pull/18293 PR: https://git.openjdk.org/jdk/pull/18293 From epeter at openjdk.org Thu Mar 14 07:14:43 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 14 Mar 2024 07:14:43 GMT Subject: RFR: 8327423: C2 remove_main_post_loops: check if main-loop belongs to pre-loop, not just assert [v2] In-Reply-To: <-XVbBb-rqIblr2ytrCprcBv7Kg_TW4lpgeE8ZACVIuw=.e77a67bd-d038-4a08-9969-4ce6d3e27309@github.com> References: <-XVbBb-rqIblr2ytrCprcBv7Kg_TW4lpgeE8ZACVIuw=.e77a67bd-d038-4a08-9969-4ce6d3e27309@github.com> Message-ID: <9MmOxsPH9fmeyU5VCwyxfTSSVVTNuIzJhv3IYZ6zET8=.269f6889-7c0e-4503-a516-5f9d16c63015@github.com> On Tue, 12 Mar 2024 08:35:02 GMT, Roland Westrelin wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/hotspot/jtreg/compiler/loopopts/TestEmptyPreLoopForDifferentMainLoop.java >> >> Co-authored-by: Christian Hagedorn > > Looks good to me. @rwestrel @chhagedorn @vnkozlov thanks for the reviews! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18200#issuecomment-1996707419 From epeter at openjdk.org Thu Mar 14 07:14:44 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 14 Mar 2024 07:14:44 GMT Subject: Integrated: 8327423: C2 remove_main_post_loops: check if main-loop belongs to pre-loop, not just assert In-Reply-To: References: Message-ID: <_ImYIDVeQUIpMQQyzdCT-Fkfe4J67_mWZhQ9_hiJ8Xw=.6af27b31-b29b-405d-aba1-54c7a3120ee5@github.com> On Mon, 11 Mar 2024 15:56:38 GMT, Emanuel Peter wrote: > The assert was added in [JDK-8085832](https://bugs.openjdk.org/browse/JDK-8085832) (JDK9), by @rwestrel . And in [JDK-8297724](https://bugs.openjdk.org/browse/JDK-8297724) (JDK21), he made more empty loops be removed, and since then the attached regression test fails. > > ---------- > > **Problem** > > By the time we get to the assert, we have already had a series of Pre-Main-Post, unroll and empty-loop removal: > the PURPLE main and post loops were already removed earlier as empty loops. > > At the time of the assert, the graph looks like this: > ![image](https://github.com/openjdk/jdk/assets/32593061/cb36eda4-0684-4b79-8557-0fdd5973ab50) > > We are in `IdealLoopTree::remove_main_post_loops` with the PURPLE `298 CountedLoop` as the `cl` pre-loop. > > The loop-tree looks essentially like this:
>
> (rr) p _ltree_root->dump()
> Loop: N0/N0 has_sfpt
> Loop: N425/N431 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre sfpts={ 429 }
> Loop: N298/N301 profile_predicated predicated counted [0,int),+1 (4 iters) pre
> Loop: N200/N179 counted [int,100),+1 (2147483648 iters) main sfpts={ 171 }
> Loop: N398/N404 counted [int,100),+1 (4 iters) post sfpts={ 402 }
>
> This is basically:
>
> 415 pre orange
> 298 pre PURPLE
> 200 main orange
> 398 post orange
>
> From `298 pre PURPLE`, we try to find its main-loop, by looking at the `_next` info in the loop-tree.
> There, we find `200 main orange`, it is a main-loop that still has a pre-loop... > ...but not the same pre-loop as `cl` -> the `assert` fires. > > It seems that we assume in the code, that we can check the `_next->_head`, and if: > 1) it is a main-loop and > 2) that main-loop still has a pre-loop > then the current pre-loop "cl" must be the pre-loop of that found main-loop, i.e. `_pre_from_main(main_head)`. > But this is NOT generally guaranteed by "PhaseIdealLoop::build_loop_tree". > > The loop-tree is correct here, and this is how it was arrived at: > "415 CountedLoop" (pre orange) is visited, and its body traversed. "427 If" is traversed. Now the path splits. > If we first took the "428 IfFalse" path, then we would visit "200 CountedLoop" (main orange), and "398 CountedLoop" (post orange) first. > But we instead take "432 IfTrue" first, and hence visit "298 CountedLoop" (pre PURPLE) first. > > So depending on what turn we take at this "427 If", we either get the order: > > > 415 pre orange > 298 pre PURPLE > 200 main orange > 398 post orange > > (the one we get, and assert with) > > OR > > > 415 pre orange > 200 main orange > 398 post orange > 298 pre PURPLE > > (assert would not tr... This pull request has now been integrated.
Changeset: fadc4b19 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/fadc4b197e927cfa1814fe6cb65ee04b3bd4b0c2 Stats: 64 lines in 2 files changed: 61 ins; 2 del; 1 mod 8327423: C2 remove_main_post_loops: check if main-loop belongs to pre-loop, not just assert Reviewed-by: kvn, chagedorn, roland ------------- PR: https://git.openjdk.org/jdk/pull/18200 From tholenstein at openjdk.org Thu Mar 14 08:41:50 2024 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 14 Mar 2024 08:41:50 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v31] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 17:31:32 GMT, Dmitry Chuyko wrote: >> Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack. >> >> A matching directive will be applied at method compilation time when such compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug, issues such a directive but this does not affect the application behavior. In such case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. >> >> It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypass inlined methods). 
>> >> Natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that we can try to re-compile the method letting compile broker to perform it taking new directives stack into account. Re-compilation helps to prevent hot methods from executing in the interpreter. >> >> A new flag `-r` has been introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and puts methods that have any active non-default matching compiler directives to re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. >> >> In addition, a new diagnostic command `Compiler.replace_directives... > > Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: > > No dots in -r descriptions Looks good to me too (Compiler Team) ------------- Marked as reviewed by tholenstein (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/14111#pullrequestreview-1936031298 From shade at openjdk.org Thu Mar 14 09:21:38 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 14 Mar 2024 09:21:38 GMT Subject: RFR: 8325613: CTW: Stale method cleanup requires GC after Sweeper removal In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 19:11:42 GMT, Aleksey Shipilev wrote: > See more details in the bug.
There is a double-whammy from two issues: a) Sweeper was removed, and now the cleanup work is done during GC, which does not really happen as CTW barely allocates anything; b) CTW calls for explicit deoptimization often, at which point CTW threads get mostly busy at spin-waiting-yielding for deopt epoch to move (that is why you see lots of `sys%`). (a) leads to stale methods buildup, which makes (b) progressively worse. > > This PR adds explicit GC calls to CTW runner. Since CTW allocates and retains a little, those GCs are quite fast. I chose the threshold by running some CTW tests on my machines. I think we are pretty flat in 25..100 region, so I chose the higher threshold for additional safety. > > This patch improves both CPU and wall times for CTW testing dramatically, as you can see from the logs below. It still does not recuperate completely to JDK 17 levels, but at least it is not regressing as badly. > > > --- x86_64 EC2, applications/ctw/modules CTW > > jdk17u-dev: 4511.54s user 169.43s system 1209% cpu 6:27.07 total > current mainline: 11678.13s user 8687.06s system 2299% cpu 14:45.62 total > > GC every 25 methods: 5050.83s user 670.38s system 1629% cpu 5:51.04 total > GC every 50 methods: 4965.41s user 709.64s system 1670% cpu 5:39.77 total > GC every 100 methods: 4997.34s user 782.12s system 1680% cpu 5:43.99 total > GC every 200 methods: 5237.76s user 943.51s system 1788% cpu 5:45.59 total > GC every 400 methods: 5851.24s user 1443.16s system 1914% cpu 6:20.99 total > GC every 800 methods: 7010.06s user 2649.35s system 2079% cpu 7:44.48 total > GC every 1600 methods: 9361.12s user 5616.84s system 2409% cpu 10:21.68 total > > --- Mac M1, applications/ctw/modules/java.base CTW > > jdk17u-dev: 171.93s user 25.33s system 157% cpu 2:05.34 total > current mainline: 1128.69s user 349.46s system 249% cpu 9:52.51 total > > GC every 25 methods: 252.31s user 29.98s system 172% cpu 2:43.68 total > GC every 50 methods: 232.53s user 28.49s system 170% cpu 2:32.69
total > GC every 100 methods: 237.38s user 34.53s system 169% cpu 2:40.54 total > GC every 200 methods: 251.70s user 39.60s system 172% cpu 2:48.40 total > GC every 400 methods: 271.50s user 42.55s system 185% cpu 2:49.66 total > GC every 800 methods: 389.51s user 69.41s system 204% cpu 3:44.01 total > GC every 1600 methods: 660.98s user 169.97s system 229% cpu 6:01.78 total Thanks! Any additional reviews, maybe @TobiHartmann or @chhagedorn ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/18249#issuecomment-1996994554 From dchuyko at openjdk.org Thu Mar 14 09:22:47 2024 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Thu, 14 Mar 2024 09:22:47 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v31] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 17:31:32 GMT, Dmitry Chuyko wrote: >> Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack. >> >> A matching directive will be applied at method compilation time when such compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug, issues such a directive but this does not affect the application behavior. In such case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. 
>> >> It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypass inlined methods). >> >> Natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that we can try to re-compile the method letting compile broker to perform it taking new directives stack into account. Re-compilation helps to prevent hot methods from executing in the interpreter. >> >> A new flag `-r` has been introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and puts methods that have any active non-default matching compiler directives to re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. >> >> In addition, a new diagnostic command `Compiler.replace_directives... > > Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: > > No dots in -r descriptions Thank you, Serguei and Tobias.
------------- PR Comment: https://git.openjdk.org/jdk/pull/14111#issuecomment-1996995661 From dchuyko at openjdk.org Thu Mar 14 09:26:00 2024 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Thu, 14 Mar 2024 09:26:00 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v32] In-Reply-To: References: Message-ID: > Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack. > > A matching directive will be applied at method compilation time when such compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug, issues such a directive but this does not affect the application behavior. In such case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. > > It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypass inlined methods). > > Natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that we can try to re-compile the method letting compile broker to perform it taking new directives stack into account. 
Re-compilation helps to prevent hot methods from executing in the interpreter. > > A new flag `-r` has been introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and puts methods that have any active non-default matching compiler directives to re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. > > In addition, a new diagnostic command `Compiler.replace_directives` has been added for ... Dmitry Chuyko has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 50 commits: - Merge branch 'openjdk:master' into compiler-directives-force-update - No dots in -r descriptions - Merge branch 'openjdk:master' into compiler-directives-force-update - Resolved master conflicts - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - ...
and 40 more: https://git.openjdk.org/jdk/compare/49ce85fa...eb4ed2ea ------------- Changes: https://git.openjdk.org/jdk/pull/14111/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14111&range=31 Stats: 381 lines in 15 files changed: 348 ins; 3 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/14111.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14111/head:pull/14111 PR: https://git.openjdk.org/jdk/pull/14111 From mli at openjdk.org Thu Mar 14 09:30:10 2024 From: mli at openjdk.org (Hamlin Li) Date: Thu, 14 Mar 2024 09:30:10 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v6] In-Reply-To: <9165g_CT_MsS8iXGsBQyyqaHcfcLZB38Ivziz4Ix3TI=.3887b0ad-eec7-4456-9bb7-fb4a3e8802b1@github.com> References: <9165g_CT_MsS8iXGsBQyyqaHcfcLZB38Ivziz4Ix3TI=.3887b0ad-eec7-4456-9bb7-fb4a3e8802b1@github.com> Message-ID: On Thu, 14 Mar 2024 06:47:42 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> remove unused instructions; rename instructions > > test/hotspot/jtreg/compiler/vectorapi/reshape/TestVectorCastRVV.java line 32: > >> 30: /* >> 31: * @test >> 32: * @bug 8259610 > > You might want to change this bug id. fixed. > test/hotspot/jtreg/compiler/vectorapi/reshape/utils/TestCastMethods.java line 373: > >> 371: // to X 64 >> 372: makePair(FSPEC64, ISPEC64), >> 373: makePair(FSPEC64, ISPEC64, true), > > Does it make sense to specify `unsignedCast` to true when one of the operand is of type VectorSpecies? I don't see test items like this for other targets like aarch64 neon/sve. Thanks for catching, fixed. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1524521620 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1524521562 From mli at openjdk.org Thu Mar 14 09:30:09 2024 From: mli at openjdk.org (Hamlin Li) Date: Thu, 14 Mar 2024 09:30:09 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v7] In-Reply-To: References: Message-ID: > Hi, > Can you help to review the patch to add support for some vector intrinsics? > Also complement various tests on riscv. > Thanks. > > ## Test > test/hotspot/jtreg/compiler/vectorapi/ > test/hotspot/jtreg/compiler/vectorization/ Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: remove unused test cases; fix bug id ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18040/files - new: https://git.openjdk.org/jdk/pull/18040/files/3fb61768..7844a987 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=05-06 Stats: 29 lines in 2 files changed: 0 ins; 27 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18040.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18040/head:pull/18040 PR: https://git.openjdk.org/jdk/pull/18040 From fyang at openjdk.org Thu Mar 14 09:37:42 2024 From: fyang at openjdk.org (Fei Yang) Date: Thu, 14 Mar 2024 09:37:42 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v7] In-Reply-To: References: Message-ID: On Thu, 14 Mar 2024 09:30:09 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review the patch to add support for some vector intrinsics? >> Also complement various tests on riscv. >> Thanks. >> >> ## Test >> test/hotspot/jtreg/compiler/vectorapi/ >> test/hotspot/jtreg/compiler/vectorization/ > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > remove unused test cases; fix bug id Updated change LGTM. Thanks. 
------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18040#pullrequestreview-1936159000 From chagedorn at openjdk.org Thu Mar 14 09:48:39 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 14 Mar 2024 09:48:39 GMT Subject: RFR: 8325613: CTW: Stale method cleanup requires GC after Sweeper removal In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 19:11:42 GMT, Aleksey Shipilev wrote: > See more details in the bug. There is a double-whammy from two issues: a) Sweeper was removed, and now the cleanup work is done during GC, which does not really happen as CTW barely allocates anything; b) CTW calls for explicit deoptimization often, at which point CTW threads get mostly busy at spin-waiting-yielding for deopt epoch to move (that is why you see lots of `sys%`). (a) leads to stale methods buildup, which makes (b) progressively worse. > > This PR adds explicit GC calls to CTW runner. Since CTW allocates and retains a little, those GCs are quite fast. I chose the threshold by running some CTW tests on my machines. I think we are pretty flat in 25..100 region, so I chose the higher threshold for additional safety. > > This patch improves both CPU and wall times for CTW testing dramatically, as you can see from the logs below. It still does not recuperate completely to JDK 17 levels, but at least it is not regressing as badly.
> > > --- x86_64 EC2, applications/ctw/modules CTW > > jdk17u-dev: 4511.54s user 169.43s system 1209% cpu 6:27.07 total > current mainline: 11678.13s user 8687.06s system 2299% cpu 14:45.62 total > > GC every 25 methods: 5050.83s user 670.38s system 1629% cpu 5:51.04 total > GC every 50 methods: 4965.41s user 709.64s system 1670% cpu 5:39.77 total > GC every 100 methods: 4997.34s user 782.12s system 1680% cpu 5:43.99 total > GC every 200 methods: 5237.76s user 943.51s system 1788% cpu 5:45.59 total > GC every 400 methods: 5851.24s user 1443.16s system 1914% cpu 6:20.99 total > GC every 800 methods: 7010.06s user 2649.35s system 2079% cpu 7:44.48 total > GC every 1600 methods: 9361.12s user 5616.84s system 2409% cpu 10:21.68 total > > --- Mac M1, applications/ctw/modules/java.base CTW > > jdk17u-dev: 171.93s user 25.33s system 157% cpu 2:05.34 total > current mainline: 1128.69s user 349.46s system 249% cpu 9:52.51 total > > GC every 25 methods: 252.31s user 29.98s system 172% cpu 2:43.68 total > GC every 50 methods: 232.53s user 28.49s system 170% cpu 2:32.69 total > GC every 100 methods: 237.38s user 34.53s system 169% cpu 2:40.54 total > GC every 200 methods: 251.70s user 39.60s system 172% cpu 2:48.40 total > GC every 400 methods: 271.50s user 42.55s system 185% cpu 2:49.66 total > GC every 800 methods: 389.51s user 69.41s system 204% cpu 3:44.01 total > GC every 1600 methods: 660.98s user 169.97s system 229% cpu 6:01.78 total Looks reasonable to me, too. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18249#pullrequestreview-1936183471 From shade at openjdk.org Thu Mar 14 10:29:49 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 14 Mar 2024 10:29:49 GMT Subject: RFR: 8325613: CTW: Stale method cleanup requires GC after Sweeper removal In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 19:11:42 GMT, Aleksey Shipilev wrote: > See more details in the bug. 
There is a double-whammy from two issues: a) Sweeper was removed, and now the cleanup work is done during GC, which does not really happen as CTW barely allocates anything; b) CTW calls for explicit deoptimization often, at which point CTW threads get mostly busy at spin-waiting-yielding for deopt epoch to move (that is why you see lots of `sys%`). (a) leads to stale methods buildup, which makes (b) progressively worse. > > This PR adds explicit GC calls to CTW runner. Since CTW allocates and retains a little, those GCs are quite fast. I chose the threshold by running some CTW tests on my machines. I think we are pretty flat in 25..100 region, so I chose the higher threshold for additional safety. > > This patch improves both CPU and wall times for CTW testing dramatically, as you can see from the logs below. It still does not recuperate completely to JDK 17 levels, but it least it is not regressing as badly. > > > --- x86_64 EC2, applications/ctw/modules CTW > > jdk17u-dev: 4511.54s user 169.43s system 1209% cpu 6:27.07 total > current mainline: 11678.13s user 8687.06s system 2299% cpu 14:45.62 total > > GC every 25 methods: 5050.83s user 670.38s system 1629% cpu 5:51.04 total > GC every 50 methods: 4965.41s user 709.64s system 1670% cpu 5:39.77 total > GC every 100 methods: 4997.34s user 782.12s system 1680% cpu 5:43.99 total > GC every 200 methods: 5237.76s user 943.51s system 1788% cpu 5:45.59 total > GC every 400 methods: 5851.24s user 1443.16s system 1914% cpu 6:20.99 total > GC every 800 methods: 7010.06s user 2649.35s system 2079% cpu 7:44.48 total > GC every 1600 methods: 9361.12s user 5616.84s system 2409% cpu 10:21.68 total > > --- Mac M1, applications/ctw/modules/java.base CTW > > jdk17u-dev: 171.93s user 25.33s system 157% cpu 2:05.34 total > current mainline: 1128.69s user 349.46s system 249% cpu 9:52.51 total > > GC every 25 methods: 252.31s user 29.98s system 172% cpu 2:43.68 total > GC every 50 methods: 232.53s user 28.49s system 170% cpu 2:32.69 
total > GC every 100 methods: 237.38s user 34.53s system 169% cpu 2:40.54 total > GC every 200 methods: 251.70s user 39.60s system 172% cpu 2:48.40 total > GC every 400 methods: 271.50s user 42.55s system 185% cpu 2:49.66 total > GC every 800 methods: 389.51s user 69.41s system 204% cpu 3:44.01 total > GC every 1600 methods: 660.98s user 169.97s system 229% cpu 6:01.78 total All right, thanks! I checked that both fastdebug and release binaries work well with java.base tests too. It also improves large CTW run times significantly. We are able to CTW 130K JARs in 24 hours now, about 3x improvement. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18249#issuecomment-1997118203 From shade at openjdk.org Thu Mar 14 10:29:49 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 14 Mar 2024 10:29:49 GMT Subject: Integrated: 8325613: CTW: Stale method cleanup requires GC after Sweeper removal In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 19:11:42 GMT, Aleksey Shipilev wrote: > See more details in the bug. There is a double-whammy from two issues: a) Sweeper was removed, and now the cleanup work is done during GC, which does not really happen as CTW barely allocates anything; b) CTW calls for explicit deoptimization often, at which point CTW threads get mostly busy at spin-waiting-yielding for deopt epoch to move (that is why you see lots of `sys%`). (a) leads to stale methods buildup, which makes (b) progressively worse. > > This PR adds explicit GC calls to CTW runner. Since CTW allocates and retains a little, those GCs are quite fast. I chose the threshold by running some CTW tests on my machines. I think we are pretty flat in 25..100 region, so I chose the higher threshold for additional safety. > > This patch improves both CPU and wall times for CTW testing dramatically, as you can see from the logs below. It still does not recuperate completely to JDK 17 levels, but at least it is not regressing as badly.
> > > --- x86_64 EC2, applications/ctw/modules CTW > > jdk17u-dev: 4511.54s user 169.43s system 1209% cpu 6:27.07 total > current mainline: 11678.13s user 8687.06s system 2299% cpu 14:45.62 total > > GC every 25 methods: 5050.83s user 670.38s system 1629% cpu 5:51.04 total > GC every 50 methods: 4965.41s user 709.64s system 1670% cpu 5:39.77 total > GC every 100 methods: 4997.34s user 782.12s system 1680% cpu 5:43.99 total > GC every 200 methods: 5237.76s user 943.51s system 1788% cpu 5:45.59 total > GC every 400 methods: 5851.24s user 1443.16s system 1914% cpu 6:20.99 total > GC every 800 methods: 7010.06s user 2649.35s system 2079% cpu 7:44.48 total > GC every 1600 methods: 9361.12s user 5616.84s system 2409% cpu 10:21.68 total > > --- Mac M1, applications/ctw/modules/java.base CTW > > jdk17u-dev: 171.93s user 25.33s system 157% cpu 2:05.34 total > current mainline: 1128.69s user 349.46s system 249% cpu 9:52.51 total > > GC every 25 methods: 252.31s user 29.98s system 172% cpu 2:43.68 total > GC every 50 methods: 232.53s user 28.49s system 170% cpu 2:32.69 total > GC every 100 methods: 237.38s user 34.53s system 169% cpu 2:40.54 total > GC every 200 methods: 251.70s user 39.60s system 172% cpu 2:48.40 total > GC every 400 methods: 271.50s user 42.55s system 185% cpu 2:49.66 total > GC every 800 methods: 389.51s user 69.41s system 204% cpu 3:44.01 total > GC every 1600 methods: 660.98s user 169.97s system 229% cpu 6:01.78 total This pull request has now been integrated. 
Changeset: 1281e18f Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/1281e18f1447848d7eb5e3bde508ac002b4c390d Stats: 26 lines in 2 files changed: 24 ins; 1 del; 1 mod 8325613: CTW: Stale method cleanup requires GC after Sweeper removal Reviewed-by: roland, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/18249 From mli at openjdk.org Thu Mar 14 11:23:56 2024 From: mli at openjdk.org (Hamlin Li) Date: Thu, 14 Mar 2024 11:23:56 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v6] In-Reply-To: References: Message-ID: > Hi, > Can you have a look at this patch adding some tests for Math.round intrinsics? > Thanks! > > ### FYI: > During the development of RoundVF/RoundF, we faced issues which were only spotted by running tests exhaustively against the 32/64-bit range of int/long. > It's helpful to add these exhaustive tests in jdk for future possible usage, rather than building them every time they are needed. > Of course, we need to put it in `manual` mode, so it's not run when the `-automatic` jtreg option is specified, which I guess is the mode CI uses; please correct me if I assume incorrectly. Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains ten additional commits since the last revision: - Fix test failure in TestRoundVectorDoubleRandom.java - Merge branch 'master' into round-v-exhaustive-tests - add comments; refine code - refine code; fix bug - Merge branch 'master' into round-v-exhaustive-tests - fix issue - mv tests - use IR framework to construct the random tests - Initial commit ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17753/files - new: https://git.openjdk.org/jdk/pull/17753/files/2afa8160..3f50c062 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17753&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17753&range=04-05 Stats: 98690 lines in 2066 files changed: 16605 ins; 76003 del; 6082 mod Patch: https://git.openjdk.org/jdk/pull/17753.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17753/head:pull/17753 PR: https://git.openjdk.org/jdk/pull/17753 From mli at openjdk.org Thu Mar 14 11:25:51 2024 From: mli at openjdk.org (Hamlin Li) Date: Thu, 14 Mar 2024 11:25:51 GMT Subject: Integrated: 8321021: RISC-V: C2 VectorUCastB2X In-Reply-To: References: Message-ID: <9R07HBfLtn-1M6gLoa-_-a1fWEYjrXWvyZjM_2mNIp0=.507eeabf-4b5a-4364-8326-1e05fdc481a2@github.com> On Wed, 28 Feb 2024 11:07:39 GMT, Hamlin Li wrote: > Hi, > Can you help to review the patch to add support for some vector intrinsics? > Also complement various tests on riscv. > Thanks. > > ## Test > test/hotspot/jtreg/compiler/vectorapi/ > test/hotspot/jtreg/compiler/vectorization/ This pull request has now been integrated. 
Changeset: 1d34b74a Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/1d34b74a64fba8d0d58dcbccc416379a4c915738 Stats: 639 lines in 5 files changed: 605 ins; 11 del; 23 mod 8321021: RISC-V: C2 VectorUCastB2X 8321023: RISC-V: C2 VectorUCastS2X 8321024: RISC-V: C2 VectorUCastI2X Reviewed-by: fyang ------------- PR: https://git.openjdk.org/jdk/pull/18040 From mli at openjdk.org Thu Mar 14 11:25:49 2024 From: mli at openjdk.org (Hamlin Li) Date: Thu, 14 Mar 2024 11:25:49 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v7] In-Reply-To: References: Message-ID: <9tNyag92zj8kR4JWbk3KCRMMqfTNwzEFu51vumlQPUI=.85c89672-a6bc-4ef2-a4b9-3faeea6e82d0@github.com> On Thu, 14 Mar 2024 09:35:02 GMT, Fei Yang wrote: > Updated change LGTM. Thanks. Thanks @RealFYang @zifeihan for your reviewing! ------------- PR Comment: https://git.openjdk.org/jdk/pull/18040#issuecomment-1997222178 From mli at openjdk.org Thu Mar 14 11:41:52 2024 From: mli at openjdk.org (Hamlin Li) Date: Thu, 14 Mar 2024 11:41:52 GMT Subject: RFR: 8321010: RISC-V: C2 RoundVF [v3] In-Reply-To: References: Message-ID: > Hi, > Can you have a review on this patch to add RoundVF/RoundDF intrinsics? > Thanks! > > ## Tests > > test/hotspot/jtreg/compiler/vectorization/TestRoundVectRiscv64.java test/hotspot/jtreg/compiler/c2/cr6340864/TestFloatVect.java test/hotspot/jtreg/compiler/c2/cr6340864/TestDoubleVect.java test/hotspot/jtreg/compiler/floatingpoint/TestRound.java > > test/jdk/java/lang/Math/RoundTests.java Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains ten commits: - merge master - fix space - add tests - add test cases - v2: (src + 0.5) + rdn - Fix corner cases - Merge branch 'master' into round-F+D-v - refine code - RoundVF/D: Initial commit ------------- Changes: https://git.openjdk.org/jdk/pull/17745/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17745&range=02 Stats: 234 lines in 7 files changed: 230 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/17745.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17745/head:pull/17745 PR: https://git.openjdk.org/jdk/pull/17745 From galder at openjdk.org Thu Mar 14 12:22:58 2024 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 14 Mar 2024 12:22:58 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v6] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 18:16:24 GMT, Dean Long wrote: > IR expansion in append_alloc_array_copy() looks unconditional. What's going to happen on platforms with no back-end support? I might be wrong but the way I understood the code, I think other platforms will have no issue with the way the current code works: The check I added in `Compiler::is_intrinsic_supported()` means that for clone calls on other platforms it would return false. If that returns false, `AbstractCompiler::is_intrinsic_available` will return false. Then this means that in `GraphBuilder::try_inline_intrinsics` `is_available` would be false, in which case the method will always return false and `build_graph_for_intrinsic` will not be called. `GraphBuilder::append_alloc_array_copy` is called from `build_graph_for_intrinsic`, so I don't see a danger of that being called for non-supported platforms.
------------- PR Comment: https://git.openjdk.org/jdk/pull/17667#issuecomment-1997319836 From dchuyko at openjdk.org Thu Mar 14 12:41:53 2024 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Thu, 14 Mar 2024 12:41:53 GMT Subject: Integrated: 8309271: A way to align already compiled methods with compiler directives In-Reply-To: References: Message-ID: On Wed, 24 May 2023 00:38:27 GMT, Dmitry Chuyko wrote: > Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack. > > A matching directive will be applied at method compilation time when such compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug and issues such a directive, but this does not affect the application behavior. In such a case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. > > It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypassing inlined methods). > > The natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that, we can try to re-compile the method, letting the compile broker perform it while taking the new directives stack into account.
Re-compilation helps to prevent hot methods from executing in the interpreter. > > A new flag `-r` has been introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and puts methods that have any active non-default matching compiler directives to re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction as to which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. > > In addition, a new diagnostic command `Compiler.replace_directives` has been added for ... This pull request has now been integrated. Changeset: c879627d Author: Dmitry Chuyko URL: https://git.openjdk.org/jdk/commit/c879627dbd7e9295d44f19ef237edb5de10805d5 Stats: 381 lines in 15 files changed: 348 ins; 3 del; 30 mod 8309271: A way to align already compiled methods with compiler directives Reviewed-by: apangin, sspitsyn, tholenstein ------------- PR: https://git.openjdk.org/jdk/pull/14111 From mbaesken at openjdk.org Thu Mar 14 12:50:46 2024 From: mbaesken at openjdk.org (Matthias Baesken) Date: Thu, 14 Mar 2024 12:50:46 GMT Subject: RFR: JDK-8328165: improve assert(idx < _maxlrg) failed: oob Message-ID: The assert in chaitin.hpp assert(idx < _maxlrg) failed: oob could be improved, it should show more information.
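The improvement asked for above is to have the failing assertion print the offending values, not just the condition text. A hypothetical Java analogue of that idea (not the actual chaitin.hpp change, whose exact message text is in the PR diff):

```java
// Hypothetical analogue of the improved out-of-bounds assert: the error
// message carries the index and the limit, so a crash log shows *which*
// index overflowed instead of only "idx < _maxlrg".
public class OobAssert {
    static void checkIndex(int idx, int maxlrg) {
        if (idx >= maxlrg) {
            throw new AssertionError("oob: index " + idx + " not smaller than _maxlrg " + maxlrg);
        }
    }

    public static void main(String[] args) {
        checkIndex(3, 10); // in bounds: no error
        try {
            checkIndex(12, 10);
        } catch (AssertionError e) {
            System.out.println(e.getMessage()); // oob: index 12 not smaller than _maxlrg 10
        }
    }
}
```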
------------- Commit messages: - JDK-8328165 Changes: https://git.openjdk.org/jdk/pull/18302/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18302&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8328165 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18302.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18302/head:pull/18302 PR: https://git.openjdk.org/jdk/pull/18302 From mdoerr at openjdk.org Thu Mar 14 13:08:38 2024 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 14 Mar 2024 13:08:38 GMT Subject: RFR: JDK-8328165: improve assert(idx < _maxlrg) failed: oob In-Reply-To: References: Message-ID: On Thu, 14 Mar 2024 12:45:20 GMT, Matthias Baesken wrote: > The assert in chaitin.hpp > assert(idx < _maxlrg) failed: oob > could be improved, it should show more information. LGTM. ------------- Marked as reviewed by mdoerr (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18302#pullrequestreview-1936648175 From roland at openjdk.org Thu Mar 14 14:28:01 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 14 Mar 2024 14:28:01 GMT Subject: RFR: 8308660: C2 compilation hits 'node must be dead' assert Message-ID: In `IfNode::fold_compares_helper()`, `adjusted_val` is: (SubI (AddI top constant) 0) which is then transformed to the `top` node. The code next tries to destroy the `adjusted_val` node, i.e. the `top` node. That results in the assert failure. Given we're trying to fold 2 ifs in a dying part of the graph, the fix is straightforward: test `adjusted_val` for top and bail out from the transformation if that's the case.
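The shape of the bail-out described in this message can be sketched with a toy model. This is illustrative Java, not the real C2 `Node`/IGVN API: `TOP` stands in for C2's top node, and `transform` stands in for the IGVN step that folds `(SubI (AddI top constant) 0)` down to top.

```java
// Toy sketch of the fix: when the transformed value folds to top, bail out
// instead of continuing (and trying to destroy the top node).
public class TopBailOut {
    static final Object TOP = new Object(); // stands in for C2's top node

    // stands in for igvn.transform(): in the failing case the folding yields top
    static Object transform(Object node) {
        return TOP;
    }

    // sketch of the fixed fold_compares_helper()
    static boolean foldCompares(Object adjustedVal) {
        Object folded = transform(adjustedVal);
        if (folded == TOP) {
            return false; // dying part of the graph: do not destroy the top node
        }
        // ... the rest of the if-folding transformation would run here ...
        return true;
    }

    public static void main(String[] args) {
        System.out.println(foldCompares(new Object())); // false
    }
}
```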
------------- Commit messages: - whitespaces - fix & test Changes: https://git.openjdk.org/jdk/pull/18305/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18305&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8308660 Stats: 70 lines in 2 files changed: 70 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18305.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18305/head:pull/18305 PR: https://git.openjdk.org/jdk/pull/18305 From duke at openjdk.org Thu Mar 14 15:13:42 2024 From: duke at openjdk.org (Oussama Louati) Date: Thu, 14 Mar 2024 15:13:42 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v5] In-Reply-To: References: Message-ID: On Thu, 7 Mar 2024 14:05:18 GMT, Oussama Louati wrote: >> Oussama Louati has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix typo in error message in GenManyIndyIncorrectBootstrap.java > > I ran the JTreg test on this PR Head after full conversion of these tests, and nothing unusual happened; those results aren't related to something else. > @OssamaLouati thanks for the work you have put into doing this upgrade of the tests. That said I do have a few concerns about this change, but let me start by asking you what testing you have performed using the Oracle CI infrastructure? We need to see a full tier 1 - 8 test run on all platforms to ensure this switch is not introducing new timeout failures or OOM conditions, due to the use of this new API. Our `-Xcomp` runs in particular may be adversely affected depending on the number of classes involved compared to ASM. > > This is difficult to review because we lack Hotspot engineers who know the new ClassFile API. I started running the full tier1-8 tests on mach5; I will wait until the jobs finish and update the OpenJDK bug with a confidential comment containing the link.
------------- PR Comment: https://git.openjdk.org/jdk/pull/17834#issuecomment-1997685719 From chagedorn at openjdk.org Thu Mar 14 15:16:40 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 14 Mar 2024 15:16:40 GMT Subject: RFR: 8308660: C2 compilation hits 'node must be dead' assert In-Reply-To: References: Message-ID: On Thu, 14 Mar 2024 14:12:09 GMT, Roland Westrelin wrote: > In `IfNode::fold_compares_helper()`, `adjusted_val` is: > > > (SubI (AddI top constant) 0) > > > which is then transformed to the `top` node. The code, next, tries to > destroy the `adjusted_val` node i.e. the `top` node. That results in > the assert failure. Given We're trying to fold 2 ifs in a dying part > of the graph, the fix is straightforward: test `adjusted_val` for top > and bail out from the transformation if that's the case. Otherwise, looks good! test/hotspot/jtreg/compiler/c2/TestFoldIfRemovesTopNode.java line 29: > 27: * @summary C2 compilation hits 'node must be dead' assert > 28: * @run main/othervm -XX:-BackgroundCompilation -XX:-TieredCompilation -XX:-UseOnStackReplacement -XX:+StressIGVN -XX:StressSeed=242006623 TestFoldIfRemovesTopNode > 29: * @run main/othervm -XX:-BackgroundCompilation -XX:-TieredCompilation -XX:-UseOnStackReplacement -XX:+StressIGVN TestFoldIfRemovesTopNode You should add `-XX:+UnlockDiagnosticVMOptions` to run with product and either add an `-XX:+IgnoreUnrecognizedVMOptions` or `@requires vm.compiler2.enabled` since `StressIGVN` is a C2 flag. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/18305#pullrequestreview-1937005031 PR Review Comment: https://git.openjdk.org/jdk/pull/18305#discussion_r1525053127 From chagedorn at openjdk.org Thu Mar 14 15:18:38 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 14 Mar 2024 15:18:38 GMT Subject: RFR: JDK-8328165: improve assert(idx < _maxlrg) failed: oob In-Reply-To: References: Message-ID: On Thu, 14 Mar 2024 12:45:20 GMT, Matthias Baesken wrote: > The assert in chaitin.hpp > assert(idx < _maxlrg) failed: oob > could be improved, it should show more information. Looks good. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18302#pullrequestreview-1937017655 From roland at openjdk.org Thu Mar 14 16:02:12 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 14 Mar 2024 16:02:12 GMT Subject: RFR: 8308660: C2 compilation hits 'node must be dead' assert [v2] In-Reply-To: References: Message-ID: <5is7paUMlLqgbXgcmxPqZrtWbMkiD7edyN8kJGkzSrY=.9abcc72b-7f50-432c-a69c-964e09a81656@github.com> On Thu, 14 Mar 2024 15:13:43 GMT, Christian Hagedorn wrote: > Otherwise, looks good! Thanks for reviewing this. > test/hotspot/jtreg/compiler/c2/TestFoldIfRemovesTopNode.java line 29: > >> 27: * @summary C2 compilation hits 'node must be dead' assert >> 28: * @run main/othervm -XX:-BackgroundCompilation -XX:-TieredCompilation -XX:-UseOnStackReplacement -XX:+StressIGVN -XX:StressSeed=242006623 TestFoldIfRemovesTopNode >> 29: * @run main/othervm -XX:-BackgroundCompilation -XX:-TieredCompilation -XX:-UseOnStackReplacement -XX:+StressIGVN TestFoldIfRemovesTopNode > > You should add `-XX:+UnlockDiagnosticVMOptions` to run with product and either add an `-XX:+IgnoreUnrecognizedVMOptions` or `@requires vm.compiler2.enabled` since `StressIGVN` is a C2 flag. Right, good catch! Fixed in the new commit. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18305#issuecomment-1997793463 PR Review Comment: https://git.openjdk.org/jdk/pull/18305#discussion_r1525130073 From roland at openjdk.org Thu Mar 14 16:02:12 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 14 Mar 2024 16:02:12 GMT Subject: RFR: 8308660: C2 compilation hits 'node must be dead' assert [v2] In-Reply-To: References: Message-ID: > In `IfNode::fold_compares_helper()`, `adjusted_val` is: > > > (SubI (AddI top constant) 0) > > > which is then transformed to the `top` node. The code, next, tries to > destroy the `adjusted_val` node i.e. the `top` node. That results in > the assert failure. Given We're trying to fold 2 ifs in a dying part > of the graph, the fix is straightforward: test `adjusted_val` for top > and bail out from the transformation if that's the case. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: fix test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18305/files - new: https://git.openjdk.org/jdk/pull/18305/files/32a52ecd..f4703ea7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18305&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18305&range=00-01 Stats: 5 lines in 1 file changed: 3 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18305.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18305/head:pull/18305 PR: https://git.openjdk.org/jdk/pull/18305 From kvn at openjdk.org Thu Mar 14 17:02:38 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 14 Mar 2024 17:02:38 GMT Subject: RFR: 8308660: C2 compilation hits 'node must be dead' assert [v2] In-Reply-To: References: Message-ID: On Thu, 14 Mar 2024 16:02:12 GMT, Roland Westrelin wrote: >> In `IfNode::fold_compares_helper()`, `adjusted_val` is: >> >> >> (SubI (AddI top constant) 0) >> >> >> which is then transformed to the `top` node. 
The code, next, tries to >> destroy the `adjusted_val` node i.e. the `top` node. That results in >> the assert failure. Given We're trying to fold 2 ifs in a dying part >> of the graph, the fix is straightforward: test `adjusted_val` for top >> and bail out from the transformation if that's the case. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > fix test Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18305#pullrequestreview-1937283076 From qamai at openjdk.org Fri Mar 15 04:58:18 2024 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 15 Mar 2024 04:58:18 GMT Subject: RFR: 8315066: Add unsigned bounds and known bits to TypeInt/Long [v5] In-Reply-To: References: Message-ID: > Hi, > > This patch adds unsigned bounds and known bits constraints to TypeInt and TypeLong. This opens more transformation opportunities in an elegant manner as well as helps avoid some ad-hoc rules in Hotspot. > > In general, a TypeInt/Long represents a set of values x that satisfies: x s>= lo && x s<= hi && x u>= ulo && x u<= uhi && (x & zeros) == 0 && (~x & ones) == 0. These constraints are not independent, e.g. an int that lies in [0, 3] in signed domain must also lie in [0, 3] in unsigned domain and have all bits but the last 2 being unset. As a result, we must normalize the constraints (tighten the constraints so that they are optimal) before constructing a TypeInt/Long instance. > > This is extracted from #15440 , node value transformations are left for later PRs. I have also added unit tests to verify the soundness of constraint normalization. > > Please kindly review, thanks a lot. > > Testing > > - [x] GHA > - [x] Linux x64, tier 1-4 Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. 
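The interplay between bounds and known bits described in this message can be illustrated with a small sketch. This is illustrative only (the real `TypeInt`/`TypeLong` normalization is considerably more involved, handling signed, unsigned and bit constraints jointly); it shows one direction of the tightening: for a non-negative range such as [0, 3], every bit above the highest set bit of `hi` is zero in all members, so it belongs to the known-zero mask.

```java
// Toy sketch of one normalization rule: derive known-zero bits from a
// non-negative value range [0, hi]. Not the real HotSpot code.
public class KnownBits {
    static long knownZeros(long hi) {
        if (hi < 0) return 0;     // sign bit may be set: no leading bits known zero
        if (hi == 0) return ~0L;  // only member is 0: every bit is known zero
        // bits above the highest set bit of hi are zero in every member of [0, hi]
        return ~0L << (64 - Long.numberOfLeadingZeros(hi));
    }

    public static void main(String[] args) {
        long zeros = knownZeros(3); // range [0, 3]
        // verify exhaustively: no member of [0, 3] has a known-zero bit set
        for (long x = 0; x <= 3; x++) {
            if ((x & zeros) != 0) throw new AssertionError("bit leak at " + x);
        }
        System.out.println(Long.toHexString(zeros)); // fffffffffffffffc
    }
}
```

This matches the example in the message: an int in [0, 3] in the signed domain also lies in [0, 3] unsigned and has all bits but the last two known to be unset.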
The pull request now contains eight commits: - Merge branch 'master' into unsignedbounds - fix release build - add comments, group arguments to reduce C-style reference passing arguments - fix tests, add verify - add unit tests - fix template parameter - refactor - implement unsigned bounds and known bits ------------- Changes: https://git.openjdk.org/jdk/pull/17508/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17508&range=04 Stats: 1476 lines in 16 files changed: 919 ins; 286 del; 271 mod Patch: https://git.openjdk.org/jdk/pull/17508.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17508/head:pull/17508 PR: https://git.openjdk.org/jdk/pull/17508 From qamai at openjdk.org Fri Mar 15 04:58:34 2024 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 15 Mar 2024 04:58:34 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v48] In-Reply-To: References: Message-ID: > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. > > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant may overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result.
> > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (conditions (1) and (2) above are mostly the same). As a result, the magic constant `c`, calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr., may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64, and we can perform a full multiplication. > > For longs, there is no way to do a full multiplication, so we do some basic transformations to achieve a computable formula. I have written the details as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 92 commits: - Merge branch 'master' into unsignedDiv - further clarify variable meanings - just be simple - suggestion - update include order and license year - parentheses - another round of reviews - power of 2 - test for round down - address reviews - ...
and 82 more: https://git.openjdk.org/jdk/compare/bdd1aebe...ed0ca1c3 ------------- Changes: https://git.openjdk.org/jdk/pull/9947/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=47 Stats: 2389 lines in 13 files changed: 1909 ins; 289 del; 191 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From qamai at openjdk.org Fri Mar 15 05:07:02 2024 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 15 Mar 2024 05:07:02 GMT Subject: RFR: 8315066: Add unsigned bounds and known bits to TypeInt/Long [v6] In-Reply-To: References: Message-ID: > Hi, > > This patch adds unsigned bounds and known bits constraints to TypeInt and TypeLong. This opens more transformation opportunities in an elegant manner as well as helps avoid some ad-hoc rules in Hotspot. > > In general, a TypeInt/Long represents a set of values x that satisfies: x s>= lo && x s<= hi && x u>= ulo && x u<= uhi && (x & zeros) == 0 && (~x & ones) == 0. These constraints are not independent, e.g. an int that lies in [0, 3] in signed domain must also lie in [0, 3] in unsigned domain and have all bits but the last 2 being unset. As a result, we must normalize the constraints (tighten the constraints so that they are optimal) before constructing a TypeInt/Long instance. > > This is extracted from #15440 , node value transformations are left for later PRs. I have also added unit tests to verify the soundness of constraint normalization. > > Please kindly review, thanks a lot. 
> > Testing > > - [x] GHA > - [x] Linux x64, tier 1-4 Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: add comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17508/files - new: https://git.openjdk.org/jdk/pull/17508/files/ffb0abd7..6e2e6c56 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17508&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17508&range=04-05 Stats: 14 lines in 2 files changed: 6 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/17508.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17508/head:pull/17508 PR: https://git.openjdk.org/jdk/pull/17508 From qamai at openjdk.org Fri Mar 15 06:20:04 2024 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 15 Mar 2024 06:20:04 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v49] In-Reply-To: References: Message-ID: > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. > > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant may overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result.
> > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (conditions (1) and (2) above are mostly the same). As a result, the magic constant `c`, calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr., may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64, and we can perform a full multiplication. > > For longs, there is no way to do a full multiplication, so we do some basic transformations to achieve a computable formula. I have written the details as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much.
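The identity floor(x / d) = floor(x * c / 2**m) from the quoted description can be checked concretely for an unsigned 32-bit division. The sketch below uses d = 3 with c = ceil(2^33 / 3) = 0xAAAAAAAB and m = 33 (a case where the magic constant fits, so no add-back is needed); since both operands are below 2^32, their product is below 2^64 and Java's wrapping 64-bit multiply keeps the exact low bits, which the unsigned shift then turns into the quotient.

```java
// Verify floor(x / 3) == (x * 0xAAAAAAAB) >>> 33 for unsigned 32-bit x.
// Illustrative of the multiply-and-shift idealization; d = 3 chosen so that
// the magic constant fits in an uint32 and no dividend add-back is required.
public class MagicDiv3 {
    public static void main(String[] args) {
        final long c = 0xAAAAAAABL; // ceil(2^33 / 3)
        final int m = 33;
        long[] samples = {0, 1, 2, 3, 4, 5, 100, 0x7FFFFFFFL, 0x80000000L, 0xFFFFFFFFL};
        for (long x : samples) {         // x ranges over unsigned 32-bit values
            long viaMul = (x * c) >>> m; // true product < 2^64, so low 64 bits are exact
            long viaDiv = x / 3;         // reference quotient (x is non-negative in a long)
            if (viaMul != viaDiv) throw new AssertionError("mismatch at x=" + x);
        }
        System.out.println("floor(x/3) == (x * 0xAAAAAAAB) >>> 33 for all sampled x");
    }
}
```

For divisors like 7 the round-up constant does not fit in 32 bits, which is exactly the overflow case the quoted description handles with the Robison theorem or the add-back sequence.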
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: fix tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9947/files - new: https://git.openjdk.org/jdk/pull/9947/files/ed0ca1c3..3068d7e5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=48 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=47-48 Stats: 6 lines in 1 file changed: 0 ins; 4 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From mbaesken at openjdk.org Fri Mar 15 08:13:39 2024 From: mbaesken at openjdk.org (Matthias Baesken) Date: Fri, 15 Mar 2024 08:13:39 GMT Subject: RFR: JDK-8328165: improve assert(idx < _maxlrg) failed: oob In-Reply-To: References: Message-ID: On Thu, 14 Mar 2024 12:45:20 GMT, Matthias Baesken wrote: > The assert in chaitin.hpp > assert(idx < _maxlrg) failed: oob > could be improved, it should show more information. Thanks for the reviews ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/18302#issuecomment-1999133791 From mbaesken at openjdk.org Fri Mar 15 08:13:39 2024 From: mbaesken at openjdk.org (Matthias Baesken) Date: Fri, 15 Mar 2024 08:13:39 GMT Subject: Integrated: JDK-8328165: improve assert(idx < _maxlrg) failed: oob In-Reply-To: References: Message-ID: On Thu, 14 Mar 2024 12:45:20 GMT, Matthias Baesken wrote: > The assert in chaitin.hpp > assert(idx < _maxlrg) failed: oob > could be improved, it should show more information. This pull request has now been integrated. 
Changeset: d57bdd85 Author: Matthias Baesken URL: https://git.openjdk.org/jdk/commit/d57bdd85ab5e45a2ecfce0c022da067ac30bb80d Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8328165: improve assert(idx < _maxlrg) failed: oob Reviewed-by: mdoerr, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/18302 From bkilambi at openjdk.org Fri Mar 15 11:03:08 2024 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Fri, 15 Mar 2024 11:03:08 GMT Subject: RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v3] In-Reply-To: References: Message-ID: > Floating-point addition is non-associative; that is, adding floating-point elements in arbitrary order may produce different values. Notably, the Vector API intentionally does not define the order of reduction, which allows platforms to generate more efficient code [1]. So a node is needed to represent non strictly-ordered add-reduction for floating-point types in C2. > > To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value. > > With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D` on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. > > [AArch64] > On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. > > This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. > > No effects on other platforms. > > [Performance] > FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type.
With this patch, it improves ~3x on my SVE machine (128-bit). > > ADDLanes > > Benchmark Before After Unit > FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms > > > Final code is as below: > > Before: > ` fadda z17.s, p7/m, z17.s, z16.s > ` > After: > > faddp v17.4s, v21.4s, v21.4s > faddp s18, v17.2s > fadd s18, s18, s19 > > > > > [Test] > Full jtreg passed on AArch64 and x86. > > [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 > [2] https://bugs.openjdk.org/browse/JDK-8275275 > [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Naming changes: replace strict/non-strict with more technical terms ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18034/files - new: https://git.openjdk.org/jdk/pull/18034/files/f8f79ac2..4aed4b50 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18034&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18034&range=01-02 Stats: 69 lines in 6 files changed: 2 ins; 0 del; 67 mod Patch: https://git.openjdk.org/jdk/pull/18034.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18034/head:pull/18034 PR: https://git.openjdk.org/jdk/pull/18034 From roland at openjdk.org Fri Mar 15 15:40:53 2024 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 15 Mar 2024 15:40:53 GMT Subject: RFR: 8308660: C2 compilation hits 'node must be dead' assert [v2] In-Reply-To: References: Message-ID: On Fri, 15 Mar 2024 15:31:03 GMT, Christian Hagedorn wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> fix test > > Thanks for the update, looks good! 
Thanks for the reviews @chhagedorn @vnkozlov ------------- PR Comment: https://git.openjdk.org/jdk/pull/18305#issuecomment-1999921655 From chagedorn at openjdk.org Fri Mar 15 15:40:53 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 15 Mar 2024 15:40:53 GMT Subject: RFR: 8308660: C2 compilation hits 'node must be dead' assert [v2] In-Reply-To: References: Message-ID: On Thu, 14 Mar 2024 16:02:12 GMT, Roland Westrelin wrote: >> In `IfNode::fold_compares_helper()`, `adjusted_val` is: >> >> >> (SubI (AddI top constant) 0) >> >> >> which is then transformed to the `top` node. The code, next, tries to >> destroy the `adjusted_val` node i.e. the `top` node. That results in >> the assert failure. Given We're trying to fold 2 ifs in a dying part >> of the graph, the fix is straightforward: test `adjusted_val` for top >> and bail out from the transformation if that's the case. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > fix test Thanks for the update, looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18305#pullrequestreview-1939247441 From aturbanov at openjdk.org Sat Mar 16 22:17:55 2024 From: aturbanov at openjdk.org (Andrey Turbanov) Date: Sat, 16 Mar 2024 22:17:55 GMT Subject: RFR: 8308660: C2 compilation hits 'node must be dead' assert [v2] In-Reply-To: References: Message-ID: On Thu, 14 Mar 2024 16:02:12 GMT, Roland Westrelin wrote: >> In `IfNode::fold_compares_helper()`, `adjusted_val` is: >> >> >> (SubI (AddI top constant) 0) >> >> >> which is then transformed to the `top` node. The code, next, tries to >> destroy the `adjusted_val` node i.e. the `top` node. That results in >> the assert failure. Given We're trying to fold 2 ifs in a dying part >> of the graph, the fix is straightforward: test `adjusted_val` for top >> and bail out from the transformation if that's the case. 
> > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > fix test test/hotspot/jtreg/compiler/c2/TestFoldIfRemovesTopNode.java line 64: > 62: if (flag) { > 63: k = new int[k].length; > 64: int j = k + 3; Suggestion: int j = k + 3; ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18305#discussion_r1527142590 From jbhateja at openjdk.org Sat Mar 16 22:18:34 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 16 Mar 2024 22:18:34 GMT Subject: RFR: 8328309: Remove malformed masked shift instruction selection patterns Message-ID: - This bug fix patch removes the existing masked logical right, logical left and arithmetic right shift memory operand patterns, which do not take into account the bitwise AND operation that rounds the shift count; this limits their applicability to generic cases. - [JDK-8319889](https://bugs.openjdk.org/browse/JDK-8319889) also reported an unhandled operation assertion failure seen with some shift operation test points in Vector API JTREG tests along with the -XX:+StressIncrementalInlining flag.
Best Regards, Jatin ------------- Commit messages: - 8328309: Remove malformed masked shift instruction selection patterns Changes: https://git.openjdk.org/jdk/pull/18338/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18338&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8328309 Stats: 45 lines in 1 file changed: 0 ins; 45 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18338.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18338/head:pull/18338 PR: https://git.openjdk.org/jdk/pull/18338 From jbhateja at openjdk.org Sat Mar 16 22:19:10 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 16 Mar 2024 22:19:10 GMT Subject: RFR: 8319889: Vector API tests trigger VM crashes with -XX:+StressIncrementalInlining [v2] In-Reply-To: <25mULcFaZsaLw0z88ywNaq3virkrimZa-rpgWO8d3Dc=.04909c6c-42f4-4981-895d-2d7f07b027d5@github.com> References: <25mULcFaZsaLw0z88ywNaq3virkrimZa-rpgWO8d3Dc=.04909c6c-42f4-4981-895d-2d7f07b027d5@github.com> Message-ID: > This bug fix patch fixes a crash occurring due to combined effect of bi-morphic inlining, exception handling, randomized incremental inlining. In this case top level slice API is invoked using concrete 256 bit vector, some of the intermediate APIs within sliceTemplate are marked for lazy inlining due to randomized IncrementalInlining, these APIs returns an abstract vector which when used for virtual dispatch of subsequent APIs results into bi-morphic inlining on account of multiple profile based receiver types. Consider following code snippet. 
> > > ByteVector sliceTemplate(int origin, Vector v1) { > ByteVector that = (ByteVector) v1; > that.check(this); > Objects.checkIndex(origin, length() + 1); > VectorShuffle iota = iotaShuffle(); > VectorMask blendMask = iota.toVector().compare(VectorOperators.LT, (broadcast((byte)(length() - origin)))); [A] > iota = iotaShuffle(origin, 1, true); [B] > return that.rearrange(iota).blend(this.rearrange(iota), blendMask); [C] > } > > > > The receiver of sliceTemplate is a 256 bit vector; the parser defers inlining of the toVector() API (see code at line A) and generates a Call IR node returning an abstract vector. This abstract vector then virtually dispatches the compare API. The compiler observes multiple profile based receiver types (128 and 256 bit byte vectors) for the compare API, and the parser generates a chain of PredictedCallGenerators to bi-morphically inline it. > > PredictedCallGenerators (Vector.compare) > PredictedCallGenerators (Byte256Vector.compare) > ParseGenerator (Byte256Vector.compare) [D] > UncommonTrap (receiver other than Byte256Vector) > PredictedCallGenerators (Byte128Vector.compare) > ParseGenerator (Byte128Vector.compare) [E] > UncommonTrap (receiver other than Byte128Vector) [F] > PredictedCallGenerators (UncommonTrap) > [converged state] = Merge JVM State originating from C and E [G] > > Since the top level receiver of sliceTemplate is a Byte256Vector, while executing the call generator for Byte128Vector.compare (see code at line E) the compiler observes a mismatch between incoming argument species, i.e. one argument is a 256 bit vector while the other is a 128 bit vector, and throws an exception. > > At the state convergence point (see code at line G), since one of the control paths resulted in an exception, the compiler propagates ... Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Restricting the patch to only the bi-morphic inlining crash fix.
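As an aside, the call-site shape involved can be sketched in plain Java (illustrative class names only; these are not the Vector API types): a single virtual call site whose profile records exactly two receiver types is what C2 bi-morphically inlines behind a type-check chain with an uncommon trap for anything else.

```java
public class BimorphicDemo {
    interface Vec { int lanes(); }
    static final class Vec128 implements Vec { public int lanes() { return 16; } }
    static final class Vec256 implements Vec { public int lanes() { return 32; } }

    static int dispatch(Vec v) {
        // Virtual call with two receiver types in the profile:
        // the JIT inlines both behind type checks (bi-morphic inlining).
        return v.lanes();
    }

    public static void main(String[] args) {
        int sum = 0;
        for (int i = 0; i < 100_000; i++) {
            sum += dispatch((i & 1) == 0 ? new Vec128() : new Vec256());
        }
        System.out.println(sum); // 50_000 * 16 + 50_000 * 32 = 2400000
    }
}
```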
------------- Changes: - all: https://git.openjdk.org/jdk/pull/18282/files - new: https://git.openjdk.org/jdk/pull/18282/files/43c1e399..39edea6f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18282&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18282&range=00-01 Stats: 51 lines in 2 files changed: 46 ins; 3 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18282.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18282/head:pull/18282 PR: https://git.openjdk.org/jdk/pull/18282 From jbhateja at openjdk.org Sat Mar 16 22:19:17 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 16 Mar 2024 22:19:17 GMT Subject: RFR: 8319889: Vector API tests trigger VM crashes with -XX:+StressIncrementalInlining [v2] In-Reply-To: References: <25mULcFaZsaLw0z88ywNaq3virkrimZa-rpgWO8d3Dc=.04909c6c-42f4-4981-895d-2d7f07b027d5@github.com> Message-ID: On Wed, 13 Mar 2024 19:42:14 GMT, Vladimir Ivanov wrote: > Overall, both fixes look good. > > I suggest to handle the bugs separately (as 2 bug fixes). Removed shift pattern related changes and created a separate JBS entry for it https://bugs.openjdk.org/browse/JDK-8328309 ------------- PR Comment: https://git.openjdk.org/jdk/pull/18282#issuecomment-2002079656 From bkilambi at openjdk.org Sat Mar 16 22:19:50 2024 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Sat, 16 Mar 2024 22:19:50 GMT Subject: RFR: 8320725: C2: Add "is_associative" flag for floating-point add-reduction [v3] In-Reply-To: References: Message-ID: <1E4ZVXo7mL8TvOTDVibiByyGlC19DG6B7ZwSQtZh3C4=.6a99c0c1-6181-4214-a9fa-ecdb4c3bbd22@github.com> On Fri, 15 Mar 2024 11:03:08 GMT, Bhavana Kilambi wrote: >> Floating-point addition is non-associative, that is, adding floating-point elements in arbitrary order may produce different values. Specifically, the Vector API intentionally does not define the order of reduction, which allows platforms to generate more efficient code [1].
So that needs a node to represent non strictly-ordered add-reduction for floating-point type in C2. >> >> To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value. >> >> With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. >> >> [AArch64] >> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. >> >> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. >> >> No effects on other platforms. >> >> [Performance] >> FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit). >> >> ADDLanes >> >> Benchmark Before After Unit >> FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms >> >> >> Final code is as below: >> >> Before: >> ` fadda z17.s, p7/m, z17.s, z16.s >> ` >> After: >> >> faddp v17.4s, v21.4s, v21.4s >> faddp s18, v17.2s >> fadd s18, s18, s19 >> >> >> >> >> [Test] >> Full jtreg passed on AArch64 and x86. 
>> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 >> [2] https://bugs.openjdk.org/browse/JDK-8275275 >> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Naming changes: replace strict/non-strict with more technical terms There seem to be a couple of failures. Linux cross-compile fails but this seems to be more of a build problem than something caused by my patch. There's one JTREG test failure on x86 - tools/javac/patterns/Exhaustiveness.java which seemed to fail due to - Agent error: java.lang.Exception: Agent 8 timed out with a timeout of 480 seconds; I re-ran this specific test manually on an x86 machine and this test passed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-2000337142 From dlong at openjdk.org Sat Mar 16 22:20:54 2024 From: dlong at openjdk.org (Dean Long) Date: Sat, 16 Mar 2024 22:20:54 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v6] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:12:12 GMT, Galder Zamarreño wrote: >> Adding C1 intrinsic for primitive array clone invocations for aarch64 and x86 architectures. >> >> The intrinsic includes a change to avoid zeroing the newly allocated array because its contents are copied over within the same intrinsic with arraycopy. This means that the performance of primitive array clone exceeds that of primitive array copy. As an example, here are the microbenchmark results on darwin/aarch64: >> >> >> $ make test TEST="micro:java.lang.ArrayClone" MICRO="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 3.476 ±
0.018 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 3.740 ± 0.017 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 7.124 ± 0.010 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 39.301 ± 0.106 ns/op >> ArrayClone.byteClone 0 avgt 15 3.478 ± 0.008 ns/op >> ArrayClone.byteClone 10 avgt 15 3.562 ± 0.007 ns/op >> ArrayClone.byteClone 100 avgt 15 5.888 ± 0.206 ns/op >> ArrayClone.byteClone 1000 avgt 15 25.762 ± 0.203 ns/op >> ArrayClone.intArraycopy 0 avgt 15 3.199 ± 0.016 ns/op >> ArrayClone.intArraycopy 10 avgt 15 4.521 ± 0.008 ns/op >> ArrayClone.intArraycopy 100 avgt 15 17.429 ± 0.039 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 178.432 ± 0.777 ns/op >> ArrayClone.intClone 0 avgt 15 3.406 ± 0.016 ns/op >> ArrayClone.intClone 10 avgt 15 4.272 ± 0.006 ns/op >> ArrayClone.intClone 100 avgt 15 13.110 ± 0.122 ns/op >> ArrayClone.intClone 1000 avgt 15 113.196 ± 13.400 ns/op >> >> >> It also includes an optimization to avoid instantiating the array copy stub in scenarios like this. >> >> I run hotspot compiler tests successfully limiting them to C1 compilation darwin/aarch64, linux/x86_64 and linux/686. E.g. >> >> >> $ make test TEST="hotspot_compiler" JTREG="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> ... >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:hotspot_compiler 1234 1234 0 0 >> >> >> One question I had is what to do about non-primitive object arrays, see my [question](https://bugs.openjdk.org/browse/JDK-8302850?focusedId=14634879&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14634879) on the issue. @cl4es any thoughts? >>... > Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits: > - Merge branch 'master' into topic.0131.c1-array-clone > - Reserve necessary frame map space for clone use cases > - 8302850: C1 primitive array clone intrinsic in graph > > * Combine array length, new type array and arraycopy for clone in c1 graph.
> * Add OmitCheckFlags to skip arraycopy checks. > * Instantiate ArrayCopyStub only if necessary. > * Avoid zeroing newly created arrays for clone. > * Add array null after c1 clone compilation test. > * Pass force reexecute to intrinsic via value stack. > This is needed to be able to deoptimize correctly this intrinsic. > * When new type array or array copy are used for the clone intrinsic, > their state needs to be based on the state before for deoptimization > to work as expected. > - Revert "8302850: Primitive array copy C1 intrinsic for aarch64 and x86" > > This reverts commit fe5d916724614391a685bbef58ea939c84197d07. > - 8302850: Link code emit infos for null check and alloc array > - 8302850: Null check array before getting its length > > * Added a jtreg test to verify the null check works. > Without the fix this test fails with a SEGV crash. > - 8302850: Force reexecuting clone in case of a deoptimization > > * Copy state including locals for clone > so that reexecution works as expected. > - 8302850: Avoid instantiating array copy stub for clone use cases > - 8302850: Primitive array copy C1 intrinsic for aarch64 and x86 > > * Clone calls that involve Phi nodes are not supported. > * Add unimplemented stubs for other platforms. OK, I missed the is_intrinsic_supported change. The platform-specific changes should probably have a comment saying they are for clone support. Also, I was hoping there was a way to minimize platform-specific changes, maybe by handling the force_reexecute inheritance in state_for(), and putting the state in x->state() instead of x->state_before().
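For readers comparing the two shapes being benchmarked in this thread, a minimal sketch (hypothetical demo code, not the benchmark source): `clone()` lets the VM skip zeroing the fresh array because every element is overwritten by the copy, while the explicit allocate-then-arraycopy form pays for the zeroing first.

```java
import java.util.Arrays;

public class CloneVsArraycopy {
    static int[] viaClone(int[] src) {
        // Intrinsic-friendly form: allocation and copy can be fused,
        // so pre-zeroing the new array is unnecessary.
        return src.clone();
    }

    static int[] viaArraycopy(int[] src) {
        // Explicit form: the new array is zero-initialized, then overwritten.
        int[] dst = new int[src.length];
        System.arraycopy(src, 0, dst, 0, src.length);
        return dst;
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3};
        System.out.println(Arrays.equals(viaClone(src), viaArraycopy(src))); // true
    }
}
```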
------------- PR Comment: https://git.openjdk.org/jdk/pull/17667#issuecomment-2000269809 From qamai at openjdk.org Mon Mar 18 04:38:26 2024 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 18 Mar 2024 04:38:26 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v4] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 02:05:39 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` are in the `BoolNode::Ideal` function, with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that is better suited to the `BoolNode::Value` function. >> >> A new unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. > > Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: > > update the package name for tests `(x & m) u< m + 1` is false for `m = -1`, right? ------------- PR Comment: https://git.openjdk.org/jdk/pull/18198#issuecomment-2002913926 From ksakata at openjdk.org Mon Mar 18 05:02:25 2024 From: ksakata at openjdk.org (Koichi Sakata) Date: Mon, 18 Mar 2024 05:02:25 GMT Subject: RFR: 8323242: Remove vestigial DONT_USE_REGISTER_DEFINES In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 20:01:28 GMT, Vladimir Kozlov wrote: >> This pull request removes an unnecessary directive. >> >> There is no definition of DONT_USE_REGISTER_DEFINES in HotSpot or the build system, so this `#ifndef` conditional directive is always true. We can remove it. >> >> I built OpenJDK with Zero VM as a test. It was successful.
>> >> >> $ ./configure --with-jvm-variants=zero --enable-debug >> $ make images >> $ ./build/macosx-aarch64-zero-fastdebug/jdk/bin/java -version >> openjdk version "23-internal" 2024-09-17 >> OpenJDK Runtime Environment (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk) >> OpenJDK 64-Bit Zero VM (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk, interpreted mode) >> >> >> It may be possible to remove the `#define noreg` as well because the CONSTANT_REGISTER_DECLARATION macro creates a variable named noreg, but I can't be sure. When I tried removing the noreg definition and building the OpenJDK, the build was successful. > > This was from these changes [JDK-8000780](https://github.com/openjdk/jdk/commit/e184d5cc4ec66640366d2d30d8dfaba74a1003a7) > > Maybe @rkennke remembers why he added it. Maybe it was for some debugging purpose. @vnkozlov Thank you for the information! How about removing just the `#ifndef` conditional directive? We know it has no effect, but it's not clear to us whether the `#define noreg` is also unused. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18115#issuecomment-2002931827 From jbhateja at openjdk.org Mon Mar 18 08:23:29 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 18 Mar 2024 08:23:29 GMT Subject: RFR: 8320725: C2: Add "is_associative" flag for floating-point add-reduction [v2] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 17:17:36 GMT, Bhavana Kilambi wrote: > > **> Why does your patch not do anything for x86? I guess x86 AD-files have no float/double reduction for the associative case, only the non-associative (strict order). But I think it would be easy to implement, just take the code used for int/long etc reductions.** Well, what I meant was that the changes in this patch (specifically the mid-end part) do not break/change anything in x86 (or any other platform).
Yes, the current *.ad files might have rules only for the strict order case and more rules can be added for non-associative case if that benefits x86 (same for other platforms). So for aarch64, we have different instruction(s) for floating-point strict order/non-strict order and we know which ones are beneficial to be generated on which aarch64 machines. I could have tried but I am not very well versed with x86 ISA and would request anyone from Intel or someone who has the expertise with x86 ISA to make x86 related changes please (if required). Maybe Jatin/Sandhya can help here ? I am unable to tag them for some reason. > I agree, as per following Vector API documentation, backends are free to deviate from JVM specification (15.18.2) which enforces non-associativity on FP operation. https://docs.oracle.com/en/java/javase/21/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#fp_assoc:~:text=Certain%20associative%20operations,this%20machine%20code. You may create a follow-up RFE for x86 side of optimization and assign it to me. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-2003174869 From bkilambi at openjdk.org Mon Mar 18 09:35:37 2024 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 18 Mar 2024 09:35:37 GMT Subject: RFR: 8320725: C2: Add "is_associative" flag for floating-point add-reduction [v3] In-Reply-To: References: Message-ID: On Fri, 15 Mar 2024 11:03:08 GMT, Bhavana Kilambi wrote: >> Floating-point addition is non-associative, that is adding floating-point elements in arbitrary order may get different value. Specially, Vector API does not define the order of reduction intentionally, which allows platforms to generate more efficient codes [1]. So that needs a node to represent non strictly-ordered add-reduction for floating-point type in C2. >> >> To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. 
It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value. >> >> With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. >> >> [AArch64] >> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. >> >> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. >> >> No effects on other platforms. >> >> [Performance] >> FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit). >> >> ADDLanes >> >> Benchmark Before After Unit >> FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms >> >> >> Final code is as below: >> >> Before: >> ` fadda z17.s, p7/m, z17.s, z16.s >> ` >> After: >> >> faddp v17.4s, v21.4s, v21.4s >> faddp s18, v17.2s >> fadd s18, s18, s19 >> >> >> >> >> [Test] >> Full jtreg passed on AArch64 and x86. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 >> [2] https://bugs.openjdk.org/browse/JDK-8275275 >> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Naming changes: replace strict/non-strict with more technical terms Thank you Jatin. I will do that once this PR is approved and merged upstream. 
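For readers skimming the thread, the reason strict ordering matters can be shown with a scalar sketch (a hypothetical example, not taken from the patch): reassociating floating-point adds changes the rounding, so a reordered reduction may produce a different value than the sequential left-to-right one.

```java
public class FpAssociativityDemo {
    public static void main(String[] args) {
        float a = 1e30f, b = -1e30f, c = 1.0f;
        // Strict left-to-right order, as a sequential loop reduction would do:
        float ordered = (a + b) + c;      // 0.0f + 1.0f
        // Reassociated order, as a relaxed vector reduction might do:
        float reassociated = a + (b + c); // b + c rounds back to -1e30f
        System.out.println(ordered);      // 1.0
        System.out.println(reassociated); // 0.0
    }
}
```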
------------- PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-2003330183 From epeter at openjdk.org Mon Mar 18 12:59:32 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 18 Mar 2024 12:59:32 GMT Subject: RFR: 8320725: C2: Add "is_associative" flag for floating-point add-reduction [v3] In-Reply-To: References: Message-ID: On Fri, 15 Mar 2024 11:03:08 GMT, Bhavana Kilambi wrote: >> Floating-point addition is non-associative, that is adding floating-point elements in arbitrary order may get different value. Specially, Vector API does not define the order of reduction intentionally, which allows platforms to generate more efficient codes [1]. So that needs a node to represent non strictly-ordered add-reduction for floating-point type in C2. >> >> To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value. >> >> With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. >> >> [AArch64] >> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. >> >> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. >> >> No effects on other platforms. >> >> [Performance] >> FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit). 
>> >> ADDLanes >> >> Benchmark Before After Unit >> FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms >> >> >> Final code is as below: >> >> Before: >> ` fadda z17.s, p7/m, z17.s, z16.s >> ` >> After: >> >> faddp v17.4s, v21.4s, v21.4s >> faddp s18, v17.2s >> fadd s18, s18, s19 >> >> >> >> >> [Test] >> Full jtreg passed on AArch64 and x86. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 >> [2] https://bugs.openjdk.org/browse/JDK-8275275 >> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Naming changes: replace strict/non-strict with more technical terms Looks better already! Left a few comments, questions and suggestions. src/hotspot/cpu/aarch64/aarch64_vector.ad line 2887: > 2885: // Such nodes can only be generated by Vector API. > 2886: // 2. Non-associative (or strictly ordered) AddReductionVF, which can only be generated by > 2887: // auto-vectorization on SVE machine. I'd be careful with such strong statements. You can name them as examples, but the "only" may not stay true forever. We may for example at some point add some way to have a `Float.addAssociative`, to allow "fast-math" reductions in plain java, which can then be auto-vectorized with a non-strict ordered implementation. I'm not sure if we will ever do that, but it is possible. src/hotspot/share/opto/loopopts.cpp line 4306: > 4304: } > 4305: > 4306: static bool is_unordered_reduction(Node* n) { I think you should rename all the "unordered" to "associative". I implemented this, and I think it would now be nicer to make that change. src/hotspot/share/opto/vectorIntrinsics.cpp line 2: > 1: /* > 2: * Copyright (c) 2020, 2024, Oracle and/or its affiliates. All rights reserved. 
Sometimes you update ARM copyright, sometimes the Oracle one. Is that intended? src/hotspot/share/opto/vectornode.cpp line 1332: > 1330: case Op_AddReductionVL: return new AddReductionVLNode(ctrl, n1, n2); > 1331: case Op_AddReductionVF: return new AddReductionVFNode(ctrl, n1, n2, is_associative); > 1332: case Op_AddReductionVD: return new AddReductionVDNode(ctrl, n1, n2, is_associative); Why do you only do it for the `F/D` `Add` instructions, but not the `Mul` instructions? Would those not equally profit from associativity? src/hotspot/share/opto/vectornode.hpp line 270: > 268: // when it is auto-vectorized as auto-vectorization mandates the operation to be > 269: // non-associative (strictly ordered). > 270: bool _is_associative; Could this be a `const`? ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18034#pullrequestreview-1942840120 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1528457622 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1528461083 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1528464022 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1528468239 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1528482036 From epeter at openjdk.org Mon Mar 18 12:59:33 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 18 Mar 2024 12:59:33 GMT Subject: RFR: 8320725: C2: Add "is_associative" flag for floating-point add-reduction [v3] In-Reply-To: References: Message-ID: On Mon, 18 Mar 2024 12:38:55 GMT, Emanuel Peter wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Naming changes: replace strict/non-strict with more technical terms > > src/hotspot/share/opto/loopopts.cpp line 4306: > >> 4304: } >> 4305: >> 4306: static bool is_unordered_reduction(Node* n) { > > I think you should rename all the "unordered" to 
"associative". > I implemented this, and I think it would now be nicer to make that change. You can do that also for the descriptions here, and the `move_unordered_reduction_out_of_loop` -> `move_associative_reduction_out_of_loop` > src/hotspot/share/opto/vectornode.cpp line 1332: > >> 1330: case Op_AddReductionVL: return new AddReductionVLNode(ctrl, n1, n2); >> 1331: case Op_AddReductionVF: return new AddReductionVFNode(ctrl, n1, n2, is_associative); >> 1332: case Op_AddReductionVD: return new AddReductionVDNode(ctrl, n1, n2, is_associative); > > Why do you only do it for the `F/D` `Add` instructions, but not the `Mul` instructions? Would those not equally profit from associativity? I'm not super familiar with the Vector API, but I could not see that MUL is not associative. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1528462038 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1528478789 From epeter at openjdk.org Mon Mar 18 13:30:35 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 18 Mar 2024 13:30:35 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v6] In-Reply-To: References: Message-ID: On Thu, 14 Mar 2024 11:23:56 GMT, Hamlin Li wrote: >> HI, >> Can you have a look at this patch adding some tests for Math.round instrinsics? >> Thanks! >> >> ### FYI: >> During the development of RoundVF/RoundF, we faced the issues which were only spotted by running test exhaustively against 32/64 bits range of int/long. >> It's helpful to add these exhaustive tests in jdk for future possible usage, rather than build it everytime when needed. >> Of course, we need to put it in `manual` mode, so it's not run when `-automatic` jtreg option is specified which I guess is the mode CI used, please correct me if I'm assume incorrectly. > > Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: > > - Fix test failure in TestRoundVectorDoubleRandom.java > - Merge branch 'master' into round-v-exhaustive-tests > - add comments; refine code > - refine code; fix bug > - Merge branch 'master' into round-v-exhaustive-tests > - fix issue > - mv tests > - use IR framework to construct the random tests > - Initial commit Thanks for the changes, the comments help! You could still use full names, rather than `e` and `f`. Left a few comments. There are a few int/long issues like this: jshell> int x = 52; x ==> 52 jshell> long y = 1 << x; y ==> 1048576 jshell> long y = 1L << x y ==> 4503599627370496 I think I understand how you want to do the generation now. But as you see, there was a bug in it. And it was hard to find. That is why I was asking for just pure random value generation. Constructing values ourselves often goes wrong, and then the test-coverage drops quickly. Also: what is the probability that you will ever generate an `infty` or a `NaN` for doubles? Can you give me an estimate? test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 57: > 55: public static void main(String args[]) { > 56: TestFramework.runWithFlags("-XX:-TieredCompilation", "-XX:CompileThresholdScaling=0.3", "-XX:MaxVectorSize=16"); > 57: TestFramework.runWithFlags("-XX:-TieredCompilation", "-XX:CompileThresholdScaling=0.3", "-XX:MaxVectorSize=32"); You should either drop the `"-XX:MaxVectorSize=16"`, or at least have a run without this flag. There are machines with higher max vector length, e.g. AVX512 has `64`.
Would be nice to test those too ;) test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 93: > 91: > 92: int errn = 0; > 93: // a double precise float point is composed of 3 parts: sign/e(exponent)/f(signicant) Suggestion: // a double precise float point is composed of 3 parts: sign/e(exponent)/f(signicand) significand != significant ;) test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 100: > 98: // f (significant) part of a float value > 99: final int fWidth = eShift; > 100: final long fBound = 1 << fWidth; Suggestion: final long fBound = 1L << fWidth; test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 107: > 105: int fidx = 0; > 106: for (; fidx < fWidth; fidx++) { > 107: fis[fidx] = 1 << fidx; Suggestion: fis[fidx] = 1L << fidx; One more here, please be careful with this! test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 112: > 110: fis[fidx] = rand.nextLong(fBound); > 111: } > 112: fis[rand.nextInt(fidx)] = 0; Suggestion: fis[rand.nextInt(fNum)] = 0; I know it is equivalent, but a bit clearer ;) test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 122: > 120: final int eStart = rand.nextInt(9); > 121: final int eStep = (1 << 3) + rand.nextInt(3); > 122: for (int ei = eStart; ei < eBound; ei += eStep) { Why modify the `eStart`, but not the `eBound`? Why not have both constant? test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 130: > 128: input[eiIdx*2] = Double.longBitsToDouble(bits); > 129: // negative values > 130: bits = bits | (1 << 63); Suggestion: bits = bits | (1L << 63); And yet another of these test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 131: > 129: // negative values > 130: bits = bits | (1 << 63); > 131: input[eiIdx*2+1] = Double.longBitsToDouble(bits); Is this some sort of luck, or why is this always in bounds? 
------------- PR Review: https://git.openjdk.org/jdk/pull/17753#pullrequestreview-1942898173 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1528491128 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1528503479 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1528499484 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1528504618 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1528496276 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1528507124 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1528512615 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1528518614 From epeter at openjdk.org Mon Mar 18 13:30:35 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 18 Mar 2024 13:30:35 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v6] In-Reply-To: References: Message-ID: On Mon, 18 Mar 2024 13:07:21 GMT, Emanuel Peter wrote: >> Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: >> >> - Fix test failure in TestRoundVectorDoubleRandom.java >> - Merge branch 'master' into round-v-exhaustive-tests >> - add comments; refine code >> - refine code; fix bug >> - Merge branch 'master' into round-v-exhaustive-tests >> - fix issue >> - mv tests >> - use IR framework to construct the random tests >> - Initial commit > > test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 100: > >> 98: // f (significant) part of a float value >> 99: final int fWidth = eShift; >> 100: final long fBound = 1 << fWidth; > > Suggestion: > > final long fBound = 1L << fWidth; There may be others like it! 
> test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 122: > >> 120: final int eStart = rand.nextInt(9); >> 121: final int eStep = (1 << 3) + rand.nextInt(3); >> 122: for (int ei = eStart; ei < eBound; ei += eStep) { > > Why modify the `eStart`, but not the `eBound`? Why not have both constant? Ah, you just want to step through with some random start-offset, and some random stride. I see. `eStart = 0...8` `eStep = 8...10` Fair enough. Maybe add a comment for this. > test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 130: > >> 128: input[eiIdx*2] = Double.longBitsToDouble(bits); >> 129: // negative values >> 130: bits = bits | (1 << 63); > > Suggestion: > > bits = bits | (1L << 63); > > And yet another of these jshell> 1 << 63 $4 ==> -2147483648 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1528500677 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1528510828 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1528513243 From duke at openjdk.org Mon Mar 18 13:46:40 2024 From: duke at openjdk.org (ArsenyBochkarev) Date: Mon, 18 Mar 2024 13:46:40 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v6] In-Reply-To: References: Message-ID: > Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. > > ### Correctness checks > > Tier 1/2 tests are ok. 
> > ### Performance results on T-Head board > > #### Results for enabled intrinsic: > > Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --- | ---- | ----- | --- | ---- | --- | ---- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | > > #### Results for disabled intrinsic: > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | ArsenyBochkarev has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 17 commits: - Merge master - Re-selected register for tmp's in kernel_crc32 - Use shNadd in update_byte_crc32 - Optimize by1_loop - Fix unroll size - Rename constants - Partially unroll loop - Optimize loop counter in L_by16_loop - Use MacroAssembler::lwu instead of Assembler::lwu - Save instruction when getting table3 address - ... and 7 more: https://git.openjdk.org/jdk/compare/fb390d20...f5e2c52e ------------- Changes: https://git.openjdk.org/jdk/pull/17046/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17046&range=05 Stats: 552 lines in 8 files changed: 547 ins; 1 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/17046.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17046/head:pull/17046 PR: https://git.openjdk.org/jdk/pull/17046 From epeter at openjdk.org Mon Mar 18 13:50:30 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 18 Mar 2024 13:50:30 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v7] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 03:26:12 GMT, Jasmine Karthikeyan wrote: >> Hi all, I've created this patch which aims to convert common integer minimum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`. While this transform doesn't generally improve code generation on its own, it simplifies control flow and creates new opportunities for vectorization. >> >> I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> IfMinMax.testReductionInt avgt 15 500.307 ± 16.687 ns/op 509.383 ± 32.645 ns/op (no change)* >> IfMinMax.testReductionLong avgt 15 493.184 ± 17.596 ns/op 513.587 ± 28.339 ns/op (no change)* >> IfMinMax.testSingleInt avgt 15 3.588 ± 0.540 ns/op 2.965 ±
1.380 ns/op (no change) >> IfMinMax.testSingleLong avgt 15 3.673 ± 0.128 ns/op 3.506 ± 0.590 ns/op (no change) >> IfMinMax.testVectorInt avgt 15 340.425 ± 13.123 ns/op 59.689 ± 7.509 ns/op + 5.7x >> IfMinMax.testVectorLong avgt 15 326.420 ± 15.554 ns/op 117.190 ± 5.622 ns/op + 2.8x >> >> >> * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? >> >> The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Move logic to CMoveNode::Ideal and improve IR test Nice work, we are really getting there! Yes, let's not add any other things into this RFE, let's rather have follow-up RFE's. One request: add a vectorization IR test that shows off that we now use Min/Max vectors, rather than CMove vectors. src/hotspot/share/opto/movenode.cpp line 213: > 211: > 212: // Ensure comparison is an integral type, and that the cmove is of the same type. > 213: if ((cmp_op != Op_CmpI || cmove_op != Op_CMoveI) && (cmp_op != Op_CmpL || cmove_op != Op_CMoveL)) { What if we combine a `CmpI` with a `CMoveL`? Or maybe there is some strange way to use a `Float.floatToIntBits` and combine a `CmpI` with a `CMoveF`? Ah, wait. I think it is correct. It's just difficult to read these "inverted" formulas. I suggest you rewrite it to be: `! (both-int or both-long)` test/hotspot/jtreg/compiler/c2/irTests/TestIfMinMax.java line 36: > 34: * @summary Test that if expressions are properly folded into min/max nodes > 35: * @library /test/lib / > 36: * @run driver compiler.c2.irTests.TestIfMinMax Suggestion: * @run main compiler.c2.irTests.TestIfMinMax Turns out otherwise outside flags cannot get passed in. ------------- Changes requested by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/17574#pullrequestreview-1943038352 PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1528563649 PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1528569046 From duke at openjdk.org Mon Mar 18 13:59:40 2024 From: duke at openjdk.org (ArsenyBochkarev) Date: Mon, 18 Mar 2024 13:59:40 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v7] In-Reply-To: References: Message-ID: > Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. > > ### Correctness checks > > Tier 1/2 tests are ok. > > ### Performance results on T-Head board > > #### Results for enabled intrinsic: > > Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --- | ---- | ----- | --- | ---- | --- | ---- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | > > #### Results for disabled intrinsic: > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | 
ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | ArsenyBochkarev has updated the pull request incrementally with one additional commit since the last revision: Move MacroAssembler methods out of COMPILER2 macro ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17046/files - new: https://git.openjdk.org/jdk/pull/17046/files/f5e2c52e..857dc20d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17046&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17046&range=05-06 Stats: 366 lines in 2 files changed: 183 ins; 183 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/17046.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17046/head:pull/17046 PR: https://git.openjdk.org/jdk/pull/17046 From duke at openjdk.org Mon Mar 18 13:59:41 2024 From: duke at openjdk.org (ArsenyBochkarev) Date: Mon, 18 Mar 2024 13:59:41 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v5] In-Reply-To: <1ciFje_z8G3aP849ZeIfhtu890U1omYhobbRaxuQ3HI=.a000a953-c7f0-4604-8ffc-537af4c58328@github.com> References: <1ciFje_z8G3aP849ZeIfhtu890U1omYhobbRaxuQ3HI=.a000a953-c7f0-4604-8ffc-537af4c58328@github.com> Message-ID: On Wed, 28 Feb 2024 07:05:29 GMT, urniming wrote: >> ArsenyBochkarev has updated the pull request incrementally with three additional commits since the last revision: >> >> - Re-selected register for tmp's in kernel_crc32 >> - Use shNadd in update_byte_crc32 >> - Optimize by1_loop > > src/hotspot/cpu/riscv/macroAssembler_riscv.hpp line 1247: > >> 1245: bool upper); >> 1246: void 
update_byte_crc32(Register crc, Register val, Register table); >> 1247: > > Hi, one thing I'm confused about here, why do these three functions need to be in the `COMPILER2` macro ? Hello! Thanks for pointing this out, I moved these functions upwards, before `#ifdef COMPILER2`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1528615445 From smonteith at openjdk.org Mon Mar 18 14:40:27 2024 From: smonteith at openjdk.org (Stuart Monteith) Date: Mon, 18 Mar 2024 14:40:27 GMT Subject: RFR: 8326541: [AArch64] ZGC C2 load barrier stub considers the length of live registers when spilling registers [v3] In-Reply-To: References: Message-ID: On Fri, 15 Mar 2024 03:11:05 GMT, Joshua Zhu wrote: >> Currently the ZGC C2 load barrier stub saves whole live registers regardless of how much of each register is live on aarch64. >> Considering that the size of an SVE register is an implementation-defined multiple of 128 bits, up to 2048 bits, >> even the use of a single floating-point value may cause up to 2048 bits of stack to be occupied. >> Hence I would like to introduce this change on aarch64: take the length of live registers into consideration in the ZGC C2 load barrier stub. >> >> In a floating-point case on a 2048-bit SVE machine, the following ZLoadBarrierStubC2 >> >> >> ...... >> 0x0000ffff684cfad8: stp x15, x18, [sp, #80] >> 0x0000ffff684cfadc: sub sp, sp, #0x100 >> 0x0000ffff684cfae0: str z16, [sp] >> 0x0000ffff684cfae4: add x1, x13, #0x10 >> 0x0000ffff684cfae8: mov x0, x16 >> ;; 0xFFFF803F5414 >> 0x0000ffff684cfaec: mov x8, #0x5414 // #21524 >> 0x0000ffff684cfaf0: movk x8, #0x803f, lsl #16 >> 0x0000ffff684cfaf4: movk x8, #0xffff, lsl #32 >> 0x0000ffff684cfaf8: blr x8 >> 0x0000ffff684cfafc: mov x16, x0 >> 0x0000ffff684cfb00: ldr z16, [sp] >> 0x0000ffff684cfb04: add sp, sp, #0x100 >> 0x0000ffff684cfb08: ptrue p7.b >> 0x0000ffff684cfb0c: ldp x4, x5, [sp, #16] >> ...... >> >> >> could be optimized into: >> >> >> ......
>> 0x0000ffff684cfa50: stp x15, x18, [sp, #80] >> 0x0000ffff684cfa54: str d16, [sp, #-16]! // extra 8 bytes to align 16 bytes in push_fp() >> 0x0000ffff684cfa58: add x1, x13, #0x10 >> 0x0000ffff684cfa5c: mov x0, x16 >> ;; 0xFFFF7FA942A8 >> 0x0000ffff684cfa60: mov x8, #0x42a8 // #17064 >> 0x0000ffff684cfa64: movk x8, #0x7fa9, lsl #16 >> 0x0000ffff684cfa68: movk x8, #0xffff, lsl #32 >> 0x0000ffff684cfa6c: blr x8 >> 0x0000ffff684cfa70: mov x16, x0 >> 0x0000ffff684cfa74: ldr d16, [sp], #16 >> 0x0000ffff684cfa78: ptrue p7.b >> 0x0000ffff684cfa7c: ldp x4, x5, [sp, #16] >> ...... >> >> >> Besides the above benefit, when we know what size of register is live, >> we could remove the unnecessary caller save in ZGC C2 load barrier stub when we meet C-ABI SOE fp registers. >> >> Passed jtreg with option "-XX:+UseZGC -XX:+ZGenerational" with no failures introduced. > > Joshua Zhu has updated the pull request incrementally with one additional commit since the last revision: > > change jtreg test case name test/hotspot/jtreg/gc/z/TestRegistersPushPopAtZGCLoadBarrierStub.java line 291: > 289: String keyString = keyword + expected_number_of_push_pop_at_load_barrier_fregs + " " + expected_freg_type + " registers"; > 290: if (!containOnlyOneOccuranceOfKeyword(stdout, keyString)) { > 291: throw new RuntimeException("Stdout is expected to contain only one occurance of keyString: " + "'" + keyString + "'"); In the event of failure, would it be possible to print the erroneous output? The output from the subprocesses, being directly piped in, doesn't lend itself to easy debugging. At first I thought there might be an option that could alter OutputAnalyzers output, but sadly not. 
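A minimal sketch of what the reviewer is asking for, using a hypothetical helper (not the actual jtreg test code): make the failure path carry the subprocess output inside the exception message, so the log alone is enough to debug.

```java
public class CheckOccurrence {
    // Hypothetical stand-in for the test's keyword check: on failure, the
    // exception message carries the full stdout that was being matched.
    static void checkContainsExactlyOnce(String stdout, String keyString) {
        int first = stdout.indexOf(keyString);
        if (first < 0 || stdout.indexOf(keyString, first + 1) >= 0) {
            throw new RuntimeException(
                "Expected exactly one occurrence of '" + keyString + "'\n"
                + "--- subprocess stdout ---\n" + stdout);
        }
    }

    public static void main(String[] args) {
        // Passes: the key string appears exactly once.
        checkContainsExactlyOnce("push 2 fp registers\ndone\n", "push 2 fp registers");

        // Fails: two occurrences, and the exception reports the output itself.
        boolean reported = false;
        try {
            checkContainsExactlyOnce("hit\nhit\n", "hit");
        } catch (RuntimeException e) {
            reported = e.getMessage().contains("--- subprocess stdout ---");
        }
        System.out.println(reported); // true
    }
}
```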
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17977#discussion_r1528691697 From roland at openjdk.org Mon Mar 18 16:47:31 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 18 Mar 2024 16:47:31 GMT Subject: RFR: 8308660: C2 compilation hits 'node must be dead' assert [v2] In-Reply-To: References: Message-ID: <3RdHpTSFVw-IK_d3gxsvdHBRwjg31UVxc-x02C4wYcU=.f6a6a9c0-64a8-4732-a945-72d74e0b049a@github.com> On Sat, 16 Mar 2024 09:39:33 GMT, Andrey Turbanov wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> fix test > > test/hotspot/jtreg/compiler/c2/TestFoldIfRemovesTopNode.java line 64: > >> 62: if (flag) { >> 63: k = new int[k].length; >> 64: int j = k + 3; > > Suggestion: > > int j = k + 3; Thanks. Will fix this before pushing. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18305#discussion_r1528908458 From jkarthikeyan at openjdk.org Mon Mar 18 16:47:33 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 18 Mar 2024 16:47:33 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v7] In-Reply-To: References: Message-ID: On Mon, 18 Mar 2024 13:40:16 GMT, Emanuel Peter wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Move logic to CMoveNode::Ideal and improve IR test > > src/hotspot/share/opto/movenode.cpp line 213: > >> 211: >> 212: // Ensure comparison is an integral type, and that the cmove is of the same type. >> 213: if ((cmp_op != Op_CmpI || cmove_op != Op_CMoveI) && (cmp_op != Op_CmpL || cmove_op != Op_CMoveL)) { > > What if we combine a `CmpI` with a `CMoveL`? > Or maybe there is some strange way to use a `Float.floatToIntBits` and combine a `CmpI` with a `CMoveF`? > > Ah, wait. I think it is correct. It's just difficult to read these "inverted" formulas. 
> I suggest you rewrite it to be: > `! (both-int or both-long)` Ah yep, I had originally written it that way but I factored out the `!`. I agree that it would be cleaner that way, though. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1528908129 From roland at openjdk.org Mon Mar 18 16:56:00 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 18 Mar 2024 16:56:00 GMT Subject: RFR: 8308660: C2 compilation hits 'node must be dead' assert [v3] In-Reply-To: References: Message-ID: > In `IfNode::fold_compares_helper()`, `adjusted_val` is: > > > (SubI (AddI top constant) 0) > > > which is then transformed to the `top` node. The code, next, tries to > destroy the `adjusted_val` node i.e. the `top` node. That results in > the assert failure. Given we're trying to fold 2 ifs in a dying part > of the graph, the fix is straightforward: test `adjusted_val` for top > and bail out from the transformation if that's the case. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: review
simplifies the intrinsic implementation in JIT compiler. Quite a simplification! Have you checked if there are any performance differences? ------------- PR Review: https://git.openjdk.org/jdk/pull/18226#pullrequestreview-1943670073 From sviswanathan at openjdk.org Mon Mar 18 17:03:26 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Mon, 18 Mar 2024 17:03:26 GMT Subject: RFR: 8328309: Remove malformed masked shift instruction selection patterns In-Reply-To: References: Message-ID: On Sat, 16 Mar 2024 18:53:20 GMT, Jatin Bhateja wrote: > - This bug fix patch removes existing masked logical right, logical left and arithmetic right shift memory operand patterns which do not take into account shift count rounding bitwise AND operation, this limits their applicability to generic cases. > - [JDK-8319889](https://bugs.openjdk.org/browse/JDK-8319889) also reported unhandled operation assertion failure seen with some shift operation test points in Vector API JTREG tests along with -XX:+StressIncrementalInlining flag. > > Best Regards, > Jatin Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18338#pullrequestreview-1943687417 From aph at openjdk.org Mon Mar 18 17:18:28 2024 From: aph at openjdk.org (Andrew Haley) Date: Mon, 18 Mar 2024 17:18:28 GMT Subject: RFR: 8320725: C2: Add "is_associative" flag for floating-point add-reduction [v3] In-Reply-To: References: Message-ID: On Fri, 15 Mar 2024 11:03:08 GMT, Bhavana Kilambi wrote: >> Floating-point addition is non-associative, that is adding floating-point elements in arbitrary order may get different value. Specially, Vector API does not define the order of reduction intentionally, which allows platforms to generate more efficient codes [1]. So that needs a node to represent non strictly-ordered add-reduction for floating-point type in C2. 
>> >> To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value. >> >> With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. >> >> [AArch64] >> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. >> >> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. >> >> No effects on other platforms. >> >> [Performance] >> FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit). >> >> ADDLanes >> >> Benchmark Before After Unit >> FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms >> >> >> Final code is as below: >> >> Before: >> ` fadda z17.s, p7/m, z17.s, z16.s >> ` >> After: >> >> faddp v17.4s, v21.4s, v21.4s >> faddp s18, v17.2s >> fadd s18, s18, s19 >> >> >> >> >> [Test] >> Full jtreg passed on AArch64 and x86. 
>> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 >> [2] https://bugs.openjdk.org/browse/JDK-8275275 >> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Naming changes: replace strict/non-strict with more technical terms Please rename this to "8320725: AArch64: C2: Add "is_associative" flag for floating-point add-reduction" ------------- PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-2004498959 From roland at openjdk.org Mon Mar 18 17:19:36 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 18 Mar 2024 17:19:36 GMT Subject: RFR: 8321278: C2: Partial peeling fails with assert "last_peel <- first_not_peeled" Message-ID: The assert fails because peeling happens at a single entry `Region`. That `Region` only has a single input because other inputs were found unreachable and removed by `PhaseIdealLoop::Dominators()`. The fix I propose is to have `PhaseIdealLoop::Dominators()` remove the `Region` and its `Phi`s entirely in this case. 
------------- Commit messages: - fix & test Changes: https://git.openjdk.org/jdk/pull/18353/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18353&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8321278 Stats: 79 lines in 2 files changed: 78 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18353.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18353/head:pull/18353 PR: https://git.openjdk.org/jdk/pull/18353 From jbhateja at openjdk.org Mon Mar 18 17:24:34 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 18 Mar 2024 17:24:34 GMT Subject: Integrated: 8328309: Remove malformed masked shift instruction selection patterns In-Reply-To: References: Message-ID: On Sat, 16 Mar 2024 18:53:20 GMT, Jatin Bhateja wrote: > - This bug fix patch removes existing masked logical right, logical left and arithmetic right shift memory operand patterns which do not take into account shift count rounding bitwise AND operation, this limits their applicability to generic cases. > - [JDK-8319889](https://bugs.openjdk.org/browse/JDK-8319889) also reported unhandled operation assertion failure seen with some shift operation test points in Vector API JTREG tests along with -XX:+StressIncrementalInlining flag. > > Best Regards, > Jatin This pull request has now been integrated. 
Changeset: 9e32db26 Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/9e32db266e4c3cc9be273fa6b77112832a43ba4a Stats: 45 lines in 1 file changed: 0 ins; 45 del; 0 mod 8328309: Remove malformed masked shift instruction selection patterns Reviewed-by: sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/18338 From roland at openjdk.org Mon Mar 18 17:25:30 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 18 Mar 2024 17:25:30 GMT Subject: RFR: 8323972: C2 compilation fails with assert(!x->as_Loop()->is_loop_nest_inner_loop()) failed: loop was transformed In-Reply-To: <_Qm0japYCaj72QGczOLYyKgmkaAA4P5AhG6QPfmd3Ys=.d0e57b13-4cb2-4f05-b902-e655d7f2a123@github.com> References: <_Qm0japYCaj72QGczOLYyKgmkaAA4P5AhG6QPfmd3Ys=.d0e57b13-4cb2-4f05-b902-e655d7f2a123@github.com> Message-ID: On Tue, 12 Mar 2024 17:42:09 GMT, Emanuel Peter wrote: > Looks reasonable, but these ad-hoc CastII also make me nervous. I agree with that. Still feels like the most reasonable fix for this particular issue. > test/hotspot/jtreg/compiler/longcountedloops/TestInaccurateInnerLoopLimit.java line 40: > >> 38: >> 39: public static void test() { >> 40: for (long i = 9223372034707292164L; i > 9223372034707292158L; i += -2L) { } > > I'm always amazed at how such simple tests can fail. Is there any way we can improve the test coverage for Long loops? Fuzzer test cases that call `Objects.checkIndex()` I suppose would possibly catch bugs. 
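For reference, the quoted loop can be checked in isolation: a standalone program with the same bounds and stride confirms it runs only three iterations — exactly the kind of boundary behavior long counted-loop transforms must preserve.

```java
public class LongLoopIterations {
    public static void main(String[] args) {
        int iterations = 0;
        // Same bounds and stride as the quoted TestInaccurateInnerLoopLimit loop.
        for (long i = 9223372034707292164L; i > 9223372034707292158L; i += -2L) {
            iterations++;
        }
        // i takes ...164, ...162, ...160; the next value, ...158, fails the condition.
        System.out.println(iterations); // 3
    }
}
```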
------------- PR Comment: https://git.openjdk.org/jdk/pull/17965#issuecomment-2004514402 PR Review Comment: https://git.openjdk.org/jdk/pull/17965#discussion_r1528969620 From roland at openjdk.org Mon Mar 18 17:25:31 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 18 Mar 2024 17:25:31 GMT Subject: RFR: 8323972: C2 compilation fails with assert(!x->as_Loop()->is_loop_nest_inner_loop()) failed: loop was transformed In-Reply-To: References: <_Qm0japYCaj72QGczOLYyKgmkaAA4P5AhG6QPfmd3Ys=.d0e57b13-4cb2-4f05-b902-e655d7f2a123@github.com> Message-ID: On Mon, 18 Mar 2024 17:22:45 GMT, Roland Westrelin wrote: >> Looks reasonable, but these ad-hoc CastII also make me nervous. >> >> What worries me with adding such "Ad-Hoc" CastII nodes is that elsewhere a very similar computation may not have the same tight type. And then you have a tight type somewhere, and a loose type elsewhere. This is how we get the data-flow collapsing and the cfg not folding. > >> Looks reasonable, but these ad-hoc CastII also make me nervous. > > I agree with that. Still feels like the most reasonable fix for this particular issue. > @rwestrel please wait for our testing to complete, I just launched it. Thanks for running it. Any update on testing? ------------- PR Comment: https://git.openjdk.org/jdk/pull/17965#issuecomment-2004515234 From aph at openjdk.org Mon Mar 18 17:28:31 2024 From: aph at openjdk.org (Andrew Haley) Date: Mon, 18 Mar 2024 17:28:31 GMT Subject: RFR: 8320725: C2: Add "is_associative" flag for floating-point add-reduction [v3] In-Reply-To: References: Message-ID: On Fri, 15 Mar 2024 11:03:08 GMT, Bhavana Kilambi wrote: >> Floating-point addition is non-associative, that is adding floating-point elements in arbitrary order may get different value. Specially, Vector API does not define the order of reduction intentionally, which allows platforms to generate more efficient codes [1]. 
So that needs a node to represent non strictly-ordered add-reduction for floating-point type in C2. >> >> To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value. >> >> With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. >> >> [AArch64] >> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. >> >> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. >> >> No effects on other platforms. >> >> [Performance] >> FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit). >> >> ADDLanes >> >> Benchmark Before After Unit >> FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms >> >> >> Final code is as below: >> >> Before: >> ` fadda z17.s, p7/m, z17.s, z16.s >> ` >> After: >> >> faddp v17.4s, v21.4s, v21.4s >> faddp s18, v17.2s >> fadd s18, s18, s19 >> >> >> >> >> [Test] >> Full jtreg passed on AArch64 and x86. 
>> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 >> [2] https://bugs.openjdk.org/browse/JDK-8275275 >> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Naming changes: replace strict/non-strict with more technical terms src/hotspot/cpu/aarch64/aarch64_vector.ad line 2858: > 2856: // reduction addF > 2857: instruct reduce_add2F_neon(vRegF dst, vRegF fsrc, vReg vsrc) %{ > 2858: predicate(Matcher::vector_length(n->in(2)) == 2 && n->as_Reduction()->is_associative()); This `vector_length(n->in(2)) == 2` is very obscure. I suspect that anyone coming across this code would not understand it. What exactly is the reason that this pattern is only applied for the 16b case? You need to give a justification in a comment right here. src/hotspot/share/opto/vectornode.hpp line 235: > 233: // Floating-point addition and multiplication are non-associative, so > 234: // AddReductionVF/D and MulReductionVF/D require strict-ordering > 235: // in auto-vectorization. Currently, Vector API allows Don't say "currently". 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1528972669 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1528974623 From aph at openjdk.org Mon Mar 18 17:32:35 2024 From: aph at openjdk.org (Andrew Haley) Date: Mon, 18 Mar 2024 17:32:35 GMT Subject: RFR: 8320725: C2: Add "is_associative" flag for floating-point add-reduction [v3] In-Reply-To: References: Message-ID: <6KLRg7UgrEMNOU71aVTF1Pka972NReqA_wOynzEipHE=.f3e38aea-d0c0-4f09-8849-96398152ef6a@github.com> On Fri, 15 Mar 2024 11:03:08 GMT, Bhavana Kilambi wrote: >> Floating-point addition is non-associative, that is adding floating-point elements in arbitrary order may get different value. Specially, Vector API does not define the order of reduction intentionally, which allows platforms to generate more efficient codes [1]. So that needs a node to represent non strictly-ordered add-reduction for floating-point type in C2. >> >> To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value. >> >> With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. >> >> [AArch64] >> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. >> >> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. >> >> No effects on other platforms. >> >> [Performance] >> FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. 
With this patch, it improves ~3x on my SVE machine (128-bit). >> >> ADDLanes >> >> Benchmark Before After Unit >> FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms >> >> >> Final code is as below: >> >> Before: >> ` fadda z17.s, p7/m, z17.s, z16.s >> ` >> After: >> >> faddp v17.4s, v21.4s, v21.4s >> faddp s18, v17.2s >> fadd s18, s18, s19 >> >> >> >> >> [Test] >> Full jtreg passed on AArch64 and x86. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 >> [2] https://bugs.openjdk.org/browse/JDK-8275275 >> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Naming changes: replace strict/non-strict with more technical terms src/hotspot/share/opto/vectornode.hpp line 240: > 238: // > 239: // Other reductions are associative (do not need strict ordering). > 240: virtual bool is_associative() const { I think this flag may be badly named. The idea you want to express is not so much associativity, but whether such nodes should be treated as strictly ordered. It would be much less confusing to pick a name like ordered() because that describes what you want to the node to do. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1528979851 From rriggs at openjdk.org Mon Mar 18 17:35:27 2024 From: rriggs at openjdk.org (Roger Riggs) Date: Mon, 18 Mar 2024 17:35:27 GMT Subject: RFR: 8327964: Simplify BigInteger.implMultiplyToLen intrinsic In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 10:44:54 GMT, Yudi Zheng wrote: > Moving array construction within BigInteger.implMultiplyToLen intrinsic candidate to its caller simplifies the intrinsic implementation in JIT compiler. 
src/java.base/share/classes/java/math/BigInteger.java line 1836: > 1834: > 1835: if (z == null || z.length < (xlen+ ylen)) > 1836: z = new int[xlen+ylen]; Spaces before and after "+" please. Tnx ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18226#discussion_r1528984661 From dlong at openjdk.org Mon Mar 18 19:08:27 2024 From: dlong at openjdk.org (Dean Long) Date: Mon, 18 Mar 2024 19:08:27 GMT Subject: RFR: 8327964: Simplify BigInteger.implMultiplyToLen intrinsic In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 10:44:54 GMT, Yudi Zheng wrote: > Moving array construction within BigInteger.implMultiplyToLen intrinsic candidate to its caller simplifies the intrinsic implementation in the JIT compiler. src/hotspot/share/opto/library_call.cpp line 5934: > 5932: // 'y_start' points to y array + scaled ylen > 5933: > 5934: Node* zlen = _gvn.transform(new AddINode(xlen, ylen)); We could generate one less instruction in the code cache if we did this `add` in the native runtime function. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18226#discussion_r1529102070 From vlivanov at openjdk.org Mon Mar 18 19:50:26 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Mon, 18 Mar 2024 19:50:26 GMT Subject: RFR: 8319889: Vector API tests trigger VM crashes with -XX:+StressIncrementalInlining [v2] In-Reply-To: References: <25mULcFaZsaLw0z88ywNaq3virkrimZa-rpgWO8d3Dc=.04909c6c-42f4-4981-895d-2d7f07b027d5@github.com> Message-ID: On Sat, 16 Mar 2024 22:19:10 GMT, Jatin Bhateja wrote: >> This bug fix patch fixes a crash occurring due to the combined effect of bi-morphic inlining, exception handling, and randomized incremental inlining.
In this case the top level slice API is invoked using a concrete 256 bit vector; some of the intermediate APIs within sliceTemplate are marked for lazy inlining due to randomized IncrementalInlining, and these APIs return an abstract vector which, when used for virtual dispatch of subsequent APIs, results in bi-morphic inlining on account of multiple profile based receiver types. Consider the following code snippet. >> >> >> ByteVector sliceTemplate(int origin, Vector v1) { >> ByteVector that = (ByteVector) v1; >> that.check(this); >> Objects.checkIndex(origin, length() + 1); >> VectorShuffle iota = iotaShuffle(); >> VectorMask blendMask = iota.toVector().compare(VectorOperators.LT, (broadcast((byte)(length() - origin)))); [A] >> iota = iotaShuffle(origin, 1, true); [B] >> return that.rearrange(iota).blend(this.rearrange(iota), blendMask); [C] >> } >> >> >> >> The receiver for sliceTemplate is a 256 bit vector; the parser defers inlining of the toVector() API (see code at line A) and generates a Call IR returning an abstract vector. This abstract vector then virtually dispatches the compare API. The compiler observes multiple profile based receiver types (128 and 256 bit byte vectors) for the compare API, and the parser generates a chain of PredictedCallGenerators for bi-morphically inlining it. >> >> PredictedCallGenerators (Vector.compare) >> PredictedCallGenerators (Byte256Vector.compare) >> ParseGenerator (Byte256Vector.compare) [D] >> UncommonTrap (receiver other than Byte256Vector) >> PredictedCallGenerators (Byte128Vector.compare) >> ParseGenerator (Byte128Vector.compare) [E] >> UncommonTrap (receiver other than Byte128Vector) [F] >> PredictedCallGenerators (UncommonTrap) >> [converged state] = Merge JVM State originating from C and E [G] >> >> Since the top level receiver of sliceTemplate is Byte256Vector, while executing the call generator for Byte128Vector.compare (see code at line E) the compiler observes a mismatch b/w incoming argument species i.e.
one argument is a 256 bit vector while other is 128 bit vector and throws an exception. >> >> At state convergence point (see code at line G), since one of the c... > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Restricting the patch to only bi-morphic inlining crash fix. Marked as reviewed by vlivanov (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/18282#pullrequestreview-1944043145 From duke at openjdk.org Mon Mar 18 20:02:53 2024 From: duke at openjdk.org (ArsenyBochkarev) Date: Mon, 18 Mar 2024 20:02:53 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v8] In-Reply-To: References: Message-ID: <-9mGlu8beCdbADXVweUcUORyU5qkEfqa1FAtmeEqDPo=.0e3fb33a-39c6-4464-abf4-0c8594e48247@github.com> > Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. > > ### Correctness checks > > Tier 1/2 tests are ok. 
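For context, the Java-level entry point these stubs accelerate is `java.util.zip.CRC32`; a minimal usage sketch (illustrative only, not part of the patch; the class name is hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class Crc32Demo {
    // Computes the CRC-32 of a byte array via java.util.zip.CRC32.
    // The update(byte[], int, int) path is the one backed by the
    // _updateBytesCRC32 intrinsic when it is available.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        // "123456789" is the standard CRC-32 check input; its well-known
        // check value is 0xCBF43926.
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.printf("%08X%n", checksum(data)); // CBF43926
    }
}
```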
> > ### Performance results on T-Head board > > #### Results for enabled intrinsic: > > Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --- | ---- | ----- | --- | ---- | --- | ---- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | > > #### Results for disabled intrinsic: > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | ArsenyBochkarev has updated the pull request incrementally with one additional commit since the last revision: Move generate_updateBytesCRC32 out of COMPILER2 macro ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17046/files - new: 
https://git.openjdk.org/jdk/pull/17046/files/857dc20d..d640af0e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17046&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17046&range=06-07 Stats: 4 lines in 1 file changed: 2 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/17046.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17046/head:pull/17046 PR: https://git.openjdk.org/jdk/pull/17046 From sviswanathan at openjdk.org Mon Mar 18 21:26:20 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Mon, 18 Mar 2024 21:26:20 GMT Subject: RFR: 8319889: Vector API tests trigger VM crashes with -XX:+StressIncrementalInlining [v2] In-Reply-To: References: <25mULcFaZsaLw0z88ywNaq3virkrimZa-rpgWO8d3Dc=.04909c6c-42f4-4981-895d-2d7f07b027d5@github.com> Message-ID: On Sat, 16 Mar 2024 22:19:10 GMT, Jatin Bhateja wrote: >> This bug fix patch fixes a crash occurring due to combined effect of bi-morphic inlining, exception handling, randomized incremental inlining. In this case top level slice API is invoked using concrete 256 bit vector, some of the intermediate APIs within sliceTemplate are marked for lazy inlining due to randomized IncrementalInlining, these APIs returns an abstract vector which when used for virtual dispatch of subsequent APIs results into bi-morphic inlining on account of multiple profile based receiver types. Consider following code snippet. 
>> >> >> ByteVector sliceTemplate(int origin, Vector v1) { >> ByteVector that = (ByteVector) v1; >> that.check(this); >> Objects.checkIndex(origin, length() + 1); >> VectorShuffle iota = iotaShuffle(); >> VectorMask blendMask = iota.toVector().compare(VectorOperators.LT, (broadcast((byte)(length() - origin)))); [A] >> iota = iotaShuffle(origin, 1, true); [B] >> return that.rearrange(iota).blend(this.rearrange(iota), blendMask); [C] >> } >> >> >> >> Receiver for sliceTemplate is a 256 bit vector, parser defers inlining of toVector() API (see code at line A) and generates a Call IR returning an abstract vector. This abstract vector then virtually dispatches compare API. Compiler observes multiple profile based receiver types (128 and 256 bit byte vectors) for compare API and parser generates a chain of PredictedCallGenerators for bi-morphically inlining it. >> >> PredictedCallGenerators (Vector.compare) >> PredictedCallGenerators (Byte256Vector.compare) >> ParseGenerator (Byte256Vector.compare) [D] >> UncommonTrap (receiver other than Byte256Vector) >> PredictedCallGenerators (Byte128Vector.compare) >> ParseGenerator (Byte128Vector.compare) [E] >> UncommonTrap (receiver other than Byte128Vector) [F] >> PredictedCallGenerators (UncommonTrap) >> [converged state] = Merge JVM State orginating from C and E [G] >> >> Since top level receiver of sliceTemplate is Byte256Vector hence while executing the call generator for Byte128Vector.compare (see code at line E) compiler observes a mismatch b/w incoming argument species i.e. one argument is a 256 bit vector while other is 128 bit vector and throws an exception. >> >> At state convergence point (see code at line G), since one of the c... > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Restricting the patch to only bi-morphic inlining crash fix. Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/18282#pullrequestreview-1944229136 From jbhateja at openjdk.org Tue Mar 19 01:16:23 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 19 Mar 2024 01:16:23 GMT Subject: Integrated: 8319889: Vector API tests trigger VM crashes with -XX:+StressIncrementalInlining In-Reply-To: <25mULcFaZsaLw0z88ywNaq3virkrimZa-rpgWO8d3Dc=.04909c6c-42f4-4981-895d-2d7f07b027d5@github.com> References: <25mULcFaZsaLw0z88ywNaq3virkrimZa-rpgWO8d3Dc=.04909c6c-42f4-4981-895d-2d7f07b027d5@github.com> Message-ID: On Wed, 13 Mar 2024 17:19:40 GMT, Jatin Bhateja wrote: > This bug fix patch fixes a crash occurring due to combined effect of bi-morphic inlining, exception handling, randomized incremental inlining. In this case top level slice API is invoked using concrete 256 bit vector, some of the intermediate APIs within sliceTemplate are marked for lazy inlining due to randomized IncrementalInlining, these APIs returns an abstract vector which when used for virtual dispatch of subsequent APIs results into bi-morphic inlining on account of multiple profile based receiver types. Consider following code snippet. > > > ByteVector sliceTemplate(int origin, Vector v1) { > ByteVector that = (ByteVector) v1; > that.check(this); > Objects.checkIndex(origin, length() + 1); > VectorShuffle iota = iotaShuffle(); > VectorMask blendMask = iota.toVector().compare(VectorOperators.LT, (broadcast((byte)(length() - origin)))); [A] > iota = iotaShuffle(origin, 1, true); [B] > return that.rearrange(iota).blend(this.rearrange(iota), blendMask); [C] > } > > > > Receiver for sliceTemplate is a 256 bit vector, parser defers inlining of toVector() API (see code at line A) and generates a Call IR returning an abstract vector. This abstract vector then virtually dispatches compare API. 
Compiler observes multiple profile based receiver types (128 and 256 bit byte vectors) for compare API and parser generates a chain of PredictedCallGenerators for bi-morphically inlining it. > > PredictedCallGenerators (Vector.compare) > PredictedCallGenerators (Byte256Vector.compare) > ParseGenerator (Byte256Vector.compare) [D] > UncommonTrap (receiver other than Byte256Vector) > PredictedCallGenerators (Byte128Vector.compare) > ParseGenerator (Byte128Vector.compare) [E] > UncommonTrap (receiver other than Byte128Vector) [F] > PredictedCallGenerators (UncommonTrap) > [converged state] = Merge JVM State orginating from C and E [G] > > Since top level receiver of sliceTemplate is Byte256Vector hence while executing the call generator for Byte128Vector.compare (see code at line E) compiler observes a mismatch b/w incoming argument species i.e. one argument is a 256 bit vector while other is 128 bit vector and throws an exception. > > At state convergence point (see code at line G), since one of the control path resulted into an exception, compiler propagates ... This pull request has now been integrated. Changeset: 2dd5fba3 Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/2dd5fba3bd37c577b8442b67a67dbcb22b9a530e Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8319889: Vector API tests trigger VM crashes with -XX:+StressIncrementalInlining Reviewed-by: vlivanov, sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/18282 From gcao at openjdk.org Tue Mar 19 04:00:15 2024 From: gcao at openjdk.org (Gui Cao) Date: Tue, 19 Mar 2024 04:00:15 GMT Subject: RFR: 8328404: RISC-V: Fix potential crash in C2_MacroAssembler::arrays_equals Message-ID: Hi, The current behavior of C2_MacroAssembler::arrays_equals always load longword before comparison. 
When array[0] is aligned to 32-bit (especially after JDK-8139457 which tries to relax alignment of array elements), the last longword load will exceed the array limit and may touch the next word beyond the object layout in heap memory. So this likely has a similar problem to JDK-8328138. The proposed fix changes this behavior and aligns with the handling in C2_MacroAssembler::string_equals, which checks the number of remaining array elements before loading the next longword. No obvious changes were observed in the JMH numbers or benchmarks like SPECjbb2015. The patch also removes the AvoidUnalignedAccesses check in C2_MacroAssembler::string_equals, as we don't see extra performance gain when setting AvoidUnalignedAccesses to false when testing the JMH tests or benchmarks like SPECjbb2015 on three popular RISC-V hardware platforms. We can consider adding it back if it turns out to be useful on future hardware. ### Correctness test: - [x] Run tier1-3, hotspot:tier4 tests on LicheePi 4A (release) - [x] Run tier1-3, hotspot:tier4 tests on SOPHON SG2042 (release) ### JMH test: #### 1. test/micro/org/openjdk/bench/java/util/ArraysEquals.java 1. SiFive unmatched Before: Benchmark Mode Cnt Score Error Units ArraysEquals.testByteFalseBeginning avgt 12 37.804 ± 7.292 ns/op ArraysEquals.testByteFalseEnd avgt 12 77.972 ± 3.208 ns/op ArraysEquals.testByteFalseMid avgt 12 54.427 ± 6.436 ns/op ArraysEquals.testByteTrue avgt 12 75.121 ± 5.172 ns/op ArraysEquals.testCharFalseBeginning avgt 12 42.486 ± 6.526 ns/op ArraysEquals.testCharFalseEnd avgt 12 122.208 ± 2.533 ns/op ArraysEquals.testCharFalseMid avgt 12 83.891 ± 3.680 ns/op ArraysEquals.testCharTrue avgt 12 122.096 ± 5.519 ns/op After: Benchmark Mode Cnt Score Error Units ArraysEquals.testByteFalseBeginning avgt 12 32.638 ± 7.279 ns/op ArraysEquals.testByteFalseEnd avgt 12 73.013 ± 8.081 ns/op ArraysEquals.testByteFalseMid avgt 12 43.619 ± 6.104 ns/op ArraysEquals.testByteTrue avgt 12 83.044 ±
8.207 ns/op ArraysEquals.testCharFalseBeginning avgt 12 39.154 ± 5.233 ns/op ArraysEquals.testCharFalseEnd avgt 12 122.072 ± 7.784 ns/op ArraysEquals.testCharFalseMid avgt 12 67.831 ± 9.218 ns/op ArraysEquals.testCharTrue avgt 12 129.873 ± 7.910 ns/op 2. LicheePi 4A Before: Benchmark Mode Cnt Score Error Units ArraysEquals.testByteFalseBeginning avgt 12 24.198 ± 0.361 ns/op ArraysEquals.testByteFalseEnd avgt 12 35.890 ± 4.388 ns/op ArraysEquals.testByteFalseMid avgt 12 27.881 ± 0.828 ns/op ArraysEquals.testByteTrue avgt 12 32.596 ± 1.529 ns/op ArraysEquals.testCharFalseBeginning avgt 12 27.159 ± 1.878 ns/op ArraysEquals.testCharFalseEnd avgt 12 66.668 ± 0.476 ns/op ArraysEquals.testCharFalseMid avgt 12 32.748 ± 0.029 ns/op ArraysEquals.testCharTrue avgt 12 66.951 ± 0.620 ns/op After: Benchmark Mode Cnt Score Error Units ArraysEquals.testByteFalseBeginning avgt 12 22.394 ± 0.073 ns/op ArraysEquals.testByteFalseEnd avgt 12 31.721 ± 0.052 ns/op ArraysEquals.testByteFalseMid avgt 12 26.455 ± 0.027 ns/op ArraysEquals.testByteTrue avgt 12 31.941 ± 0.040 ns/op ArraysEquals.testCharFalseBeginning avgt 12 23.991 ± 0.031 ns/op ArraysEquals.testCharFalseEnd avgt 12 53.336 ± 0.082 ns/op ArraysEquals.testCharFalseMid avgt 12 31.744 ± 0.060 ns/op ArraysEquals.testCharTrue avgt 12 60.493 ± 0.760 ns/op 3. SOPHON SG2042 Before: Benchmark Mode Cnt Score Error Units ArraysEquals.testByteFalseBeginning avgt 12 22.569 ± 0.025 ns/op ArraysEquals.testByteFalseEnd avgt 12 29.629 ± 0.182 ns/op ArraysEquals.testByteFalseMid avgt 12 25.582 ± 0.024 ns/op ArraysEquals.testByteTrue avgt 12 29.587 ± 0.025 ns/op ArraysEquals.testCharFalseBeginning avgt 12 23.609 ± 0.181 ns/op ArraysEquals.testCharFalseEnd avgt 12 62.853 ± 0.247 ns/op ArraysEquals.testCharFalseMid avgt 12 30.623 ± 0.185 ns/op ArraysEquals.testCharTrue avgt 12 61.622 ± 0.856 ns/op After: Benchmark Mode Cnt Score Error Units ArraysEquals.testByteFalseBeginning avgt 12 20.569 ±
0.018 ns/op ArraysEquals.testByteFalseEnd avgt 12 29.646 ± 0.183 ns/op ArraysEquals.testByteFalseMid avgt 12 24.832 ± 0.265 ns/op ArraysEquals.testByteTrue avgt 12 29.682 ± 0.265 ns/op ArraysEquals.testCharFalseBeginning avgt 12 22.663 ± 0.335 ns/op ArraysEquals.testCharFalseEnd avgt 12 53.716 ± 0.197 ns/op ArraysEquals.testCharFalseMid avgt 12 29.585 ± 0.021 ns/op ArraysEquals.testCharTrue avgt 12 59.231 ± 0.213 ns/op #### test/micro/org/openjdk/bench/java/lang/StringEquals.java 1. Sifive unmatched Before: Benchmark Mode Cnt Score Error Units StringEquals.almostEqual avgt 15 40.143 ± 3.819 ns/op StringEquals.almostEqualUTF16 avgt 15 40.154 ± 3.903 ns/op StringEquals.different avgt 15 29.653 ± 4.452 ns/op StringEquals.differentCoders avgt 15 19.452 ± 4.964 ns/op StringEquals.equal avgt 15 41.975 ± 3.997 ns/op StringEquals.equalsUTF16 avgt 15 43.959 ± 2.417 ns/op After: Benchmark Mode Cnt Score Error Units StringEquals.almostEqual avgt 15 40.542 ± 4.384 ns/op StringEquals.almostEqualUTF16 avgt 15 40.140 ± 4.947 ns/op StringEquals.different avgt 15 24.935 ± 4.487 ns/op StringEquals.differentCoders avgt 15 20.186 ± 5.149 ns/op StringEquals.equal avgt 15 38.246 ± 4.405 ns/op StringEquals.equalsUTF16 avgt 15 36.506 ± 4.278 ns/op 2. LicheePi 4A Before: Benchmark Mode Cnt Score Error Units StringEquals.almostEqual avgt 15 26.797 ± 0.070 ns/op StringEquals.almostEqualUTF16 avgt 15 26.796 ± 0.039 ns/op StringEquals.different avgt 15 23.521 ± 0.050 ns/op StringEquals.differentCoders avgt 15 20.237 ± 0.043 ns/op StringEquals.equal avgt 15 31.150 ± 0.512 ns/op StringEquals.equalsUTF16 avgt 15 30.699 ± 0.133 ns/op After: Benchmark Mode Cnt Score Error Units StringEquals.almostEqual avgt 15 26.433 ± 0.065 ns/op StringEquals.almostEqualUTF16 avgt 15 26.580 ± 0.299 ns/op StringEquals.different avgt 15 23.501 ± 0.042 ns/op StringEquals.differentCoders avgt 15 20.243 ± 0.043 ns/op StringEquals.equal avgt 15 31.944 ± 0.816 ns/op StringEquals.equalsUTF16 avgt 15 32.699 ± 0.466 ns/op 3.
SOPHON SG2042 Before: Benchmark Mode Cnt Score Error Units StringEquals.almostEqual avgt 15 25.279 ± 0.260 ns/op StringEquals.almostEqualUTF16 avgt 15 25.623 ± 0.060 ns/op StringEquals.different avgt 15 22.267 ± 0.510 ns/op StringEquals.differentCoders avgt 15 19.072 ± 0.029 ns/op StringEquals.equal avgt 15 30.157 ± 0.292 ns/op StringEquals.equalsUTF16 avgt 15 30.152 ± 0.293 ns/op After: Benchmark Mode Cnt Score Error Units StringEquals.almostEqual avgt 15 24.956 ± 0.261 ns/op StringEquals.almostEqualUTF16 avgt 15 25.288 ± 0.074 ns/op StringEquals.different avgt 15 22.588 ± 0.050 ns/op StringEquals.differentCoders avgt 15 19.109 ± 0.150 ns/op StringEquals.equal avgt 15 26.108 ± 0.047 ns/op StringEquals.equalsUTF16 avgt 15 25.940 ± 0.263 ns/op ------------- Commit messages: - 8328404: RISC-V: Fix potential crash in C2_MacroAssembler::arrays_equals Changes: https://git.openjdk.org/jdk/pull/18370/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18370&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8328404 Stats: 159 lines in 3 files changed: 42 ins; 59 del; 58 mod Patch: https://git.openjdk.org/jdk/pull/18370.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18370/head:pull/18370 PR: https://git.openjdk.org/jdk/pull/18370 From jkarthikeyan at openjdk.org Tue Mar 19 05:11:21 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 19 Mar 2024 05:11:21 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v4] In-Reply-To: References: Message-ID: <_Yu40XYH42rezeLmpW4jc39aKsOJVnON-dHtcTQKNKU=.9dbec056-cf53-44dd-916b-0b2dd30d0766@github.com> On Wed, 13 Mar 2024 02:05:39 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformation of expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in the `BoolNode::Ideal` function, with a new constant node of value `1` created.
However, this is technically a type-improving (reduction in range) transformation that's better suited to the `BoolNode::Value` function. >> >> A new unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and the correctness of this transformation has been added and is passing. > > Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: > > update the package name for tests Ah, I had assumed the transformation was valid beforehand, since it had existed for a while. The issue only impacts `-1`, right? Since the comparison should succeed for both `m >= 0` and `m < -1`. I think it would be good to address it in this patch, as it's refactoring the existing code. [The original patch](https://github.com/openjdk/jdk/commit/415eda1274fce4ddc78dd2221abe2ce61f7ab7f2) seems to primarily test with `array.length` as the `m` value, so the value set was nonnegative. I think we can limit the `((x & m) u< m + 1)` transform to cases where `m` is known to be nonnegative and maintain the intent behind the transform. Something like: -} else if (_test._test == BoolTest::lt && cmp2->Opcode() == Op_AddI && cmp2->in(2)->find_int_con(0) == 1) { +} else if (_test._test == BoolTest::lt && cmp2->Opcode() == Op_AddI && cmp2->in(2)->find_int_con(0) == 1 && phase->type(cmp2->in(1))->is_int()->_lo >= 0) { The IR test would be modified accordingly. It'd also be good to write an IR test method that verifies that the transform doesn't take place if `m` doesn't pass the `_lo >= 0` test.
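For the record, the `m == -1` corner case discussed above can be demonstrated with `Integer.compareUnsigned` in plain Java (an illustrative sketch, separate from the C2 code; the class and method names are hypothetical):

```java
public class UnsignedCmpDemo {
    // (x & m) u<= m holds for every x and m: masking can only clear bits,
    // so the result is never unsigned-greater than the mask itself.
    static boolean maskedLeMask(int x, int m) {
        return Integer.compareUnsigned(x & m, m) <= 0;
    }

    // The rewritten form (x & m) u< m + 1 is only equivalent when m + 1
    // does not wrap. For m == -1, m + 1 overflows to 0, and no value is
    // unsigned-less-than 0, so the rewritten test is false while the
    // original test is true.
    static boolean maskedLtMaskPlusOne(int x, int m) {
        return Integer.compareUnsigned(x & m, m + 1) < 0;
    }

    public static void main(String[] args) {
        System.out.println(maskedLeMask(123, -1));         // true
        System.out.println(maskedLtMaskPlusOne(123, -1));  // false (the broken case)
        System.out.println(maskedLtMaskPlusOne(123, 255)); // true  (nonnegative m is fine)
    }
}
```

This is exactly why guarding the transform with `_lo >= 0` on `m` preserves the original intent.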
------------- PR Comment: https://git.openjdk.org/jdk/pull/18198#issuecomment-2005774049 From epeter at openjdk.org Tue Mar 19 06:51:20 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 19 Mar 2024 06:51:20 GMT Subject: RFR: 8323972: C2 compilation fails with assert(!x->as_Loop()->is_loop_nest_inner_loop()) failed: loop was transformed In-Reply-To: References: <_Qm0japYCaj72QGczOLYyKgmkaAA4P5AhG6QPfmd3Ys=.d0e57b13-4cb2-4f05-b902-e655d7f2a123@github.com> Message-ID: On Mon, 18 Mar 2024 17:23:10 GMT, Roland Westrelin wrote: >>> Looks reasonable, but these ad-hoc CastII also make me nervous. >> >> I agree with that. Still feels like the most reasonable fix for this particular issue. > >> @rwestrel please wait for our testing to complete, I just launched it. > > Thanks for running it. Any update on testing? @rwestrel Yes, the tests are passing! Ship it! ------------- PR Comment: https://git.openjdk.org/jdk/pull/17965#issuecomment-2005971158 From roland at openjdk.org Tue Mar 19 07:59:26 2024 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 19 Mar 2024 07:59:26 GMT Subject: RFR: 8323972: C2 compilation fails with assert(!x->as_Loop()->is_loop_nest_inner_loop()) failed: loop was transformed In-Reply-To: References: Message-ID: <5rdccw6S4Vh366jUBchLrKrtNUKjJLmwQ38iPEWuBYg=.0d018d93-a095-4971-9459-391a86827b6c@github.com> On Wed, 6 Mar 2024 14:50:30 GMT, Christian Hagedorn wrote: >> Long counted loops are transformed into a loop nest of 2 "regular" >> loops and in a subsequent loop opts round, the inner loop is >> transformed into a counted loop.
The reason for that is that the limit is >> transformed, between nest creation and counted loop creation, in a way >> that the range of values of the inner loop's limit becomes >> unknown. The limit when the nest is created is: >> >> >> 111 ConL === 0 [[ 112 ]] #long:-9223372034707292158 >> 106 Phi === 105 20 94 [[ 112 ]] #long:9223372034707292160..9223372034707292164:www !orig=72 !jvms: TestInaccurateInnerLoopLimit::test @ bci:12 (line 40) >> 112 AddL === _ 106 111 [[ 122 ]] !orig=[110] >> 122 ConvL2I === _ 112 [[ ]] #int >> >> >> The type of 122 is `2..6` but it is then transformed to: >> >> >> 106 Phi === 105 20 154 [[ 191 130 137 ]] #long:9223372034707292160..9223372034707292164:www !orig=[72] !jvms: TestInaccurateInnerLoopLimit::test @ bci:12 (line 40) >> 191 ConvL2I === _ 106 [[ 196 ]] #int >> 195 ConI === 0 [[ 196 ]] #int:max-1 >> 196 SubI === _ 195 191 [[ 201 127 ]] !orig=[123] >> >> >> That is the `(ConvL2I (AddL ...))` is transformed into a `(SubI >> (ConvL2I ))`. `ConvL2I` for an input that's out of the int range of >> values returns TypeInt::INT and the bounds of the limit are lost. I >> propose adding a `CastII` after the `ConvL2I` so the range of values >> of the limit doesn't get lost. > > Marked as reviewed by chagedorn (Reviewer). @chhagedorn @eme64 thanks for the reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/17965#issuecomment-2006179280 From roland at openjdk.org Tue Mar 19 07:59:27 2024 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 19 Mar 2024 07:59:27 GMT Subject: Integrated: 8323972: C2 compilation fails with assert(!x->as_Loop()->is_loop_nest_inner_loop()) failed: loop was transformed In-Reply-To: References: Message-ID: On Thu, 22 Feb 2024 14:36:52 GMT, Roland Westrelin wrote: > Long counted loop are transformed into a loop nest of 2 "regular" > loops and in a subsequent loop opts round, the inner loop is > transformed into a counted loop. 
The limit for the inner loop is set, > when the loop nest is created, so it's expected there's no need for a > loop limit check when the counted loop is created. The assert fires > because, when the counted loop is created, it is found that it needs a > loop limit check. The reason for that is that the limit is > transformed, between nest creation and counted loop creation, in a way > that the range of values of the inner loop's limit becomes > unknown. The limit when the nest is created is: > > > 111 ConL === 0 [[ 112 ]] #long:-9223372034707292158 > 106 Phi === 105 20 94 [[ 112 ]] #long:9223372034707292160..9223372034707292164:www !orig=72 !jvms: TestInaccurateInnerLoopLimit::test @ bci:12 (line 40) > 112 AddL === _ 106 111 [[ 122 ]] !orig=[110] > 122 ConvL2I === _ 112 [[ ]] #int > > > The type of 122 is `2..6` but it is then transformed to: > > > 106 Phi === 105 20 154 [[ 191 130 137 ]] #long:9223372034707292160..9223372034707292164:www !orig=[72] !jvms: TestInaccurateInnerLoopLimit::test @ bci:12 (line 40) > 191 ConvL2I === _ 106 [[ 196 ]] #int > 195 ConI === 0 [[ 196 ]] #int:max-1 > 196 SubI === _ 195 191 [[ 201 127 ]] !orig=[123] > > > That is the `(ConvL2I (AddL ...))` is transformed into a `(SubI > (ConvL2I ))`. `ConvL2I` for an input that's out of the int range of > values returns TypeInt::INT and the bounds of the limit are lost. I > propose adding a `CastII` after the `ConvL2I` so the range of values > of the limit doesn't get lost. This pull request has now been integrated. 
Changeset: e1b0af29 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/e1b0af29e47b46879defce1fc44c30d4d50d0c31 Stats: 51 lines in 2 files changed: 50 ins; 0 del; 1 mod 8323972: C2 compilation fails with assert(!x->as_Loop()->is_loop_nest_inner_loop()) failed: loop was transformed Reviewed-by: chagedorn, epeter ------------- PR: https://git.openjdk.org/jdk/pull/17965 From roland at openjdk.org Tue Mar 19 08:00:27 2024 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 19 Mar 2024 08:00:27 GMT Subject: Integrated: 8308660: C2 compilation hits 'node must be dead' assert In-Reply-To: References: Message-ID: <76I7aZVjWcbCQRtFNtGRZ_jaLa4iMtX1Y5jyEvJalUo=.beff5d1a-001c-4794-89dc-9ba97119b59d@github.com> On Thu, 14 Mar 2024 14:12:09 GMT, Roland Westrelin wrote: > In `IfNode::fold_compares_helper()`, `adjusted_val` is: > > > (SubI (AddI top constant) 0) > > > which is then transformed to the `top` node. The code, next, tries to > destroy the `adjusted_val` node i.e. the `top` node. That results in > the assert failure. Given we're trying to fold 2 ifs in a dying part > of the graph, the fix is straightforward: test `adjusted_val` for top > and bail out from the transformation if that's the case. This pull request has now been integrated. Changeset: 053ff76e Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/053ff76e14046f796f6e10a9cb2ede1f1ae22ed6 Stats: 73 lines in 2 files changed: 73 ins; 0 del; 0 mod 8308660: C2 compilation hits 'node must be dead' assert Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/18305 From duke at openjdk.org Tue Mar 19 11:16:42 2024 From: duke at openjdk.org (ArsenyBochkarev) Date: Tue, 19 Mar 2024 11:16:42 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v9] In-Reply-To: References: Message-ID: > Hi everyone!
Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. > > ### Correctness checks > > Tier 1/2 tests are ok. > > ### Performance results on T-Head board > > #### Results for enabled intrinsic: > > Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --- | ---- | ----- | --- | ---- | --- | ---- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | > > #### Results for disabled intrinsic: > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | 
thrpt | 15 | 6.196 | 0.030 | ops/ms | ArsenyBochkarev has updated the pull request incrementally with one additional commit since the last revision: Optimize last 'upper' load in update_word_crc32 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17046/files - new: https://git.openjdk.org/jdk/pull/17046/files/d640af0e..654b25b7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17046&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17046&range=07-08 Stats: 10 lines in 1 file changed: 7 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/17046.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17046/head:pull/17046 PR: https://git.openjdk.org/jdk/pull/17046 From duke at openjdk.org Tue Mar 19 11:53:26 2024 From: duke at openjdk.org (ArsenyBochkarev) Date: Tue, 19 Mar 2024 11:53:26 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v9] In-Reply-To: References: Message-ID: On Tue, 19 Mar 2024 11:16:42 GMT, ArsenyBochkarev wrote: >> Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. >> >> ### Correctness checks >> >> Tier 1/2 tests are ok. 
>> >> ### Performance results on T-Head board >> >> #### Results for enabled intrinsic: >> >> Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` >> >> | Benchmark | (count) | Mode | Cnt | Score | Error | Units | >> | --- | ---- | ----- | --- | ---- | --- | ---- | >> | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | >> >> #### Results for disabled intrinsic: >> >> | Benchmark | (count) | Mode | Cnt | Score | Error | Units | >> | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | >> | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | > > ArsenyBochkarev has updated the pull request incrementally with one additional commit since the last revision: > > Optimize last 'upper' load in update_word_crc32 I managed to get some additional acceleration for cases with `Zba` enabled. 
Updated data for StarFive VisionFive2: | Benchmark | (count) | Mode | Cnt | Score | Error | Units | | ---------------------------- | --------------- | --------- | ----- | ----------- | -------- | --------- | | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 4231.837 | 12.249 | ops/ms | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 2678.843 | 1.631 | ops/ms | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 1405.024 | 6.509 | ops/ms | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 727.608 | 1.393 | ops/ms | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 186.552 | 0.389 | ops/ms | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 23.423 | 0.087 | ops/ms | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 5.493 | 0.015 | ops/ms Results for disabled intrinsic are [here](https://github.com/openjdk/jdk/pull/17046#issuecomment-1850364667) > performance numbers on unmatched Current data for HiFive Unmatched (no `Zba` here!): Enabled intrinsic: | Benchmark | (count) | Mode | Cnt | Score | Error | Units | | -------------------------------- | ---------- | -------- | ------ | ------- | ----------- | ------- | | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 3180.082 | ± 63.442 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 1936.728 | ± 17.332 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 1019.500 | ± 5.038 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 527.775 | ± 2.059 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 135.190 | ± 0.279 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 16.996 | ± 0.066 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 3.877 | ± 0.011 | ops/ms | Disabled intrinsic: | Benchmark | (count) | Mode | Cnt | Score | Error | Units | | ------ | ------------ | ----------- | -------- | -------- |-------- | ----- | | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 992.300 | ±
17.666 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 818.234 | ± 9.767 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 605.509 | ± 14.685 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 402.414 | ± 4.331 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 134.390 | ± 1.399 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 18.619 | ± 0.104 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 4.229 | ± 0.020 | ops/ms | ------------- PR Comment: https://git.openjdk.org/jdk/pull/17046#issuecomment-2006979085 From roland at openjdk.org Tue Mar 19 13:27:31 2024 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 19 Mar 2024 13:27:31 GMT Subject: RFR: 8324517: C2: crash in compiled code because of dependency on removed range check CastIIs Message-ID: Range check `CastII` nodes are removed once loop opts are over. The test case for this change includes 3 cases where elimination of a range check `CastII` causes a crash in compiled code because either an out-of-bounds array load or a division by zero happens. In `test1`: - the range checks for the `array[otherArray.length]` loads constant fold: `otherArray.length` is a `CastII` of `i` at the `otherArray` allocation. `i` is less than 9. The `CastII` at the allocation narrows the type down further to `[0-9]`. - the `array[otherArray.length]` loads are control dependent on the unrelated: if (flag == 0) { test. There's an identical dominating test which replaces that one. As a consequence, the `array[otherArray.length]` loads become control dependent on the dominating test.
- The `CastII` nodes at the `otherArray` allocations are replaced by dominating range check `CastII` nodes for: newArray[i] = 42; - After loop opts, the range check `CastII` nodes are removed and the 2 `array[otherArray.length]` loads common at the first: if (flag == 0) { test before the: float[] otherArray = new float[i]; and newArray[i] = 42; that guarantee `i` is positive. - `test1` is called with `i = -1`; the array load proceeds with an out-of-bounds index and the crash occurs. `test2` and `test3` are mostly identical except for the check that's eliminated (a null divisor check) and the instruction that causes a fault (an integer division). The fix I propose is to not eliminate range check `CastII` nodes after loop opts. When range check `CastII` nodes were introduced, performance was observed to regress. Removing them after loop opts was found to preserve both correctness and performance. Today, the performance regression still exists when `CastII` nodes are left in. So I propose we keep them until the end of optimizations (so the 2 array loads above don't lose a dependency and wrongly common) but remove them at the end of all optimizations. In the case of the array loads, they are dependent on a range check for another array through a range check `CastII` and we must not lose that dependency, otherwise the array loads could float above the range check at gcm time. I propose we deal with that problem the way it's handled for `CastPP` nodes: add the dependency to the load (or division) nodes as a precedence edge when the cast is removed. @TobiHartmann ran performance testing for that patch (Thanks!) and reported no regression.
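The `test1` shape described above can be sketched in plain Java as follows (a hypothetical simplification for illustration only; the method and array names are invented here, and this is not the regression test added by the patch):

```java
public class CastIIShapeSketch {
    // Hypothetical sketch of the test1 shape: both loads of
    // array[otherArray.length] are control dependent on identical
    // `flag == 0` tests, and otherArray.length is known to be in range
    // only through the range-checked CastII of i at the allocation and
    // at `newArray[i] = 42`.
    static float test(int i, int flag, float[] array, int[] newArray) {
        float v = 0;
        if (i < 9) {
            float[] otherArray = new float[i]; // CastII narrows i
            newArray[i] = 42;                  // range check on i
            if (flag == 0) {
                v += array[otherArray.length];
            }
            if (flag == 0) {                   // identical dominating test
                v += array[otherArray.length];
            }
        }
        return v;
    }

    public static void main(String[] args) {
        float[] array = new float[16];
        array[3] = 1.5f;
        System.out.println(test(3, 0, array, new int[16])); // prints 3.0
    }
}
```

When called with `i = -1`, correct execution must throw before either load runs (`new float[-1]` fails); the bug described above allowed the commoned load to be scheduled above the checks that guarantee `i` is positive.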
------------- Commit messages: - test and fix Changes: https://git.openjdk.org/jdk/pull/18377/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18377&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8324517 Stats: 212 lines in 4 files changed: 186 ins; 23 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18377.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18377/head:pull/18377 PR: https://git.openjdk.org/jdk/pull/18377 From qamai at openjdk.org Tue Mar 19 13:48:25 2024 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 19 Mar 2024 13:48:25 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v4] In-Reply-To: <_Yu40XYH42rezeLmpW4jc39aKsOJVnON-dHtcTQKNKU=.9dbec056-cf53-44dd-916b-0b2dd30d0766@github.com> References: <_Yu40XYH42rezeLmpW4jc39aKsOJVnON-dHtcTQKNKU=.9dbec056-cf53-44dd-916b-0b2dd30d0766@github.com> Message-ID: <90NYbXp5PvRgwz2j4vpaZKlRF6VQ_8FXENW-JvqOzHM=.a3634d68-6b5d-4724-9fba-7f6d7611a4d2@github.com> On Tue, 19 Mar 2024 05:08:53 GMT, Jasmine Karthikeyan wrote: >> Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: >> >> update the package name for tests > > Ah, I had assumed the transformation was valid beforehand, since it had existed for a while ? The issue only impacts `-1`, right? Since the comparison should succeed for both `m >=0` and `m < -1`. I think it would be good to address it in this patch, as it's refactoring the existing code. > > [The original patch](https://github.com/openjdk/jdk/commit/415eda1274fce4ddc78dd2221abe2ce61f7ab7f2) seems to primarily test with `array.length` as the `m` value, so the value set was nonnegative. I think we can limit the `((x & m) u< m + 1)` transform to cases where `m` is known to be nonnegative and maintain the intent behind the transform. 
Something like: > > -} else if (_test._test == BoolTest::lt && cmp2->Opcode() == Op_AddI && cmp2->in(2)->find_int_con(0) == 1) { > +} else if (_test._test == BoolTest::lt && cmp2->Opcode() == Op_AddI && cmp2->in(2)->find_int_con(0) == 1 && phase->type(cmp2->in(1))->is_int()->_lo >= 0) { > > With the IR test being modified accordingly. It'd also be good to write an IR test method that verifies that the transform doesn't take place if `m` doesn't pass the `_lo >= 0` test. @jaskarth Optimally, `(x & m) u< m + 1` can be transformed into `m != -1` but I think limiting it to non-negative `m` seems to be a reasonable approach. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18198#issuecomment-2007221375 From mli at openjdk.org Tue Mar 19 14:20:36 2024 From: mli at openjdk.org (Hamlin Li) Date: Tue, 19 Mar 2024 14:20:36 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v6] In-Reply-To: References: Message-ID: <1T0FRJWu0I5-AHFhl03RtTdTNQRR_rN9zZBra3TDuZ8=.926b109c-a23f-4a38-aea1-3a7862c3fadb@github.com> On Mon, 18 Mar 2024 13:27:35 GMT, Emanuel Peter wrote: > Thanks for the changes, the comments help! > > You could still use full names, rather than `e` and `f`. > > Left a few comments. There are a few int/long issues like this: > > ``` > jshell> int x = 52; > x ==> 52 > > jshell> long y = 1 << x; > y ==> 1048576 > > jshell> long y = 1L << x > y ==> 4503599627370496 > ``` > > I think I understand how you want to do the generation now. But as you see, there was a bug in it. And it was hard to find. That is why I was asking for just pure random value generation. Constructing values ourselves often goes wrong, and then the test-coverage drops quickly. > Thanks a lot for catching this! > Also: what is the probability that you will ever generate an `infty` or a `NaN` for doubles? Can you give me an estimate?
For inf, two values are generated in every run (+inf, -inf); for NaN, the probability is (eStep/eBound) (but depends on the random value); in the new version I modified it to make sure NaN is tested, so it's (eStep/eBound) now. > test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 57: > >> 55: public static void main(String args[]) { >> 56: TestFramework.runWithFlags("-XX:-TieredCompilation", "-XX:CompileThresholdScaling=0.3", "-XX:MaxVectorSize=16"); >> 57: TestFramework.runWithFlags("-XX:-TieredCompilation", "-XX:CompileThresholdScaling=0.3", "-XX:MaxVectorSize=32"); > > You should either drop the `"-XX:MaxVectorSize=16"`, or at least have a run without this flag. > There are machines with higher max vector length, e.g. AVX512 has `64`. Would be nice to test those too ;) Thanks for the suggestion. Unfortunately, I don't have access to a machine with AVX512, but I do run with an aarch64 via qemu where max vector size > 16, and it works with "-XX:MaxVectorSize=16". The reason the previous test failed with "-XX:MaxVectorSize=8" (which I fixed in a previous commit) is that the test framework checks the vector length and makes sure it is greater than the length of a double (8 bytes), i.e. at least 2*(length of Double). > test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 131: > >> 129: // negative values >> 130: bits = bits | (1 << 63); >> 131: input[eiIdx*2+1] = Double.longBitsToDouble(bits); > > Is this some sort of luck, or why is this always in bounds?
This is always within the bounds of the array, because it's calculated that way: eBound = 1 << 11, max(eiIdx) == eBound/8 == 256, max(eiIdx*2+1) == (256*2+1) == 513; ------------- PR Comment: https://git.openjdk.org/jdk/pull/17753#issuecomment-2007304368 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1530464928 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1530465106 From mli at openjdk.org Tue Mar 19 14:20:36 2024 From: mli at openjdk.org (Hamlin Li) Date: Tue, 19 Mar 2024 14:20:36 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v7] In-Reply-To: References: Message-ID: <9cGvWYaruKa-dPykr4DvMKbatRTlE57fy5sCtlZ--v0=.f68b8cef-98e1-40cb-96fc-cd6fdcafb242@github.com> > Hi, > Can you have a look at this patch adding some tests for Math.round intrinsics? > Thanks! > > ### FYI: > During the development of RoundVF/RoundF, we faced issues which were only spotted by running tests exhaustively against the 32/64-bit range of int/long. > It's helpful to add these exhaustive tests to the jdk for possible future usage, rather than building them every time they are needed. > Of course, we need to put them in `manual` mode, so they are not run when the `-automatic` jtreg option is specified, which I guess is the mode the CI uses; please correct me if I'm assuming incorrectly.
Hamlin Li has updated the pull request incrementally with three additional commits since the last revision: - make sure NaN, Inf are tested in Double test - add pure random tests; fix golden_round - minor comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17753/files - new: https://git.openjdk.org/jdk/pull/17753/files/3f50c062..eb7dab62 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17753&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17753&range=05-06 Stats: 77 lines in 2 files changed: 61 ins; 0 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/17753.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17753/head:pull/17753 PR: https://git.openjdk.org/jdk/pull/17753 From duke at openjdk.org Tue Mar 19 14:34:21 2024 From: duke at openjdk.org (ArsenyBochkarev) Date: Tue, 19 Mar 2024 14:34:21 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v9] In-Reply-To: References: Message-ID: On Tue, 19 Mar 2024 11:16:42 GMT, ArsenyBochkarev wrote: >> Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. >> >> ### Correctness checks >> >> Tier 1/2 tests are ok. 
>> >> ### Performance results on T-Head board >> >> #### Results for enabled intrinsic: >> >> Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` >> >> | Benchmark | (count) | Mode | Cnt | Score | Error | Units | >> | --- | ---- | ----- | --- | ---- | --- | ---- | >> | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | >> >> #### Results for disabled intrinsic: >> >> | Benchmark | (count) | Mode | Cnt | Score | Error | Units | >> | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | >> | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | > > ArsenyBochkarev has updated the pull request incrementally with one additional commit since the last revision: > > Optimize last 'upper' load in update_word_crc32 Perhaps we should do this intrinsic Zba-exclusive? 
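For context on what all of these benchmark variants compute: the intrinsic accelerates standard CRC-32 over the reflected polynomial `0xEDB88320`, the same function `java.util.zip.CRC32` exposes. A minimal bit-at-a-time Java sketch (illustrative only; the actual stub consumes whole words through lookup tables, and the discussion above concerns how much `Zba`/`Zbc` help that version) is:

```java
import java.util.zip.CRC32;

public class Crc32Sketch {
    // Bit-at-a-time CRC-32 over the reflected polynomial 0xEDB88320.
    // The HotSpot stub computes the same function, but word-at-a-time
    // via lookup tables, which is what the benchmarks above measure.
    static int crc32(byte[] data) {
        int crc = 0xFFFFFFFF;
        for (byte b : data) {
            crc ^= b & 0xFF;
            for (int i = 0; i < 8; i++) {
                crc = (crc >>> 1) ^ ((crc & 1) != 0 ? 0xEDB88320 : 0);
            }
        }
        return ~crc;
    }

    public static void main(String[] args) {
        byte[] msg = "123456789".getBytes();
        CRC32 reference = new CRC32();
        reference.update(msg, 0, msg.length);
        System.out.println(Integer.toHexString(crc32(msg)));        // cbf43926
        System.out.println(Long.toHexString(reference.getValue())); // cbf43926
    }
}
```

`0xCBF43926` is the standard CRC-32 check value for the ASCII string "123456789", so the sketch can be cross-checked against the library class directly.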
------------- PR Comment: https://git.openjdk.org/jdk/pull/17046#issuecomment-2007340102 From mli at openjdk.org Tue Mar 19 15:04:46 2024 From: mli at openjdk.org (Hamlin Li) Date: Tue, 19 Mar 2024 15:04:46 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v8] In-Reply-To: References: Message-ID: > Hi, > Can you have a look at this patch adding some tests for Math.round intrinsics? > Thanks! > > ### FYI: > During the development of RoundVF/RoundF, we faced issues which were only spotted by running tests exhaustively against the 32/64-bit range of int/long. > It's helpful to add these exhaustive tests to the jdk for possible future usage, rather than building them every time they are needed. > Of course, we need to put them in `manual` mode, so they are not run when the `-automatic` jtreg option is specified, which I guess is the mode the CI uses; please correct me if I'm assuming incorrectly.
strict/non-strict with more technical terms > > src/hotspot/share/opto/vectorIntrinsics.cpp line 2: > >> 1: /* >> 2: * Copyright (c) 2020, 2024, Oracle and/or its affiliates. All rights reserved. > > Sometimes you update ARM copyright, sometimes the Oracle one. Is that intended? Hi, I don't fully understand. I have updated Oracle's copyright year in this file and ARM copyright in the *ad files. Should I not be updating the Oracle's copyright year to 2024? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1530645420 From epeter at openjdk.org Tue Mar 19 16:15:23 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 19 Mar 2024 16:15:23 GMT Subject: RFR: 8320725: C2: Add "is_associative" flag for floating-point add-reduction [v3] In-Reply-To: References: Message-ID: <_kOSajjB3Fm4A_YjYEa07DIM4_59brIhAn5HcvO1yqI=.0518aa38-8a52-4f86-adf4-ae47969d4d9a@github.com> On Tue, 19 Mar 2024 15:44:55 GMT, Bhavana Kilambi wrote: >> src/hotspot/share/opto/vectorIntrinsics.cpp line 2: >> >>> 1: /* >>> 2: * Copyright (c) 2020, 2024, Oracle and/or its affiliates. All rights reserved. >> >> Sometimes you update ARM copyright, sometimes the Oracle one. Is that intended? > > Hi, I don't fully understand. I have updated Oracle's copyright year in this file and ARM copyright in the *ad files. Should I not be updating the Oracle's copyright year to 2024? I don't know what is the policy. 
It is probably ok, so never mind ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1530692659 From epeter at openjdk.org Tue Mar 19 16:19:21 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 19 Mar 2024 16:19:21 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v3] In-Reply-To: References: <6mb_BOei2bIRzPvulo4SkaWGa9EXjiBIFfKTIAAWdCU=.86b2b6f0-7e06-4b4d-9881-593577b43184@github.com> Message-ID: <89nYlsd2Lj9mt0Dy6ws1mP2sXEoyj5kGC6KlSvw-m9k=.cf67d0a4-3528-48e7-b4cf-864bf39b9711@github.com> On Wed, 13 Mar 2024 02:02:34 GMT, Kangcheng Xu wrote: >> Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: >> >> modification per code review suggestions > > Oops. Package name updated. Sorry for such a rookie mistake! @tabjy I am re-running testing, then will re-review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18198#issuecomment-2007605507 From duke at openjdk.org Tue Mar 19 17:07:46 2024 From: duke at openjdk.org (ArsenyBochkarev) Date: Tue, 19 Mar 2024 17:07:46 GMT Subject: RFR: 8317720: RISC-V: Implement Adler32 intrinsic Message-ID: Hello everyone! Please review this non-vectorized implementation of `_updateBytesAdler32` intrinsic. Reference implementation for AArch64 can be found [here](https://github.com/openjdk/jdk9/blob/master/hotspot/src/cpu/aarch64/vm/stubGenerator_aarch64.cpp#L3281). ### Correctness checks Test `test/hotspot/jtreg/compiler/intrinsics/zip/TestAdler32.java` is ok. All tier1 also passed. 
### Performance results on T-Head board Enabled intrinsic: | Benchmark | (count) | Mode | Cnt | Score | Error | Units | | ------------------------------------- | ----------- | ------ | --------- | ------ | --------- | ---------- | | Adler32.TestAdler32.testAdler32Update | 64 | thrpt | 25 | 5522.693 | 23.387 | ops/ms | | Adler32.TestAdler32.testAdler32Update | 128 | thrpt | 25 | 3430.761 | 9.210 | ops/ms | | Adler32.TestAdler32.testAdler32Update | 256 | thrpt | 25 | 1962.888 | 5.323 | ops/ms | | Adler32.TestAdler32.testAdler32Update | 512 | thrpt | 25 | 1050.938 | 0.144 | ops/ms | | Adler32.TestAdler32.testAdler32Update | 1024 | thrpt | 25 | 549.227 | 0.375 | ops/ms | | Adler32.TestAdler32.testAdler32Update | 2048 | thrpt | 25 | 280.829 | 0.170 | ops/ms | | Adler32.TestAdler32.testAdler32Update | 5012 | thrpt | 25 | 116.333 | 0.057 | ops/ms | | Adler32.TestAdler32.testAdler32Update | 8192 | thrpt | 25 | 71.392 | 0.060 | ops/ms | | Adler32.TestAdler32.testAdler32Update | 16384 | thrpt | 25 | 35.784 | 0.019 | ops/ms | | Adler32.TestAdler32.testAdler32Update | 32768 | thrpt | 25 | 17.924 | 0.010 | ops/ms | | Adler32.TestAdler32.testAdler32Update | 65536 | thrpt | 25 | 8.940 | 0.003 | ops/ms | Disabled intrinsic: | Benchmark | (count) | Mode | Cnt | Score | Error | Units | | ------------------------------------- | ----------- | ------ | --------- | ------ | --------- | ---------- | |Adler32.TestAdler32.testAdler32Update|64|thrpt|25|655.633|5.845|ops/ms| |Adler32.TestAdler32.testAdler32Update|128|thrpt|25|587.418|10.062|ops/ms| |Adler32.TestAdler32.testAdler32Update|256|thrpt|25|546.675|11.598|ops/ms| |Adler32.TestAdler32.testAdler32Update|512|thrpt|25|432.328|11.517|ops/ms| |Adler32.TestAdler32.testAdler32Update|1024|thrpt|25|311.771|4.238|ops/ms| |Adler32.TestAdler32.testAdler32Update|2048|thrpt|25|202.648|2.486|ops/ms| |Adler32.TestAdler32.testAdler32Update|5012|thrpt|25|100.246|1.119|ops/ms| |Adler32.TestAdler32.testAdler32Update|8192|thrpt|25|65.931|0.546|ops/ms| 
|Adler32.TestAdler32.testAdler32Update|16384|thrpt|25|34.570|0.353|ops/ms| |Adler32.TestAdler32.testAdler32Update|32768|thrpt|25|17.622|0.190|ops/ms| |Adler32.TestAdler32.testAdler32Update|65536|thrpt|25|8.895|0.087|ops/ms| ------------- Commit messages: - 8317720: RISC-V: Implement Adler32 intrinsic Changes: https://git.openjdk.org/jdk/pull/18382/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18382&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8317720 Stats: 274 lines in 2 files changed: 274 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18382.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18382/head:pull/18382 PR: https://git.openjdk.org/jdk/pull/18382 From yzheng at openjdk.org Tue Mar 19 19:06:36 2024 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 19 Mar 2024 19:06:36 GMT Subject: RFR: 8327964: Simplify BigInteger.implMultiplyToLen intrinsic [v2] In-Reply-To: References: Message-ID: <25CO5hMsh1UPgSEZNT3ywdBxRs7EHhYTiYxWDuakfKc=.35f7b465-36de-4152-abbe-397e92aba117@github.com> > Moving array construction within BigInteger.implMultiplyToLen intrinsic candidate to its caller simplifies the intrinsic implementation in JIT compiler. Yudi Zheng has updated the pull request incrementally with one additional commit since the last revision: address comment. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/18226/files - new: https://git.openjdk.org/jdk/pull/18226/files/37232a5f..bfc323b7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18226&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18226&range=00-01 Stats: 70 lines in 11 files changed: 8 ins; 28 del; 34 mod Patch: https://git.openjdk.org/jdk/pull/18226.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18226/head:pull/18226 PR: https://git.openjdk.org/jdk/pull/18226 From yzheng at openjdk.org Tue Mar 19 19:06:36 2024 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 19 Mar 2024 19:06:36 GMT Subject: RFR: 8327964: Simplify BigInteger.implMultiplyToLen intrinsic [v2] In-Reply-To: References: Message-ID: On Mon, 18 Mar 2024 16:55:28 GMT, Damon Fenacci wrote: > Quite a simplification! Have you checked if there are any performance differences? Ran https://github.com/oracle/graal/blob/master/compiler/src/org.graalvm.micro.benchmarks/src/micro/benchmarks/BigIntegerBenchmark.java The results are $ before change Benchmark Mode Cnt Score Error Units BigIntegerBenchmark.bigIntMontgomeryMul thrpt 5 122.488 ± 0.130 ops/s BigIntegerBenchmark.bigIntMontgomerySqr thrpt 5 76.023 ± 0.106 ops/s BigIntegerBenchmark.bigIntMul thrpt 5 330.130 ± 0.349 ops/s BigIntegerBenchmark.bigIntMulAdd thrpt 5 455.590 ± 0.663 ops/s $ after change Benchmark Mode Cnt Score Error Units BigIntegerBenchmark.bigIntMontgomeryMul thrpt 5 124.407 ± 0.045 ops/s BigIntegerBenchmark.bigIntMontgomerySqr thrpt 5 76.036 ± 0.232 ops/s BigIntegerBenchmark.bigIntMul thrpt 5 329.836 ± 0.953 ops/s BigIntegerBenchmark.bigIntMulAdd thrpt 5 456.485 ±
0.766 ops/s ------------- PR Comment: https://git.openjdk.org/jdk/pull/18226#issuecomment-2007922439 From dlong at openjdk.org Tue Mar 19 19:43:20 2024 From: dlong at openjdk.org (Dean Long) Date: Tue, 19 Mar 2024 19:43:20 GMT Subject: RFR: 8327964: Simplify BigInteger.implMultiplyToLen intrinsic [v2] In-Reply-To: <25CO5hMsh1UPgSEZNT3ywdBxRs7EHhYTiYxWDuakfKc=.35f7b465-36de-4152-abbe-397e92aba117@github.com> References: <25CO5hMsh1UPgSEZNT3ywdBxRs7EHhYTiYxWDuakfKc=.35f7b465-36de-4152-abbe-397e92aba117@github.com> Message-ID: On Tue, 19 Mar 2024 19:06:36 GMT, Yudi Zheng wrote: >> Moving array construction within BigInteger.implMultiplyToLen intrinsic candidate to its caller simplifies the intrinsic implementation in JIT compiler. > > Yudi Zheng has updated the pull request incrementally with one additional commit since the last revision: > > address comment. src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 3559: > 3557: Register tmp5, Register tmp6, Register product_hi, Register tmp8) { > 3558: > 3559: assert_different_registers(x, xlen, y, ylen, z, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp8); Suggestion: assert_different_registers(x, xlen, y, ylen, z, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp8, product_hi); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18226#discussion_r1530980566 From dlong at openjdk.org Tue Mar 19 19:46:20 2024 From: dlong at openjdk.org (Dean Long) Date: Tue, 19 Mar 2024 19:46:20 GMT Subject: RFR: 8327964: Simplify BigInteger.implMultiplyToLen intrinsic [v2] In-Reply-To: <25CO5hMsh1UPgSEZNT3ywdBxRs7EHhYTiYxWDuakfKc=.35f7b465-36de-4152-abbe-397e92aba117@github.com> References: <25CO5hMsh1UPgSEZNT3ywdBxRs7EHhYTiYxWDuakfKc=.35f7b465-36de-4152-abbe-397e92aba117@github.com> Message-ID: <2zUQ2j5f9hwiK70250bR627K7vDslkAWA9pMzdXwqYI=.e2239329-eb87-4dd3-b70c-c928dd9b757c@github.com> On Tue, 19 Mar 2024 19:06:36 GMT, Yudi Zheng wrote: >> Moving array construction within BigInteger.implMultiplyToLen intrinsic candidate to its caller 
simplifies the intrinsic implementation in JIT compiler. > > Yudi Zheng has updated the pull request incrementally with one additional commit since the last revision: > > address comment. src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4670: > 4668: const Register tmp6 = r15; > 4669: const Register tmp7 = r16; > 4670: const Register tmp8 = r17; It looks like tmp8 is never used. The call to multiply_to_len() below needs to be adjusted. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18226#discussion_r1530986811 From bkilambi at openjdk.org Tue Mar 19 20:33:22 2024 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 19 Mar 2024 20:33:22 GMT Subject: RFR: 8320725: C2: Add "is_associative" flag for floating-point add-reduction [v3] In-Reply-To: References: Message-ID: <0jAgvIIRdk1GGdCqWIIIWh4Hv9kxEQHLDIqhXOi-Ir4=.0c8772f2-05b2-4c10-97cb-6ac3e953573b@github.com> On Mon, 18 Mar 2024 12:35:50 GMT, Emanuel Peter wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Naming changes: replace strict/non-strict with more technical terms > > src/hotspot/cpu/aarch64/aarch64_vector.ad line 2887: > >> 2885: // Such nodes can only be generated by Vector API. >> 2886: // 2. Non-associative (or strictly ordered) AddReductionVF, which can only be generated by >> 2887: // auto-vectorization on SVE machine. > > I'd be careful with such strong statements. You can name them as examples, but the "only" may not stay true forever. We may for example at some point add some way to have a `Float.addAssociative`, to allow "fast-math" reductions in plain java, which can then be auto-vectorized with a non-strict ordered implementation. I'm not sure if we will ever do that, but it is possible. Agreed, will make the changes. 
Thanks ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1531066198 From bkilambi at openjdk.org Tue Mar 19 20:51:22 2024 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 19 Mar 2024 20:51:22 GMT Subject: RFR: 8320725: C2: Add "is_associative" flag for floating-point add-reduction [v3] In-Reply-To: References: Message-ID: On Mon, 18 Mar 2024 12:52:37 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/vectornode.cpp line 1332: >> >>> 1330: case Op_AddReductionVL: return new AddReductionVLNode(ctrl, n1, n2); >>> 1331: case Op_AddReductionVF: return new AddReductionVFNode(ctrl, n1, n2, is_associative); >>> 1332: case Op_AddReductionVD: return new AddReductionVDNode(ctrl, n1, n2, is_associative); >> >> Why do you only do it for the `F/D` `Add` instructions, but not the `Mul` instructions? Would those not equally profit from associativity? > > I'm not super familiar with the Vector API, but I could not see that MUL is not associative. Yes, MUL is non-associative in VectorAPI just like ADD operation (according to the description here - https://docs.oracle.com/en/java/javase/19/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#fp_assoc). We found a significant perf difference between the SVE "fadda" instruction which is a strictly ordered instruction vs Neon instructions on a 128-bit SVE machine especially after this optimization - https://bugs.openjdk.org/browse/JDK-8298244 but there's no such performance difference for the MUL operation. MulReductionVF/VD do not have direct instructions for multiply reduction nor do they have separate ISA for strictly ordered or non-strictly ordered. So, currently we do not have any data that shows any benefit to add similar code for MUL and thus it's currently considered to be a non-associative operation (strictly ordered). I am not sure about other platforms. 
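[Editor's note] As background for the strict-ordering discussion above, here is a minimal, self-contained Java sketch (illustration only, not code from the patch; the class name is invented) of why floating-point addition cannot be reassociated freely, which is exactly what an `is_associative` flag on a floating-point add reduction has to track:

```java
public class FpAssocSketch {
    public static void main(String[] args) {
        double a = 1e20, b = -1e20, c = 1.0;
        // Left-to-right, as a strictly ordered reduction (e.g. SVE fadda) would compute it:
        double strict = (a + b) + c;
        // Reassociated, as a non-strict (vectorized lane-wise) reduction might compute it:
        double reassoc = a + (b + c);
        System.out.println(strict);  // 1.0
        System.out.println(reassoc); // 0.0 -- the 1.0 is absorbed by -1e20 first
        // Integer addition has no such problem, which is why only the F/D reductions need the flag.
    }
}
```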
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1531091964 From yzheng at openjdk.org Tue Mar 19 21:09:31 2024 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 19 Mar 2024 21:09:31 GMT Subject: RFR: 8327964: Simplify BigInteger.implMultiplyToLen intrinsic [v3] In-Reply-To: References: Message-ID: > Moving array construction within BigInteger.implMultiplyToLen intrinsic candidate to its caller simplifies the intrinsic implementation in JIT compiler. Yudi Zheng has updated the pull request incrementally with one additional commit since the last revision: address comment. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18226/files - new: https://git.openjdk.org/jdk/pull/18226/files/bfc323b7..870a6127 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18226&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18226&range=01-02 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18226.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18226/head:pull/18226 PR: https://git.openjdk.org/jdk/pull/18226 From jkarthikeyan at openjdk.org Wed Mar 20 02:05:56 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 20 Mar 2024 02:05:56 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v8] In-Reply-To: References: Message-ID: > Hi all, I've created this patch which aims to convert common integer minimum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`. While this transform doesn't generally improve code generation on its own, it simplifies control flow and creates new opportunities for vectorization.
> > I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: > > Baseline Patch Improvement > Benchmark Mode Cnt Score Error Units Score Error Units > IfMinMax.testReductionInt avgt 15 500.307 ± 16.687 ns/op 509.383 ± 32.645 ns/op (no change)* > IfMinMax.testReductionLong avgt 15 493.184 ± 17.596 ns/op 513.587 ± 28.339 ns/op (no change)* > IfMinMax.testSingleInt avgt 15 3.588 ± 0.540 ns/op 2.965 ± 1.380 ns/op (no change) > IfMinMax.testSingleLong avgt 15 3.673 ± 0.128 ns/op 3.506 ± 0.590 ns/op (no change) > IfMinMax.testVectorInt avgt 15 340.425 ± 13.123 ns/op 59.689 ± 7.509 ns/op + 5.7x > IfMinMax.testVectorLong avgt 15 326.420 ± 15.554 ns/op 117.190 ± 5.622 ns/op + 2.8x > > > * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? > > The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated!
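[Editor's note] For readers skimming the thread, the two source shapes named in the description are both equivalent to `Math.max`; a minimal Java sketch (hypothetical helper names, not the patch itself):

```java
public class IfMinMaxSketch {
    // Shape 1: ternary, `a > b ? a : b`
    static int maxTernary(int a, int b) {
        return a > b ? a : b;
    }

    // Shape 2: conditional store, `if (a > b) b = a;`
    static int maxIfStore(int a, int b) {
        if (a > b) {
            b = a;
        }
        return b;
    }

    public static void main(String[] args) {
        for (int a = -2; a <= 2; a++) {
            for (int b = -2; b <= 2; b++) {
                // Both shapes compute the same value as Math.max; these are the
                // kinds of if/phi patterns a compiler can rewrite into a MaxI node.
                if (maxTernary(a, b) != Math.max(a, b)) throw new AssertionError();
                if (maxIfStore(a, b) != Math.max(a, b)) throw new AssertionError();
            }
        }
        System.out.println("ok");
    }
}
```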
Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Apply changes from code review and add IR test for vectorization and reduction ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17574/files - new: https://git.openjdk.org/jdk/pull/17574/files/f929239a..3882d241 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17574&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17574&range=06-07 Stats: 313 lines in 3 files changed: 311 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/17574.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17574/head:pull/17574 PR: https://git.openjdk.org/jdk/pull/17574 From jkarthikeyan at openjdk.org Wed Mar 20 02:10:31 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 20 Mar 2024 02:10:31 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: <-Cwct-5ZBYHEG-67r6xe4by7s0rI7w27ogfdJcIEBrw=.e4c95ed8-6353-46b9-a946-3bf2b2c47765@github.com> References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> <_yIQLmJFXOolbLAS8Wcxgl1juRlQwB0OWkKd8ZMcfmg=.9ed4a52d-9ffb-45eb-a0dc-7b3201974882@github.com> <-Cwct-5ZBYHEG-67r6xe4by7s0rI7w27ogfdJcIEBrw=.e4c95ed8-6353-46b9-a946-3bf2b2c47765@github.com> Message-ID: On Tue, 5 Mar 2024 11:14:51 GMT, Emanuel Peter wrote: >>> You mean you would be matching for a `Cmp -> CMove` node pattern that is equivalent for `Min/Max`, rather than matching a `Cmp -> If -> Phi` pattern? >> >> Yeah, I was thinking it might be better to let the CMove transform happen first, since the conditions guarding both transforms are aiming to do the same thing in essence. My thought was that if the regression in your `testCostDifference` was fixed, it would be better to not have to do that fix in two different locations, since it impacts `is_minmax` as well. 
>> >>> BTW, I watched a fascinating talk about branch-predictors / branchless code yesterday >> >> Thank you for linking this talk, it was really insightful! I also wonder if it would be possible to capture branch execution patterns somehow, to drive branch flattening optimizations. I figure it could be possible to keep track of the sequence of a branch's history of execution, and then compute some "entropy" value from that sequence to determine if there's a pattern, or if it's random and likely to be mispredicted. However, implementing that in practice sounds pretty difficult. >> >> @eme64 I've pushed a commit that fixes the benchmarks and sets the loop iteration count to 10_000. Could you check if this lets it vectorize on your machine? Thanks! > > @jaskarth Why don't you first make the code change starting from a `Cmp -> CMove` pattern rather than the `Cmp -> If -> Phi` pattern. Then I can look at both things together ;) Thank you for the re-review @eme64! I've pushed a commit that should address your comments. I've also added an IR test that verifies that min/max vectorization and reduction is taking place, as requested. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-2008563833 From fyang at openjdk.org Wed Mar 20 02:49:19 2024 From: fyang at openjdk.org (Fei Yang) Date: Wed, 20 Mar 2024 02:49:19 GMT Subject: RFR: 8328404: RISC-V: Fix potential crash in C2_MacroAssembler::arrays_equals In-Reply-To: References: Message-ID: On Tue, 19 Mar 2024 03:44:10 GMT, Gui Cao wrote: > Hi, The current behavior of C2_MacroAssembler::arrays_equals always loads a longword before comparison. > When array[0] is aligned to 32-bit (especially after JDK-8139457 which tries to relax alignment > of array elements), the last longword load will exceed the array limit and may touch the next > word beyond object layout in heap memory. So this likely has the same problem as JDK-8328138.
> > Proposed fix changes this behavior and aligns with handling in C2_MacroAssembler::string_equals, > which will check the number of remaining array elements before loading the next longword. > No obvious changes witnessed from the JMH numbers or benchmarks like SPECjbb2015. > > Patch also removed the AvoidUnalignedAccesses check in C2_MacroAssembler::string_equals as we > don't see extra performance gain when setting AvoidUnalignedAccesses to false when testing the > JMH tests or benchmarks like SPECjbb2015 on three popular RISC-V hardware platforms. We can > consider adding it back if it turns out to be useful on future hardware. > > > ### Correctness test: > - [x] Run tier1-3, hotspot:tier4 tests on LicheePi 4A (release) > - [x] Run tier1-3, hotspot:tier4 tests on SOPHON SG2042 (release) > > > ### JMH test: > > #### 1. test/micro/org/openjdk/bench/java/util/ArraysEquals.java > 1. SiFive unmatched > > Before: > Benchmark Mode Cnt Score Error Units > ArraysEquals.testByteFalseBeginning avgt 12 37.804 ± 7.292 ns/op > ArraysEquals.testByteFalseEnd avgt 12 77.972 ± 3.208 ns/op > ArraysEquals.testByteFalseMid avgt 12 54.427 ± 6.436 ns/op > ArraysEquals.testByteTrue avgt 12 75.121 ± 5.172 ns/op > ArraysEquals.testCharFalseBeginning avgt 12 42.486 ± 6.526 ns/op > ArraysEquals.testCharFalseEnd avgt 12 122.208 ± 2.533 ns/op > ArraysEquals.testCharFalseMid avgt 12 83.891 ± 3.680 ns/op > ArraysEquals.testCharTrue avgt 12 122.096 ± 5.519 ns/op > > After: > Benchmark Mode Cnt Score Error Units > ArraysEquals.testByteFalseBeginning avgt 12 32.638 ± 7.279 ns/op > ArraysEquals.testByteFalseEnd avgt 12 73.013 ± 8.081 ns/op > ArraysEquals.testByteFalseMid avgt 12 43.619 ± 6.104 ns/op > ArraysEquals.testByteTrue avgt 12 83.044 ± 8.207 ns/op > ArraysEquals.testCharFalseBeginning avgt 12 39.154 ± 5.233 ns/op > ArraysEquals.testCharFalseEnd avgt 12 122.072 ± 7.784 ns/op > ArraysEquals.testCharFalseMid avgt 12 67.831 ± 9.218 ns/op > Ar... Looks fine.
In fact, array[0] could have an alignment of 32-bit after JDK-8139457 when running with -XX:-UseCompressedClassPointers. In this case, we have base_offset = 20 (bytes). It will also be an issue when we add support for lilliput on riscv some day, in which case we will have base_offset = 12 (bytes). ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18370#pullrequestreview-1947866478 From jzhu at openjdk.org Wed Mar 20 03:55:33 2024 From: jzhu at openjdk.org (Joshua Zhu) Date: Wed, 20 Mar 2024 03:55:33 GMT Subject: RFR: 8326541: [AArch64] ZGC C2 load barrier stub considers the length of live registers when spilling registers [v4] In-Reply-To: References: Message-ID: > Currently ZGC C2 load barrier stub saves the whole live register regardless of what size of register is live on aarch64. > Considering the size of SVE register is an implementation-defined multiple of 128 bits, up to 2048 bits, > even the use of a floating point may cause the maximum 2048 bits stack occupied. > Hence I would like to introduce this change on aarch64: take the length of live registers into consideration in ZGC C2 load barrier stub. > > In a floating point case on 2048 bits SVE machine, the following ZLoadBarrierStubC2 > > > ...... > 0x0000ffff684cfad8: stp x15, x18, [sp, #80] > 0x0000ffff684cfadc: sub sp, sp, #0x100 > 0x0000ffff684cfae0: str z16, [sp] > 0x0000ffff684cfae4: add x1, x13, #0x10 > 0x0000ffff684cfae8: mov x0, x16 > ;; 0xFFFF803F5414 > 0x0000ffff684cfaec: mov x8, #0x5414 // #21524 > 0x0000ffff684cfaf0: movk x8, #0x803f, lsl #16 > 0x0000ffff684cfaf4: movk x8, #0xffff, lsl #32 > 0x0000ffff684cfaf8: blr x8 > 0x0000ffff684cfafc: mov x16, x0 > 0x0000ffff684cfb00: ldr z16, [sp] > 0x0000ffff684cfb04: add sp, sp, #0x100 > 0x0000ffff684cfb08: ptrue p7.b > 0x0000ffff684cfb0c: ldp x4, x5, [sp, #16] > ...... > > > could be optimized into: > > > ...... > 0x0000ffff684cfa50: stp x15, x18, [sp, #80] > 0x0000ffff684cfa54: str d16, [sp, #-16]!
// extra 8 bytes to align 16 bytes in push_fp() > 0x0000ffff684cfa58: add x1, x13, #0x10 > 0x0000ffff684cfa5c: mov x0, x16 > ;; 0xFFFF7FA942A8 > 0x0000ffff684cfa60: mov x8, #0x42a8 // #17064 > 0x0000ffff684cfa64: movk x8, #0x7fa9, lsl #16 > 0x0000ffff684cfa68: movk x8, #0xffff, lsl #32 > 0x0000ffff684cfa6c: blr x8 > 0x0000ffff684cfa70: mov x16, x0 > 0x0000ffff684cfa74: ldr d16, [sp], #16 > 0x0000ffff684cfa78: ptrue p7.b > 0x0000ffff684cfa7c: ldp x4, x5, [sp, #16] > ...... > > > Besides the above benefit, when we know what size of register is live, > we could remove the unnecessary caller save in ZGC C2 load barrier stub when we meet C-ABI SOE fp registers. > > Passed jtreg with option "-XX:+UseZGC -XX:+ZGenerational" with no failures introduced. Joshua Zhu has updated the pull request incrementally with one additional commit since the last revision: Add more output for easy debugging once the jtreg test case fails ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17977/files - new: https://git.openjdk.org/jdk/pull/17977/files/382866f7..f2960eb1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17977&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17977&range=02-03 Stats: 21 lines in 1 file changed: 17 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/17977.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17977/head:pull/17977 PR: https://git.openjdk.org/jdk/pull/17977 From jzhu at openjdk.org Wed Mar 20 03:55:34 2024 From: jzhu at openjdk.org (Joshua Zhu) Date: Wed, 20 Mar 2024 03:55:34 GMT Subject: RFR: 8326541: [AArch64] ZGC C2 load barrier stub considers the length of live registers when spilling registers [v3] In-Reply-To: References: Message-ID: On Mon, 18 Mar 2024 14:37:38 GMT, Stuart Monteith wrote: > In the event of failure, would it be possible to print the erroneous output? The output from the subprocesses, being directly piped in, doesn't lend itself to easy debugging. 
At first I thought there might be an option that could alter OutputAnalyzers output, but sadly not. Done. Thanks for your comments. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17977#discussion_r1531483040 From thartmann at openjdk.org Wed Mar 20 06:24:29 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 20 Mar 2024 06:24:29 GMT Subject: RFR: 8319889: Vector API tests trigger VM crashes with -XX:+StressIncrementalInlining [v2] In-Reply-To: References: <25mULcFaZsaLw0z88ywNaq3virkrimZa-rpgWO8d3Dc=.04909c6c-42f4-4981-895d-2d7f07b027d5@github.com> Message-ID: <8CgH5oWzIXdHTCFbdRQivXR1NJE4HpJ7JaeofgXAgIE=.f3175732-d03e-4f3e-b4c3-fb3be4a549c1@github.com> On Sat, 16 Mar 2024 22:19:10 GMT, Jatin Bhateja wrote: >> This bug fix patch fixes a crash occurring due to combined effect of bi-morphic inlining, exception handling, randomized incremental inlining. In this case top level slice API is invoked using concrete 256 bit vector, some of the intermediate APIs within sliceTemplate are marked for lazy inlining due to randomized IncrementalInlining, these APIs returns an abstract vector which when used for virtual dispatch of subsequent APIs results into bi-morphic inlining on account of multiple profile based receiver types. Consider following code snippet. >> >> >> ByteVector sliceTemplate(int origin, Vector v1) { >> ByteVector that = (ByteVector) v1; >> that.check(this); >> Objects.checkIndex(origin, length() + 1); >> VectorShuffle iota = iotaShuffle(); >> VectorMask blendMask = iota.toVector().compare(VectorOperators.LT, (broadcast((byte)(length() - origin)))); [A] >> iota = iotaShuffle(origin, 1, true); [B] >> return that.rearrange(iota).blend(this.rearrange(iota), blendMask); [C] >> } >> >> >> >> Receiver for sliceTemplate is a 256 bit vector, parser defers inlining of toVector() API (see code at line A) and generates a Call IR returning an abstract vector. This abstract vector then virtually dispatches compare API. 
Compiler observes multiple profile based receiver types (128 and 256 bit byte vectors) for compare API and parser generates a chain of PredictedCallGenerators for bi-morphically inlining it. >> >> PredictedCallGenerators (Vector.compare) >> PredictedCallGenerators (Byte256Vector.compare) >> ParseGenerator (Byte256Vector.compare) [D] >> UncommonTrap (receiver other than Byte256Vector) >> PredictedCallGenerators (Byte128Vector.compare) >> ParseGenerator (Byte128Vector.compare) [E] >> UncommonTrap (receiver other than Byte128Vector) [F] >> PredictedCallGenerators (UncommonTrap) >> [converged state] = Merge JVM State originating from C and E [G] >> >> Since the top level receiver of sliceTemplate is Byte256Vector, while executing the call generator for Byte128Vector.compare (see code at line E) the compiler observes a mismatch between incoming argument species, i.e. one argument is a 256 bit vector while the other is a 128 bit vector, and throws an exception. >> >> At state convergence point (see code at line G), since one of the c... > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Restricting the patch to only bi-morphic inlining crash fix. I'm late here but is there a reason no regression test was added? ------------- PR Comment: https://git.openjdk.org/jdk/pull/18282#issuecomment-2008724565 From stuefe at openjdk.org Wed Mar 20 06:26:24 2024 From: stuefe at openjdk.org (Thomas Stuefe) Date: Wed, 20 Mar 2024 06:26:24 GMT Subject: RFR: JDK-8327986: ASAN reports use-after-free in DirectivesParserTest.empty_object_vm In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 13:57:53 GMT, Thomas Stuefe wrote: > ASAN reports a use-after-free, because we feed the string we got from `setlocale` back to `setlocale`, but the libc owns this string, and the libc decided to free it in the meantime. > > According to POSIX, it should be valid to pass into setlocale output from setlocale.
> > However, glibc seems to delete the old string when calling setlocale again: > > https://codebrowser.dev/glibc/glibc/locale/setlocale.c.html#198 > > Best to make a copy, and pass in the copy to setlocale. Closed, will be part of https://github.com/openjdk/jdk/pull/18230 ------------- PR Comment: https://git.openjdk.org/jdk/pull/18235#issuecomment-2008726551 From stuefe at openjdk.org Wed Mar 20 06:26:24 2024 From: stuefe at openjdk.org (Thomas Stuefe) Date: Wed, 20 Mar 2024 06:26:24 GMT Subject: Withdrawn: JDK-8327986: ASAN reports use-after-free in DirectivesParserTest.empty_object_vm In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 13:57:53 GMT, Thomas Stuefe wrote: > ASAN reports a use-after-free, because we feed the string we got from `setlocale` back to `setlocale`, but the libc owns this string, and the libc decided to free it in the meantime. > > According to POSIX, it should be valid to pass into setlocale output from setlocale. > > However, glibc seems to delete the old string when calling setlocale again: > > https://codebrowser.dev/glibc/glibc/locale/setlocale.c.html#198 > > Best to make a copy, and pass in the copy to setlocale. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/18235 From galder at openjdk.org Wed Mar 20 09:04:36 2024 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Wed, 20 Mar 2024 09:04:36 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v7] In-Reply-To: References: Message-ID: > Adding C1 intrinsic for primitive array clone invocations for aarch64 and x86 architectures. > > The intrinsic includes a change to avoid zeroing the newly allocated array because its contents are copied over within the same intrinsic with arraycopy. This means that the performance of primitive array clone exceeds that of primitive array copy. 
As an example, here are the microbenchmark results on darwin/aarch64: > > > $ make test TEST="micro:java.lang.ArrayClone" MICRO="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 3.476 ± 0.018 ns/op > ArrayClone.byteArraycopy 10 avgt 15 3.740 ± 0.017 ns/op > ArrayClone.byteArraycopy 100 avgt 15 7.124 ± 0.010 ns/op > ArrayClone.byteArraycopy 1000 avgt 15 39.301 ± 0.106 ns/op > ArrayClone.byteClone 0 avgt 15 3.478 ± 0.008 ns/op > ArrayClone.byteClone 10 avgt 15 3.562 ± 0.007 ns/op > ArrayClone.byteClone 100 avgt 15 5.888 ± 0.206 ns/op > ArrayClone.byteClone 1000 avgt 15 25.762 ± 0.203 ns/op > ArrayClone.intArraycopy 0 avgt 15 3.199 ± 0.016 ns/op > ArrayClone.intArraycopy 10 avgt 15 4.521 ± 0.008 ns/op > ArrayClone.intArraycopy 100 avgt 15 17.429 ± 0.039 ns/op > ArrayClone.intArraycopy 1000 avgt 15 178.432 ± 0.777 ns/op > ArrayClone.intClone 0 avgt 15 3.406 ± 0.016 ns/op > ArrayClone.intClone 10 avgt 15 4.272 ± 0.006 ns/op > ArrayClone.intClone 100 avgt 15 13.110 ± 0.122 ns/op > ArrayClone.intClone 1000 avgt 15 113.196 ± 13.400 ns/op > > > It also includes an optimization to avoid instantiating the array copy stub in scenarios like this. > > I ran hotspot compiler tests successfully, limiting them to C1 compilation on darwin/aarch64, linux/x86_64 and linux/686. E.g. > > > $ make test TEST="hotspot_compiler" JTREG="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" > ... > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg:hotspot_compiler 1234 1234 0 0 > > > One question I had is what to do about non-primitive object arrays, see my [question](https://bugs.openjdk.org/browse/JDK-8302850?focusedId=14634879&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14634879) on the issue. @cl4es any thoughts? > > Thanks @rwestrel for his help shaping this up :) Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 10 commits: - Merge branch 'master' into topic.0131.c1-array-clone - Merge branch 'master' into topic.0131.c1-array-clone - Reserve necessary frame map space for clone use cases - 8302850: C1 primitive array clone intrinsic in graph * Combine array length, new type array and arraycopy for clone in c1 graph. * Add OmitCheckFlags to skip arraycopy checks. * Instantiate ArrayCopyStub only if necessary. * Avoid zeroing newly created arrays for clone. * Add array null after c1 clone compilation test. * Pass force reexecute to intrinsic via value stack. This is needed to be able to deoptimize correctly this intrinsic. * When new type array or array copy are used for the clone intrinsic, their state needs to be based on the state before for deoptimization to work as expected. - Revert "8302850: Primitive array copy C1 intrinsic for aarch64 and x86" This reverts commit fe5d916724614391a685bbef58ea939c84197d07. - 8302850: Link code emit infos for null check and alloc array - 8302850: Null check array before getting its length * Added a jtreg test to verify the null check works. Without the fix this test fails with a SEGV crash. - 8302850: Force reexecuting clone in case of a deoptimization * Copy state including locals for clone so that reexecution works as expected. - 8302850: Avoid instantiating array copy stub for clone use cases - 8302850: Primitive array copy C1 intrinsic for aarch64 and x86 * Clone calls that involve Phi nodes are not supported. * Add unimplemented stubs for other platforms. 
------------- Changes: https://git.openjdk.org/jdk/pull/17667/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17667&range=06 Stats: 218 lines in 16 files changed: 184 ins; 4 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/17667.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17667/head:pull/17667 PR: https://git.openjdk.org/jdk/pull/17667 From galder at openjdk.org Wed Mar 20 10:00:29 2024 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Wed, 20 Mar 2024 10:00:29 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v6] In-Reply-To: References: Message-ID: On Fri, 15 Mar 2024 18:56:46 GMT, Dean Long wrote: > OK, I missed the is_intrinsic_supported change. The platform-specific changes should probably have a comment saying they are for clone support. That's fine for me, but the platform-specific changes are a bit spread around. Anywhere in particular you'd want the comment(s) to go in? Or shall the comments be added in all platform specific changes? > Also, I was hoping there was a way to minimize platform-specific changes, maybe by handing the force_reexecute inheritance in state_for().... I'm not sure exactly what you mean, but let me try to explain the `x->state_before()` bit. So I guess you're referring to this piece of logic: CodeEmitInfo* info = nullptr; if (x->state_before() != nullptr && x->state_before()->force_reexecute()) { info = state_for(x, x->state_before()); info->set_force_reexecute(); } else { info = state_for(x, x->state()); } This code was added to deal with SEGV failures coming out of `compiler/interpreter/Test6833129.clone_and_verify`, which is configured to stress deoptimizations. In `append_alloc_array_copy`, I construct `NewTypeArray` and the array_copy `Intrinsic` both with `state_before` so that if any deoptimizations happen at either stage, re-execution happens from the point before clone was called.
I guess I can move the code up to `LIRGenerator::state_for` and if the intrinsic is either array copy or new type array, apply that logic, is that what you are after? > ... putting the state in x->state() instead of x->state_before() Hmmm, you mean this instead? CodeEmitInfo* info = nullptr; if (x->state_before() != nullptr && x->state_before()->force_reexecute()) { info = state_for(x, x->state()); info->set_force_reexecute(); } else { info = state_for(x, x->state()); } Granted that it could be tidied but not sure I understand how state can work here. Before this change both new type array and array copy called `CodeEmitInfo* info = state_for(x, x->state());` which didn't seem to be enough to be able to go back to re-executing bytecode before clone was called. It seemed to leave things in a half-done state when deoptimizations happened while clone's intrinsic was doing either a new type array or array copy. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17667#issuecomment-2009168412 From rgiulietti at openjdk.org Wed Mar 20 12:05:34 2024 From: rgiulietti at openjdk.org (Raffaello Giulietti) Date: Wed, 20 Mar 2024 12:05:34 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v49] In-Reply-To: References: Message-ID: On Fri, 15 Mar 2024 06:20:04 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. 
>> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant may overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (conditions (1) and (2) above are mostly the same). As a result, the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow a uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows a uint64 or `(x + 1) * c2` never overflows a uint64. And we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and have some reviews. Thank you very much.
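[Editor's note] To make the multiply-and-shift scheme above concrete, here is a minimal Java sketch (illustration only, not the patch's code; the class name is invented) of unsigned 32-bit division by the constant d = 3. The magic pair c = 0xAAAAAAAB = ceil(2**33 / 3) with m = 33 satisfies condition (3), and the full 64-bit product never overflows an unsigned 64-bit value:

```java
public class MagicUDiv3 {
    // floor(x / 3) for x treated as an unsigned 32-bit value,
    // computed as a full 64-bit multiply followed by a shift.
    static int udiv3(int x) {
        long ux = x & 0xFFFF_FFFFL;              // zero-extend to 64 bits
        return (int) ((ux * 0xAAAA_AAABL) >>> 33); // c = ceil(2^33 / 3), m = 33
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 2, 3, 7, 100, Integer.MAX_VALUE, -1 /* 0xFFFFFFFF */};
        for (int x : samples) {
            if (udiv3(x) != Integer.divideUnsigned(x, 3)) {
                throw new AssertionError("mismatch for " + Integer.toUnsignedString(x));
            }
        }
        System.out.println("ok");
    }
}
```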
> > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > fix tests src/hotspot/share/opto/divnode.cpp line 245: > 243: // to rounding down, now it is guaranteed to be correct, according to > 244: // N-Bit Unsigned Division Via N-Bit Multiply-Add by Arch D. Robison > 245: magic_divide_constants_round_down(divisor, magic_const, shift_const); I think there's no need for `magic_divide_constants_round_down()`. Firstly, we recover the previous value of `magic_const` and `shift_const`, just before the overflow: magic_const = magic_const + 1 >> 1 | 0x8000_0000; shift_const -= 1; Then we decrement `magic_const` by one: magic_const -= 1; That's it. If desired, we can additionally reduce `magic_const` to an odd value by right-shifting it by the number of trailing zero bits, and updating `shift_const` accordingly. But for the usage here, I think it makes no real difference. We can thus avoid the division in `magic_divide_constants_round_down()`, and can get rid of the method altogether, as it seems to be used only here. This might even open similar code for the `julong` case. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1531963851 From roland at openjdk.org Wed Mar 20 12:22:34 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 20 Mar 2024 12:22:34 GMT Subject: RFR: 8324121: SIGFPE in PhaseIdealLoop::extract_long_range_checks Message-ID: Both failures occur because `ABS(scale * stride_con)` overflows (scale is a really large long number). I reworked the test so overflow is no longer an issue.
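[Editor's note] As an aside on the overflow mentioned above, here is a small hedged Java analogue (not the C2 code; the helper name is invented) of detecting when a signed 64-bit product such as `scale * stride_con` has overflowed, and of why taking the absolute value is itself unsafe at the edge:

```java
public class OverflowSketch {
    // A signed 64-bit multiply overflows iff the high half of the full
    // 128-bit product disagrees with the sign-extension of the low half.
    static boolean mulOverflows(long a, long b) {
        return Math.multiplyHigh(a, b) != (a * b) >> 63;
    }

    public static void main(String[] args) {
        // abs() is not safe at the edge: |Long.MIN_VALUE| is not representable.
        System.out.println(Math.abs(Long.MIN_VALUE)); // still Long.MIN_VALUE (negative!)
        System.out.println(mulOverflows(1L << 62, 4)); // true
        System.out.println(mulOverflows(3, 4));        // false
    }
}
```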
------------- Commit messages: - fix & test Changes: https://git.openjdk.org/jdk/pull/18397/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18397&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8324121 Stats: 70 lines in 3 files changed: 61 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/18397.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18397/head:pull/18397 PR: https://git.openjdk.org/jdk/pull/18397 From epeter at openjdk.org Wed Mar 20 12:38:23 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 20 Mar 2024 12:38:23 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v4] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 02:05:39 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` are in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing.
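[Editor's note] The tautology behind this transformation is easy to sanity-check outside the VM: masking with `m` can only clear bits, so `x & m` is never unsigned-greater than `m`. A small Java check (illustration only; the class name is invented):

```java
public class AndUleSketch {
    public static void main(String[] args) {
        java.util.Random r = new java.util.Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int x = r.nextInt();
            int m = r.nextInt();
            // ((x & m) u<= m) always holds: the mask cannot produce a bit
            // that is clear in m, so the result never exceeds m unsigned.
            if (Integer.compareUnsigned(x & m, m) > 0) {
                throw new AssertionError("counterexample: x=" + x + " m=" + m);
            }
        }
        System.out.println("always true");
    }
}
```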
> > Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: > > update the package name for tests test/hotspot/jtreg/compiler/c2/irTests/TestBoolNodeGvn.java line 39: > 37: * @summary Refactor boolean node tautology transformations > 38: * @library /test/lib / > 39: * @run driver compiler.c2.irTests.TestBoolNodeGvn Suggestion: * @run main compiler.c2.irTests.TestBoolNodeGvn otherwise flags from the outside do not apply to this test ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1532003889 From epeter at openjdk.org Wed Mar 20 12:47:23 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 20 Mar 2024 12:47:23 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v4] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 02:05:39 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` are in the `BoolNode::Ideal` function, with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that is better suited to the `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. 
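The pattern being folded here is a genuine tautology: `x & m` can only clear bits of `m`, so interpreted as unsigned it can never exceed `m`. A standalone sanity check in plain Java (this is not the jtreg IR test from the PR, just an illustrative sketch):

```java
import java.util.Random;

public class MaskUnsignedTautology {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int x = rnd.nextInt();
            int m = rnd.nextInt();
            // (x & m) u<= m must hold for every x and m
            if (Integer.compareUnsigned(x & m, m) > 0) {
                throw new AssertionError("counterexample: x=" + x + ", m=" + m);
            }
        }
        System.out.println("tautology holds for 1000000 random pairs");
    }
}
```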
But we are close ;) test/hotspot/jtreg/compiler/c2/irTests/TestBoolNodeGvn.java line 41: > 39: * @run driver compiler.c2.irTests.TestBoolNodeGvn > 40: */ > 41: public class TestBoolNodeGvn { Suggestion: public class TestBoolNodeGVN { I would have used the capital for GVN, since it is an acronym, and we use it capitalized everywhere. test/hotspot/jtreg/compiler/c2/irTests/TestBoolNodeGvn.java line 52: > 50: */ > 51: @Test > 52: @Arguments({Argument.DEFAULT, Argument.DEFAULT}) You will have to merge from master, and fix this line. There was a change in the `@Arguments` annotation recently. Suggestion: @Arguments(values = {Argument.DEFAULT, Argument.DEFAULT}) I got lots of build issues: @Arguments({Argument.DEFAULT, Argument.DEFAULT}) ^ symbol: method value() location: @interface Arguments test/hotspot/jtreg/compiler/c2/irTests/TestBoolNodeGvn.java line 58: > 56: & !(Integer.compareUnsigned((m & x), m) > 0) > 57: & Integer.compareUnsigned((x & m), m + 1) < 0 > 58: & Integer.compareUnsigned((m & x), m + 1) < 0; For easier reading, I would have put the `&` at the end of the line. Btw: is this supposed to be a bitwise or a binary and? ------------- Changes requested by epeter (Reviewer). 
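Regarding the `&` question at the end of the review: in Java, `&` applied to `boolean` operands is a non-short-circuiting logical AND, while `&&` skips the right-hand side when the left is `false`. IR correctness tests often use `&` deliberately so that every sub-expression is evaluated on every run. A small sketch (names invented) showing the difference:

```java
public class NonShortCircuitAnd {
    static int calls = 0;

    static boolean touch(boolean v) {
        calls++; // count operand evaluations
        return v;
    }

    public static void main(String[] args) {
        calls = 0;
        boolean a = touch(false) & touch(true);  // both operands evaluated
        int withSingleAmp = calls;

        calls = 0;
        boolean b = touch(false) && touch(true); // right-hand side skipped
        int withDoubleAmp = calls;

        System.out.println("& evaluated " + withSingleAmp
                + " operands, && evaluated " + withDoubleAmp);
    }
}
```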
PR Review: https://git.openjdk.org/jdk/pull/18198#pullrequestreview-1948764568 PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1532014803 PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1532009274 PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1532009265 From epeter at openjdk.org Wed Mar 20 12:53:21 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 20 Mar 2024 12:53:21 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v6] In-Reply-To: <1T0FRJWu0I5-AHFhl03RtTdTNQRR_rN9zZBra3TDuZ8=.926b109c-a23f-4a38-aea1-3a7862c3fadb@github.com> References: <1T0FRJWu0I5-AHFhl03RtTdTNQRR_rN9zZBra3TDuZ8=.926b109c-a23f-4a38-aea1-3a7862c3fadb@github.com> Message-ID: On Tue, 19 Mar 2024 14:17:08 GMT, Hamlin Li wrote: >> test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 57: >> >>> 55: public static void main(String args[]) { >>> 56: TestFramework.runWithFlags("-XX:-TieredCompilation", "-XX:CompileThresholdScaling=0.3", "-XX:MaxVectorSize=16"); >>> 57: TestFramework.runWithFlags("-XX:-TieredCompilation", "-XX:CompileThresholdScaling=0.3", "-XX:MaxVectorSize=32"); >> >> You should either drop the `"-XX:MaxVectorSize=16"`, or at least have a run without this flag. >> There are machines with higher max vector length, e.g. AVX512 has `64`. Would be nice to test those too ;) > > Thanks for the suggestion. > Unfortunately, I don't have access to a machine with AVX512, but I do run with an aarch64 via qemu where the max vector size is > 16, and it works with "-XX:MaxVectorSize=16". > The reason why the previous test failed (which I fixed in a previous commit) with "-XX:MaxVectorSize=8" is that the test framework checks the vector length and makes sure it is greater than the length of a double (8 bytes), i.e. at least 2 * (length of double). 
Would it be an alternative to remove these here in the arguments, but instead to limit the IR rules, so that the IR rules are only checked if the MaxVectorSize is large enough? Example you can look for: `applyIf = {"MaxVectorSize", ">=32"}` I think you would just have an `applyIf = {"MaxVectorSize", ">=16"}`. I will run the tests on our Oracle machines for AVX512, so don't worry about testing that. Maybe you could also simulate a 64 byte register machine with SVE over qemu, but I leave that up to you. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1532024982 From epeter at openjdk.org Wed Mar 20 13:06:22 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 20 Mar 2024 13:06:22 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v8] In-Reply-To: References: Message-ID: On Tue, 19 Mar 2024 15:04:46 GMT, Hamlin Li wrote: >> Hi, >> Can you have a look at this patch adding some tests for Math.round intrinsics? >> Thanks! >> >> ### FYI: >> During the development of RoundVF/RoundF, we faced issues which were only spotted by running tests exhaustively against the 32/64-bit ranges of int/long. >> It's helpful to add these exhaustive tests to the jdk for possible future usage, rather than building them every time they are needed. >> Of course, we need to put them in `manual` mode, so they are not run when the `-automatic` jtreg option is specified, which I guess is the mode the CI uses; please correct me if I assume incorrectly. > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > rename > For inf, the probability is 2 in all runs (+inf, -inf); for NaN, the probability is (eStep/eBound) (but depends on the rand value); in the new version I modified it to make sure it tests NaN, so it's (eStep/eBound) now. > Just my experience: for Math.round(float/double), special cases like inf/NaN are not the error-prone ones, normal values are more error-prone. 
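For reference on the special cases discussed here: the results of `Math.round` for NaN and the infinities are fully specified by the JDK, so covering them explicitly in a test is cheap. A minimal sketch of the guaranteed values:

```java
public class RoundSpecialCases {
    public static void main(String[] args) {
        // float -> int
        System.out.println(Math.round(Float.NaN));                // 0
        System.out.println(Math.round(Float.POSITIVE_INFINITY));  // clamped to Integer.MAX_VALUE
        System.out.println(Math.round(Float.NEGATIVE_INFINITY));  // clamped to Integer.MIN_VALUE
        System.out.println(Math.round(-0.5f));                    // ties round toward +inf: 0

        // double -> long
        System.out.println(Math.round(Double.NaN));               // 0
        System.out.println(Math.round(Double.POSITIVE_INFINITY)); // clamped to Long.MAX_VALUE
        System.out.println(Math.round(Double.NEGATIVE_INFINITY)); // clamped to Long.MIN_VALUE
    }
}
```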
Maybe it is not super error-prone, but you never know. I've seen multiple bugs with other float operations, so now it always makes me nervous when I don't see them. Can you please throw in these special cases explicitly, so we can be sure we have them? I think that the significand/exponent combination works now, but I'm always a bit scared that we still have an error, and we don't see it. That is why I think it would be nice to have at least one very clean random number that is thrown in as well, that has no modifications to it. But your generation method clearly has advantages: we systematically cover all ranges. That is good. Thanks for the work :) ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17753#pullrequestreview-1948822091 From epeter at openjdk.org Wed Mar 20 15:08:51 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 20 Mar 2024 15:08:51 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset Message-ID: I'm refactoring the packset, separating the details of packset-manipulation from the SuperWord algorithm. Most importantly: I split it into two classes: `PairSet` and `PackSet`. `combine_pairs_to_longer_packs` converts the first into the second. I was able to simplify the combining, and remove the pack-sorting. I now walk "pair-chains" directly with `PairSetIterator`. One such pair-chain is equivalent to a pack. I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm. I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`: Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. we would pack `ac bd` instead of `ab cd`). 
Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)). But since I already spent some time on parts of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I. More details are described in the annotations in the code. ------------- Commit messages: - revert the estimate_cost_savings_when_packing_pair to previous logic - pairset insertion order - improve estimate_cost_savings_when_packing_pair heuristic - fix MulAddS2I case, and add more tests - improve extension cost model - get_right_or_null_for - some const-ness - PairSetIterator now iterates pair-chains - small bugfix, need to check in_bb - rm last Todo - ... and 16 more: https://git.openjdk.org/jdk/compare/251347bd...bdc57434 Changes: https://git.openjdk.org/jdk/pull/18276/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18276&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8325252 Stats: 1115 lines in 6 files changed: 483 ins; 339 del; 293 mod Patch: https://git.openjdk.org/jdk/pull/18276.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18276/head:pull/18276 PR: https://git.openjdk.org/jdk/pull/18276 From epeter at openjdk.org Wed Mar 20 15:08:55 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 20 Mar 2024 15:08:55 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 14:25:57 GMT, Emanuel Peter wrote: > I'm refactoring the packset, separating the details of packset-manipulation from the SuperWord algorithm. > > Most importantly: I split it into two classes: `PairSet` and `PackSet`. > `combine_pairs_to_longer_packs` converts the first into the second. 
> > I was able to simplify the combining, and remove the pack-sorting. > I now walk "pair-chains" directly with `PairSetIterator`. One such pair-chain is equivalent to a pack. > > I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm. > > I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`: > Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. we would pack `ac bd` instead of `ab cd`). Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed, if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)). > > But since I already spent some time on some of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I. > > More details are described in the annotations in the code. src/hotspot/share/opto/superword.cpp line 49: > 47: _clone_map(phase()->C->clone_map()), // map of nodes created in cloning > 48: _align_to_ref(nullptr), // memory reference to align vectors to > 49: _race_possible(false), // cases where SDMU is true Note: removed it. Will explain where it was used. src/hotspot/share/opto/superword.cpp line 471: > 469: combine_pairs_to_longer_packs(); > 470: > 471: construct_my_pack_map(); Note: `my_pack` is now `packset.pack`, and the map is directly created in `packset.add_pack` in `combine_pairs_to_longer_packs` above. src/hotspot/share/opto/superword.cpp line 873: > 871: return false; > 872: } > 873: Note: this was rather inefficient. 
Now we use `packset.has_left(n)` or `packset.has_right(n)`, which are simple constant time lookups. src/hotspot/share/opto/superword.cpp line 1056: > 1054: bool changed; > 1055: do { > 1056: packset_sort(_packset.length()); Note: sorting used to be necessary, because the old `combine_pairs_to_longer_packs` required the pairs to all be ordered by `alignment`, such that `(s1, s2)` always comes before `(s2, s3)`. With the new combination method, sorting is no longer necessary. src/hotspot/share/opto/superword.cpp line 1058: > 1056: Node* s1 = pair.left(); > 1057: Node* s2 = pair.right(); > 1058: order_inputs_of_all_use_pairs_to_match_def_pair(s1, s2); Note: `_race_possible` basically is on exactly iff there was some pair that had multiple uses, for which we needed to do `order_def_uses` (now called `order_inputs_of_all_use_pairs_to_match_def_pair`). I think that this happens rather often, and it is not worth having some global flag for this. I now just always enable it. src/hotspot/share/opto/superword.cpp line 1208: > 1206: break; > 1207: } > 1208: Node* use2 = _pairset.get_right_for(use1); Note: another case where we used to search through all packs, and now can simply do a constant-time lookup. src/hotspot/share/opto/superword.cpp line 1208: > 1206: // Find pair (use1, use2) > 1207: Node* use2 = _pairset.get_right_or_null_for(use1); > 1208: if (use2 == nullptr) { break; } Note: instead of searching the whole packset, we now have a constant-time lookup to find the pair. src/hotspot/share/opto/superword.cpp line 1210: > 1208: if (use2 == nullptr) { break; } > 1209: > 1210: order_inputs_of_uses_to_match_def_pair(def1, def2, use1, use2); Note: renamed from `opnd_positions_match` src/hotspot/share/opto/superword.cpp line 1253: > 1251: // 1. Reduction > 1252: if (is_marked_reduction(use1) && is_marked_reduction(use2)) { > 1253: Node* first = use1->in(2); Note: while this refactoring here was not strictly necessary, I still took the time to improve the comments a bit. 
I needed to understand this anyway for the `_race_possible` removal. src/hotspot/share/opto/superword.cpp line 1330: > 1328: // Check if we have a pair (use1, use2) > 1329: if (!_pairset.has_left(use1)) { continue; } > 1330: Node* use2 = _pairset.get_right_for(use1); Note: another constant-time lookup instead of iterating all packs/pairs. src/hotspot/share/opto/superword.cpp line 1340: > 1338: ct++; > 1339: if (are_adjacent_refs(use1, use2)) { > 1340: save_use += adjacent_profit(use1, use2); Note: we can find the pair directly, no need to search the packset. src/hotspot/share/opto/superword.cpp line 1358: > 1356: int SuperWord::adjacent_profit(Node* s1, Node* s2) { return 2; } > 1357: int SuperWord::pack_cost(int ct) { return ct; } > 1358: int SuperWord::unpack_cost(int ct) { return ct; } Note: I am now using lambda methods. src/hotspot/share/opto/superword.cpp line 1369: > 1367: left = right; > 1368: } > 1369: _packset.add_pack(pack); Note: I replaced a quadratic loop, which basically checked all-with-all pairs, if they can be combined. An additional benefit: we don't need the packs sorted (this also removes the need to have all nodes annotated with alignment, but that will be more useful in a future RFE). src/hotspot/share/opto/superword.cpp line 1393: > 1391: > 1392: // Remove all nullptr from packset > 1393: compress_packset(); Note: the old method left `nullptr` in the packset, the new method does not. src/hotspot/share/opto/superword.cpp line 1587: > 1585: } > 1586: }; > 1587: split_packs(filter_name, split_strategy); Note: I can get rid of a lot of code by just implementing `filter` with a `split`. src/hotspot/share/opto/superword.cpp line 1750: > 1748: }; > 1749: _packset.filter_packs("Superword::filter_packs_for_profitable", > 1750: "not profitable", filter); Note: now that I use `split_packs` to implement `filter_packs`, the repetitions are already taken care of (i.e. repeating while there are changes). 
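Several of the notes above replace packset scans with constant-time lookups, and the new `combine_pairs_to_longer_packs` walks pair-chains such as `(a,b), (b,c), (c,d)` directly into one pack `[a,b,c,d]`. A hypothetical Java sketch of that idea (all names invented; like the `PairSet`, it assumes each node is the left element of at most one pair and the right element of at most one pair):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PairChainSketch {
    public static void main(String[] args) {
        // Pairs (left, right), e.g. as produced by SuperWord pairing.
        String[][] pairs = {{"a", "b"}, {"x", "y"}, {"b", "c"}, {"c", "d"}};

        Map<String, String> rightFor = new HashMap<>(); // get_right_for analogue
        Set<String> rights = new HashSet<>();           // has_right analogue
        for (String[] p : pairs) {
            rightFor.put(p[0], p[1]);
            rights.add(p[1]);
        }

        // A chain head is a left element that is not the right element of any pair.
        List<List<String>> packs = new ArrayList<>();
        for (String[] p : pairs) {
            if (rights.contains(p[0])) continue; // middle of a chain, not a head
            List<String> pack = new ArrayList<>();
            pack.add(p[0]);
            // Constant-time map lookups instead of re-scanning all pairs.
            for (String n = rightFor.get(p[0]); n != null; n = rightFor.get(n)) {
                pack.add(n);
            }
            packs.add(pack);
        }
        System.out.println(packs);
    }
}
```

Each pair-chain becomes one pack, in pair-insertion order, without any quadratic all-with-all search or prior sorting.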
src/hotspot/share/opto/superword.cpp line 1774: > 1772: } > 1773: _packset.trunc_to(j); > 1774: } Note: last use was removed. src/hotspot/share/opto/superword.cpp line 1795: > 1793: } > 1794: } > 1795: } Note: now done for each pack directly inside `combine_pairs_to_longer_packs` with `packset.add_pack`. src/hotspot/share/opto/superword.cpp line 1970: > 1968: _packset.print_pack(pack); > 1969: assert(false, "pack not profitable"); > 1970: } Just added the `implemented` and `profitable` checks for good measure. src/hotspot/share/opto/superword.cpp line 3441: > 3439: } > 3440: return false; > 3441: } Note: replaced with `pairset.has_pair`. Used to be search over packset, now constant time lookup. src/hotspot/share/opto/superword.cpp line 3452: > 3450: } > 3451: _packset.at_put(pos, nullptr); > 3452: } Note: last use was removed. src/hotspot/share/opto/superword.cpp line 3471: > 3469: n = max_swap_index; > 3470: } while (n > 1); > 3471: } Note: sort not needed any more with new combine method, see explanation above. src/hotspot/share/opto/superword.cpp line 3734: > 3732: } > 3733: } > 3734: } Note: walking the chains allows us to nicely display the pairs that belong together without need for sort. Printing all pairs individually was extremely verbose, and it was hard to figure out what pairs were later chained. Printing pack-chains now looks like printing packsets, and the transition is smoother that way. src/hotspot/share/opto/superword.cpp line 3868: > 3866: #endif > 3867: } > 3868: Note: this used to display the `alignment` next to all nodes in the packs. I plan to remove `alignment` for all nodes soon, and since I now am changing the printing code, I now print the position in the pack instead. src/hotspot/share/opto/superword.hpp line 62: > 60: // The PairSet is a set of pairs. These are later combined to packs, > 61: // and stored in the PackSet. > 62: class PairSet : public StackObj { Note: new class. 
src/hotspot/share/opto/superword.hpp line 67: > 65: public: > 66: int _alignment; // memory alignment for a node > 67: Node_List* _my_pack; // pack containing this node Note: is now in `PackSet`. src/hotspot/share/opto/superword.hpp line 72: > 70: > 71: // List of all left elements bb_idx, in the order of pair addition. > 72: GrowableArray _lefts_in_insertion_order; Note: It turns out that the pair-extension is quite sensitive to the order in which pairs are found. Keeping the insertion order does something similar to a BFS. If this order is changed, then some of our examples don't pack properly. Well, they pack correctly on a local level, but then that prevents the proper packing further on. Maybe I can remove the need for this order in the future, but for now I'll keep the order, the same order as the `_packset`'s pairs had. src/hotspot/share/opto/superword.hpp line 106: > 104: }; > 105: > 106: class PairSetIterator : public StackObj { Note: new class. used to easily iterate over `PairSet`. src/hotspot/share/opto/superword.hpp line 137: > 135: }; > 136: > 137: class SplitTask { Note: moved it out of SuperWord. No changes made. src/hotspot/share/opto/superword.hpp line 182: > 180: }; > 181: > 182: class SplitStatus { Note: moved it out of SuperWord. No changes made. src/hotspot/share/opto/superword.hpp line 224: > 222: const GrowableArray& packset() const { return _packset; } > 223: private: > 224: bool _race_possible; // In cases where SDMU is true Note: removed it. Explanation elsewhere. src/hotspot/share/opto/superword.hpp line 226: > 224: }; > 225: > 226: class PackSet : public StackObj { Note: new class. src/hotspot/share/opto/superword.hpp line 298: > 296: > 297: // Combine packs A and B with A.last == B.first into A.first..,A.last,B.second,..B.last > 298: void combine_pairs_to_longer_packs(); Note: moved and renamed some. 
src/hotspot/share/opto/superword.hpp line 300: > 298: void combine_pairs_to_longer_packs(); > 299: > 300: class SplitTask { Note: class moved out of SuperWord. src/hotspot/share/opto/superword.hpp line 345: > 343: }; > 344: > 345: class SplitStatus { Note: class moved out of SuperWord. src/hotspot/share/opto/superword.hpp line 391: > 389: SplitStatus split_pack(const char* split_name, Node_List* pack, SplitTask task); > 390: template > 391: void split_packs(const char* split_name, SplitStrategy strategy); Note: moved to `PackSet` src/hotspot/share/opto/superword.hpp line 401: > 399: void filter_packs(const char* filter_name, > 400: const char* error_message, > 401: FilterPredicate filter); Note: moved to `PackSet` src/hotspot/share/opto/superword.hpp line 411: > 409: void compress_packset(); > 410: // Construct the map from nodes to packs. > 411: void construct_my_pack_map(); Note: removed src/hotspot/share/opto/superword.hpp line 449: > 447: void initialize_node_info(); > 448: // Compute max depth for expressions from beginning of block > 449: void compute_max_depth(); Note: removed. src/hotspot/share/opto/superword.hpp line 457: > 455: bool in_packset(Node* s1, Node* s2); > 456: // Remove the pack at position pos in the packset > 457: void remove_pack_at(int pos); Note: removed src/hotspot/share/opto/superword.hpp line 464: > 462: void adjust_pre_loop_limit_to_align_main_loop_vectors(); > 463: // Is the use of d1 in u1 at the same operand position as d2 in u2? 
> 464: bool opnd_positions_match(Node* d1, Node* u1, Node* d2, Node* u2); Note: renamed, and moved up: `order_inputs_of_uses_to_match_def_pair` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1524988862 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1524990361 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1524991799 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1524995044 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1524997880 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525000012 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1526142885 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1526142923 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525002107 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525003343 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1531978254 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525003978 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525007880 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525008557 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525010591 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525015661 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525011783 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525012913 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525016402 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525018069 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525018510 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525019238 PR Review Comment: 
https://git.openjdk.org/jdk/pull/18276#discussion_r1525021056 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525023674 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525025490 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525031457 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1531982329 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525030409 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525024562 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525024785 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525032262 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525025278 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525040328 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525033432 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525033588 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525037708 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525037094 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525036685 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525036284 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525036109 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525035777 From epeter at openjdk.org Wed Mar 20 15:08:55 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 20 Mar 2024 15:08:55 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset In-Reply-To: References: Message-ID: On Thu, 14 Mar 2024 14:41:24 GMT, Emanuel Peter wrote: >> I'm refactoring the packset, separating the details of packset-manupulation from the SuperWord algorithm. >> >> Most importantly: I split it into two classes: `PairSet` and `PackSet`. 
>> `combine_pairs_to_longer_packs` converts the first into the second. >> >> I was able to simplify the combining, and remove the pack-sorting. >> I now walk "pair-chains" directly with `PairSetIterator`. One such pair-chain is equivalent to a pack. >> >> I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm. >> >> I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`: >> Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. we would pack `ac bd` instead of `ab cd`). Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed, if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)). >> >> But since I already spent some time on some of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I. >> >> More details are described in the annotations in the code. > > src/hotspot/share/opto/superword.cpp line 1253: > >> 1251: // 1. Reduction >> 1252: if (is_marked_reduction(use1) && is_marked_reduction(use2)) { >> 1253: Node* first = use1->in(2); > > Note: while this refactoring here was not strictly necessary, I still took the time to improve the comments a bit. I needed to understand this anyway for the `_race_possible` removal. Note: I also added some more cases for MulAddS2I test, reflecting some of the swaps in this method. > src/hotspot/share/opto/superword.hpp line 226: > >> 224: }; >> 225: >> 226: class PackSet : public StackObj { > > Note: new class. 
What's nice about this new class is that I can now make some methods `private`, e.g.: - `set_pack` (used to be `set_my_pack`) - `split_pack`, which is used inside the public method `split_packs` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1531987178 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1525028944 From rehn at openjdk.org Wed Mar 20 16:22:34 2024 From: rehn at openjdk.org (Robbin Ehn) Date: Wed, 20 Mar 2024 16:22:34 GMT Subject: RFR: 8328614: hsdis: dlsym can't find decode symbol Message-ID: Hi, please consider. [8327045](https://bugs.openjdk.org/browse/JDK-8327045) hid these symbols. Tested with gcc and clang, and with the llvm and binutils backends. I didn't find any use of "DLL_ENTRY", so I removed it. Thanks, Robbin ------------- Commit messages: - export symbols Changes: https://git.openjdk.org/jdk/pull/18400/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18400&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8328614 Stats: 17 lines in 4 files changed: 10 ins; 2 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/18400.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18400/head:pull/18400 PR: https://git.openjdk.org/jdk/pull/18400 From epeter at openjdk.org Wed Mar 20 16:40:21 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 20 Mar 2024 16:40:21 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If In-Reply-To: References: Message-ID: On Thu, 14 Mar 2024 07:10:30 GMT, Christian Hagedorn wrote: > This is a follow-up to the previous refactoring done in https://github.com/openjdk/jdk/pull/18080. The patch starts to replace the usages of `create_bool_from_template_assertion_predicate()` by providing a refactored and fixed cloning algorithm. 
> > #### How `create_bool_from_template_assertion_predicate()` Works > Currently, the algorithm in `create_bool_from_template_assertion_predicate()` uses an iterative DFS walk to find all nodes of a Template Assertion Predicate Expression in order to clone them. We do the following: > 1. Follow all inputs if they could be a node that's part of a Template Assertion Predicate (compares opcodes): > https://github.com/openjdk/jdk/blob/326c91e1a28ec70822ef927ee9ab17f79aa6d35c/src/hotspot/share/opto/loopTransform.cpp#L1513 > > 2. Once we find an `OpaqueLoopInit` or `OpaqueLoopStride` node, we start backtracking in the DFS. While doing so, we start to clone all nodes on the path from the `OpaqueLoop*Nodes` node to the start node and already update the graph. This logic is quite complex and difficult to understand since we do everything simultaneously. This was one of the reasons I originally tried to refactor this method in https://github.com/openjdk/jdk/pull/16877, because I needed to extend it for the full fix of Assertion Predicates in JDK-8288981. > > #### Missing Visited Set > The current implementation of `create_bool_from_template_assertion_predicate()` does not use a visited set. This means that whenever we find a diamond shape, we could visit a node twice and re-discover all paths above this diamond again: > > > ... > | > E > | > D > / \ > B C > \ / > A > > DFS walk: A -> B -> D -> E -> ... -> C -> D -> E -> ... > > With each diamond, the number of revisits of each node above doubles. > > #### Endless DFS in Edge-Cases > In most cases, we would normally just stop quite quickly once we follow a data node that is not part of a Template Assertion Predicate Expression because the node opcode is different. However, in the test cases, we create a long chain of data nodes with many diamonds that could all be part of a Template Assertion Predicate Expression (i.e. `is_part_of_template_assertion_predicate_bool()` would return true to follow the inputs in a DFS walk). 
As a result, the DFS revisits a lot of nodes, especially higher up in the graph, exponentially many times, and compilation is stuck for a long time (running the test cases results in a test timeout because background compilation is disabled). > > #### New DFS Implem... Nice work, looks good :) I have a few initial comments, and will look at it in more detail tomorrow. src/hotspot/share/opto/loopnode.hpp line 1662: > 1660: ParsePredicateSuccessProj* fast_loop_parse_predicate_proj, > 1661: ParsePredicateSuccessProj* slow_loop_parse_predicate_proj); > 1662: IfProjNode* clone_assertion_predicate_for_unswitched_loops(Node* template_assertion_predicate, IfProjNode* predicate, Suggestion: IfProjNode* clone_assertion_predicate_for_unswitched_loops(IfNode* template_assertion_predicate, IfProjNode* predicate, Could we improve the type? I think all uses have `IfNode*`. src/hotspot/share/opto/loopnode.hpp line 1890: > 1888: > 1889: // Interface to transform OpaqueLoopInit and OpaqueLoopStride nodes of a Template Assertion Predicate Expression. > 1890: class TransformStrategyForOpaqueLoopNodes : public StackObj { The name could reflect that it is only for template assertion predicates. test/hotspot/jtreg/compiler/predicates/TestCloningWithManyDiamondsInExpression.java line 37: > 35: * -XX:CompileCommand=compileonly,*TestCloningWithManyDiamondsInExpression::test* > 36: * -XX:CompileCommand=inline,*TestCloningWithManyDiamondsInExpression::create* > 37: * compiler.predicates.TestCloningWithManyDiamondsInExpression You could add a run with fewer / no flags. ------------- Changes requested by epeter (Reviewer). 
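The exponential blow-up from the missing visited set ("with each diamond, the number of revisits of each node above doubles", per the PR description above) can be demonstrated with a toy model. The sketch below is hypothetical, not C2 code: a chain of diamonds is modeled as a node whose two inputs both lead to the next node, and visits are counted with and without a visited set:

```java
public class DiamondDfsSketch {
    static long visits = 0;

    // Each level has two edges to the next level, modeling one diamond per level.
    static void dfs(int node, int depth, boolean[] visited, boolean useVisitedSet) {
        if (node > depth) {
            return;
        }
        if (useVisitedSet) {
            if (visited[node]) {
                return; // already processed this node
            }
            visited[node] = true;
        }
        visits++;
        dfs(node + 1, depth, visited, useVisitedSet); // input via left diamond half
        dfs(node + 1, depth, visited, useVisitedSet); // input via right diamond half
    }

    public static void main(String[] args) {
        int depth = 20; // 20 chained diamonds

        visits = 0;
        dfs(0, depth, new boolean[depth + 1], false);
        System.out.println("without visited set: " + visits + " visits"); // 2^21 - 1

        visits = 0;
        dfs(0, depth, new boolean[depth + 1], true);
        System.out.println("with visited set: " + visits + " visits"); // 21
    }
}
```

With only 20 diamonds the naive walk already makes over two million visits, which is why the test cases time out without the visited set.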
PR Review: https://git.openjdk.org/jdk/pull/18293#pullrequestreview-1949397286 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1532394122 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1532397347 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1532431904 From epeter at openjdk.org Wed Mar 20 16:40:22 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 20 Mar 2024 16:40:22 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If In-Reply-To: References: Message-ID: <130ezssvWCgkOjqeun4yPh5X8ypdumhU1uQLfkW9DV8=.4dbd2fd8-1f0d-4487-bd25-536b28084f32@github.com> On Wed, 20 Mar 2024 16:18:08 GMT, Emanuel Peter wrote: >> This is a follow-up to the previous refactoring done in https://github.com/openjdk/jdk/pull/18080. The patch starts to replace the usages of `create_bool_from_template_assertion_predicate()` by providing a refactored and fixed cloning algorithm. >> >> #### How `create_bool_from_template_assertion_predicate()` Works >> Currently, the algorithm in `create_bool_from_template_assertion_predicate()` uses an iterative DFS walk to find all nodes of a Template Assertion Predicate Expression in order to clone them. We do the following: >> 1. Follow all inputs if they could be a node that's part of a Template Assertion Predicate (compares opcodes): >> https://github.com/openjdk/jdk/blob/326c91e1a28ec70822ef927ee9ab17f79aa6d35c/src/hotspot/share/opto/loopTransform.cpp#L1513 >> >> 2. Once we find an `OpaqueLoopInit` or `OpaqueLoopStride` node, we start backtracking in the DFS. While doing so, we start to clone all nodes on the path from the `OpaqueLoop*Nodes` node to the start node and already update the graph. This logic is quite complex and difficult to understand since we do everything simultaneously. 
This was one of the reasons I originally tried to refactor this method in https://github.com/openjdk/jdk/pull/16877: I needed to extend it for the full fix of Assertion Predicates in JDK-8288981. >> >> #### Missing Visited Set >> The current implementation of `create_bool_from_template_assertion_predicate()` does not use a visited set. This means that whenever we find a diamond shape, we could visit a node twice and re-discover all paths above this diamond again: >> >> >> ... >> | >> E >> | >> D >> / \ >> B C >> \ / >> A >> >> DFS walk: A -> B -> D -> E -> ... -> C -> D -> E -> ... >> >> With each diamond, the number of revisits of each node above doubles. >> >> #### Endless DFS in Edge-Cases >> In most cases, we would normally stop quite quickly once we follow a data node that is not part of a Template Assertion Predicate Expression because the node opcode is different. However, in the test cases, we create a long chain of data nodes with many diamonds that could all be part of a Template Assertion Predicate Expression (i.e. `is_part_of_template_assertion_predicate_bool()` would return true to follow the inputs in a DFS walk). As a result, the DFS revisits a lot of nodes, especially higher up in the graph, exponentially many times and compilation is stuck for a long time (running the test cases results in a test timeout because... > > src/hotspot/share/opto/loopnode.hpp line 1890: > >> 1888: >> 1889: // Interface to transform OpaqueLoopInit and OpaqueLoopStride nodes of a Template Assertion Predicate Expression. >> 1890: class TransformStrategyForOpaqueLoopNodes : public StackObj { > > The name could reflect that it is only for template assertion predicates. Maybe put it in the scope of `TemplateAssertionPredicateExpression` class? Not sure about this idea yet, just an idea. 
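The exponential revisits described in the quoted "Missing Visited Set" section are easy to model outside of C2. Below is a hypothetical Java sketch (not HotSpot code; the depth-indexed diamond chain and `topVisits` are made up for illustration) that counts how often the top node of a chain of diamonds is reached by a DFS, with and without a visited set:

```java
import java.util.HashSet;
import java.util.Set;

public class DiamondDfs {
    // Model of the diamond chain from the PR description: the merge node at
    // depth d has two inputs, and both funnel into the merge node at depth d-1.
    // Depth 0 stands for the OpaqueLoop* node at the top of the expression.
    // Pass visited == null to disable the visited set (the pre-fix behaviour).
    static long topVisits(int depth, Set<Integer> visited) {
        if (visited != null && !visited.add(depth)) {
            return 0; // already seen: prune, as an implementation with a visited set does
        }
        if (depth == 0) {
            return 1; // reached the top node
        }
        // two diamond inputs, both leading to the same node one level up
        return topVisits(depth - 1, visited) + topVisits(depth - 1, visited);
    }

    public static void main(String[] args) {
        System.out.println(topVisits(10, null));            // 1024 = 2^10
        System.out.println(topVisits(10, new HashSet<>())); // 1
    }
}
```

With `d` diamonds the unpruned walk reaches the top node `2^d` times, which matches the "number of revisits doubles with each diamond" observation above.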
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1532399370 From epeter at openjdk.org Wed Mar 20 16:40:22 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 20 Mar 2024 16:40:22 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If In-Reply-To: <130ezssvWCgkOjqeun4yPh5X8ypdumhU1uQLfkW9DV8=.4dbd2fd8-1f0d-4487-bd25-536b28084f32@github.com> References: <130ezssvWCgkOjqeun4yPh5X8ypdumhU1uQLfkW9DV8=.4dbd2fd8-1f0d-4487-bd25-536b28084f32@github.com> Message-ID: On Wed, 20 Mar 2024 16:19:27 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/loopnode.hpp line 1890: >> >>> 1888: >>> 1889: // Interface to transform OpaqueLoopInit and OpaqueLoopStride nodes of a Template Assertion Predicate Expression. >>> 1890: class TransformStrategyForOpaqueLoopNodes : public StackObj { >> >> The name could reflect that it is only for template assertion predicates. > > Maybe put it in the scope of `TemplateAssertionPredicateExpression` class? Not sure about this idea yet, just an idea. Should this not be in `predicates.hpp`, together with its implementations? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1532419974 From sjayagond at openjdk.org Wed Mar 20 17:44:33 2024 From: sjayagond at openjdk.org (Sidraya Jayagond) Date: Wed, 20 Mar 2024 17:44:33 GMT Subject: RFR: 8328633: s390x: Improve vectorization of Math.sqrt() on floats Message-ID: [JDK-8190800](https://bugs.openjdk.org/browse/JDK-8190800) added `VSqrtF` and `SqrtF` nodes to support the vectorization of Math.sqrt() on floats. For the s390x port, however, the scalar version of `sqrtF` still uses the old match rule that converts Float to Double first. It can be simplified to just use `SqrtF`. The old match rule also affects the vectorization of Math.sqrt() on float. 
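For illustration, the kind of float loop this change targets looks like this in Java (a hedged sketch, not a test from the patch): a straight float square root over an array, which no longer needs a Float-to-Double round trip per element once a direct `SqrtF` match rule exists, and which is then a candidate for vectorization.

```java
public class FloatSqrtLoop {
    // Candidate loop for auto-vectorization: float sqrt applied elementwise.
    static float[] sqrtAll(float[] in) {
        float[] out = new float[in.length];
        for (int i = 0; i < in.length; i++) {
            // the pattern the float sqrt match rule covers
            out[i] = (float) Math.sqrt(in[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        float[] r = sqrtAll(new float[] {4.0f, 9.0f, 16.0f});
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // 2.0 3.0 4.0
    }
}
```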
------------- Commit messages: - s390x: Math.sqrt() does not vectorize on floats Changes: https://git.openjdk.org/jdk/pull/18406/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18406&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8328633 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18406.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18406/head:pull/18406 PR: https://git.openjdk.org/jdk/pull/18406 From kvn at openjdk.org Wed Mar 20 18:25:21 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 20 Mar 2024 18:25:21 GMT Subject: RFR: 8324121: SIGFPE in PhaseIdealLoop::extract_long_range_checks In-Reply-To: References: Message-ID: On Wed, 20 Mar 2024 12:17:03 GMT, Roland Westrelin wrote: > Both failures occur because `ABS(scale * stride_con)` overflows (scale > a really large long number). I reworked the test so overflow is no > longer an issue. src/hotspot/share/opto/loopnode.cpp line 1110: > 1108: if (loop->is_range_check_if(if_proj, this, T_LONG, phi, range, offset, scale) && > 1109: loop->is_invariant(range) && loop->is_invariant(offset) && > 1110: original_iters_limit / ABS(scale) >= min_iters * ABS(stride_con)) { I assume there is a check somewhere that `stride_con` is not `MIN_INT`. 
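The hazard behind this question can be sketched in Java, where two's-complement overflow wraps (in the C++ `ABS` macro, negating the most negative value is undefined behaviour): the "absolute value" of `MIN_VALUE` is `MIN_VALUE` itself, so it stays negative, and a guard like `original_iters_limit / ABS(scale)` can then misbehave. A hypothetical sketch, not HotSpot code:

```java
public class AbsOverflow {
    // Same shape as an ABS macro: negate if negative.
    static long abs(long v) {
        return v < 0 ? -v : v;
    }

    public static void main(String[] args) {
        // -Long.MIN_VALUE wraps back to Long.MIN_VALUE, so "abs" stays negative.
        System.out.println(abs(Long.MIN_VALUE) == Long.MIN_VALUE); // true
        System.out.println(abs(Long.MIN_VALUE) < 0);               // true
        System.out.println(abs(-5L));                              // 5
    }
}
```

This is why asserting both the input (not `MIN_VALUE`) and the output (non-negative) of `ABS` in debug builds, as suggested below, would catch such cases early.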
> > src/hotspot/share/opto/loopnode.cpp line 1110: > >> 1108: if (loop->is_range_check_if(if_proj, this, T_LONG, phi, range, offset, scale) && >> 1109: loop->is_invariant(range) && loop->is_invariant(offset) && >> 1110: original_iters_limit / ABS(scale) >= min_iters * ABS(stride_con)) { > > I assume there is check somewhere that `stride_con` is not `MIN_INT`. In my opinion ABS() should assert that it has legal input (not MIN_INT) and output (non-negative value) in debug builds. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18397#discussion_r1532629022 From mli at openjdk.org Wed Mar 20 18:54:21 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 20 Mar 2024 18:54:21 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v6] In-Reply-To: References: <1T0FRJWu0I5-AHFhl03RtTdTNQRR_rN9zZBra3TDuZ8=.926b109c-a23f-4a38-aea1-3a7862c3fadb@github.com> Message-ID: On Wed, 20 Mar 2024 12:50:53 GMT, Emanuel Peter wrote: >> Thanks for the suggestion. >> Unfortunately, I don't have access to a machine with AVX512, but I do run with a aarch64 via qemu where max vector size > 16, and it works with "-XX:MaxVectorSize=16". >> The reason why the previous test failed (which I fixed in previous commit) with "-XX:MaxVectorSize=8", is because in test framework, it checks the length of vector and make sure it > length of double(8 bytes), i.e. at least 2*(length of Double). > > Can you limit the IR rules, so that the IR rules are only checked if the MaxVectorSize is large enough? > Example you can look for: `applyIf = {"MaxVectorSize", ">=32"}` > > I think you would just have a `applyIf = {"MaxVectorSize", ">=16"}`. > I will run the tests on our Oracle machines for AVX512, so don't worry about testing that. > Maybe you could also simulate a 64 byte register machine with SVE over qemu, but I leave that up to you. Thanks for the suggestion, it's done. I also tested with "-XX:MaxVectorSize=64" on aarch64 via qemu, it works. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1532632382 From mli at openjdk.org Wed Mar 20 19:11:34 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 20 Mar 2024 19:11:34 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v9] In-Reply-To: References: Message-ID: <8NTtFb2VzNSiEVMTzHz0An84ZlpYqYTK0n7gMZyfZOE=.71abbac7-9284-4231-b4ca-b15c4c407424@github.com> > Hi, > Can you have a look at this patch adding some tests for Math.round intrinsics? > Thanks! > > ### FYI: > During the development of RoundVF/RoundF, we faced issues which were only spotted by running tests exhaustively against the 32/64-bit ranges of int/long. > It's helpful to add these exhaustive tests to the JDK for possible future usage, rather than building them every time they are needed. > Of course, we need to put them in `manual` mode, so they are not run when the `-automatic` jtreg option is specified, which I guess is the mode CI uses; please correct me if I'm assuming incorrectly. 
Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: add more tests; add more IR filter for Double tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17753/files - new: https://git.openjdk.org/jdk/pull/17753/files/8890cffd..962b40e8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17753&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17753&range=07-08 Stats: 77 lines in 2 files changed: 75 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/17753.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17753/head:pull/17753 PR: https://git.openjdk.org/jdk/pull/17753 From mli at openjdk.org Wed Mar 20 19:11:34 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 20 Mar 2024 19:11:34 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v8] In-Reply-To: References: Message-ID: On Wed, 20 Mar 2024 13:03:48 GMT, Emanuel Peter wrote: > > For inf, the probability is 2 in all runs (+inf, -inf); for NaN, the probability is (eStep/eBound) (but it depends on the rand value); in the new version I modified it to make sure it tests NaN, so it's (eStep/eBound) now. > > Just my experience, for Math.round(float/double), special cases like inf/NaN are not the error-prone ones, normal values are more error-prone. > > Maybe it is not super error-prone, but you never know. I've seen multiple bugs with other float-operations, so now it always makes me nervous when I don't see them. > > Can you please throw in these special cases explicitly, so we can be sure we have them? I think that the significand/exponent combination works now, but I'm always a bit scared that we still have an error, and we don't see it. That is why I think it would be nice to have at least one very clean random number that is thrown in as well, that has no modifications to it. Sure, I added some special cases explicitly. > > But your generation method clearly has advantages: we systematically cover all ranges. 
That is good. > > > Thanks for the work :) Thanks for the detailed review :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/17753#issuecomment-2010403870 From dlong at openjdk.org Thu Mar 21 02:42:23 2024 From: dlong at openjdk.org (Dean Long) Date: Thu, 21 Mar 2024 02:42:23 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v7] In-Reply-To: References: Message-ID: <-xxO4c_DQN6d1OPXISLhO-nAQ9-rNKKv8F7XMDXlZes=.6af80818-d0c7-4957-82f0-7498187d64cd@github.com> On Wed, 20 Mar 2024 09:04:36 GMT, Galder Zamarreño wrote: >> Adding C1 intrinsic for primitive array clone invocations for aarch64 and x86 architectures. >> >> The intrinsic includes a change to avoid zeroing the newly allocated array because its contents are copied over within the same intrinsic with arraycopy. This means that the performance of primitive array clone exceeds that of primitive array copy. As an example, here are the microbenchmark results on darwin/aarch64: >> >> >> $ make test TEST="micro:java.lang.ArrayClone" MICRO="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 3.476 ± 0.018 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 3.740 ± 0.017 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 7.124 ± 0.010 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 39.301 ± 0.106 ns/op >> ArrayClone.byteClone 0 avgt 15 3.478 ± 0.008 ns/op >> ArrayClone.byteClone 10 avgt 15 3.562 ± 0.007 ns/op >> ArrayClone.byteClone 100 avgt 15 5.888 ± 0.206 ns/op >> ArrayClone.byteClone 1000 avgt 15 25.762 ± 0.203 ns/op >> ArrayClone.intArraycopy 0 avgt 15 3.199 ± 0.016 ns/op >> ArrayClone.intArraycopy 10 avgt 15 4.521 ± 0.008 ns/op >> ArrayClone.intArraycopy 100 avgt 15 17.429 ± 0.039 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 178.432 ± 0.777 ns/op >> ArrayClone.intClone 0 avgt 15 3.406 ± 0.016 ns/op >> ArrayClone.intClone 10 avgt 15 4.272 ± 
0.006 ns/op >> ArrayClone.intClone 100 avgt 15 13.110 ± 0.122 ns/op >> ArrayClone.intClone 1000 avgt 15 113.196 ± 13.400 ns/op >> >> >> It also includes an optimization to avoid instantiating the array copy stub in scenarios like this. >> >> I run hotspot compiler tests successfully limiting them to C1 compilation darwin/aarch64, linux/x86_64 and linux/686. E.g. >> >> >> $ make test TEST="hotspot_compiler" JTREG="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> ... >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:hotspot_compiler 1234 1234 0 0 >> >> >> One question I had is what to do about non-primitive object arrays, see my [question](https://bugs.openjdk.org/browse/JDK-8302850?focusedId=14634879&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14634879) on the issue. @cl4es any thoughts? >> >>... > > Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: > > - Merge branch 'master' into topic.0131.c1-array-clone > - Merge branch 'master' into topic.0131.c1-array-clone > - Reserve necessary frame map space for clone use cases > - 8302850: C1 primitive array clone intrinsic in graph > > * Combine array length, new type array and arraycopy for clone in c1 graph. > * Add OmitCheckFlags to skip arraycopy checks. > * Instantiate ArrayCopyStub only if necessary. > * Avoid zeroing newly created arrays for clone. > * Add array null after c1 clone compilation test. > * Pass force reexecute to intrinsic via value stack. > This is needed to be able to deoptimize correctly this intrinsic. > * When new type array or array copy are used for the clone intrinsic, > their state needs to be based on the state before for deoptimization > to work as expected. > - Revert "8302850: Primitive array copy C1 intrinsic for aarch64 and x86" > > This reverts commit fe5d916724614391a685bbef58ea939c84197d07. 
> - 8302850: Link code emit infos for null check and alloc array > - 8302850: Null check array before getting its length > > * Added a jtreg test to verify the null check works. > Without the fix this test fails with a SEGV crash. > - 8302850: Force reexecuting clone in case of a deoptimization > > * Copy state including locals for clone > so that reexecution works as expected. > - 8302850: Avoid instantiating array copy stub for clone use cases > - 8302850: Primitive array copy C1 intrinsic for aarch64 and x86 > > * Clone calls that involve Phi nodes are not supported. > * Add unimplemented stubs for other platforms. I only wanted the comments around the boilerplate force_reexecute() logic, but if you are happy with my idea to move that logic into LIRGenerator::state_for then the comment could go there. If not, I may look at it in a follow-up RFE, because I would like to get rid of the force_reexecute() hack that I added and see if I can instead tie it to the use of state_before() or ValueStack::StateBefore. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17667#issuecomment-2011084976 From dlong at openjdk.org Thu Mar 21 03:04:23 2024 From: dlong at openjdk.org (Dean Long) Date: Thu, 21 Mar 2024 03:04:23 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v7] In-Reply-To: References: Message-ID: On Wed, 20 Mar 2024 09:04:36 GMT, Galder Zamarreño wrote: >> Adding C1 intrinsic for primitive array clone invocations for aarch64 and x86 architectures. >> >> The intrinsic includes a change to avoid zeroing the newly allocated array because its contents are copied over within the same intrinsic with arraycopy. This means that the performance of primitive array clone exceeds that of primitive array copy. 
As an example, here are the microbenchmark results on darwin/aarch64: >> >> >> $ make test TEST="micro:java.lang.ArrayClone" MICRO="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 3.476 ± 0.018 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 3.740 ± 0.017 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 7.124 ± 0.010 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 39.301 ± 0.106 ns/op >> ArrayClone.byteClone 0 avgt 15 3.478 ± 0.008 ns/op >> ArrayClone.byteClone 10 avgt 15 3.562 ± 0.007 ns/op >> ArrayClone.byteClone 100 avgt 15 5.888 ± 0.206 ns/op >> ArrayClone.byteClone 1000 avgt 15 25.762 ± 0.203 ns/op >> ArrayClone.intArraycopy 0 avgt 15 3.199 ± 0.016 ns/op >> ArrayClone.intArraycopy 10 avgt 15 4.521 ± 0.008 ns/op >> ArrayClone.intArraycopy 100 avgt 15 17.429 ± 0.039 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 178.432 ± 0.777 ns/op >> ArrayClone.intClone 0 avgt 15 3.406 ± 0.016 ns/op >> ArrayClone.intClone 10 avgt 15 4.272 ± 0.006 ns/op >> ArrayClone.intClone 100 avgt 15 13.110 ± 0.122 ns/op >> ArrayClone.intClone 1000 avgt 15 113.196 ± 13.400 ns/op >> >> >> It also includes an optimization to avoid instantiating the array copy stub in scenarios like this. >> >> I run hotspot compiler tests successfully limiting them to C1 compilation darwin/aarch64, linux/x86_64 and linux/686. E.g. >> >> >> $ make test TEST="hotspot_compiler" JTREG="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> ... >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:hotspot_compiler 1234 1234 0 0 >> >> >> One question I had is what to do about non-primitive object arrays, see my [question](https://bugs.openjdk.org/browse/JDK-8302850?focusedId=14634879&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14634879) on the issue. @cl4es any thoughts? >> >>... > > Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 10 commits: > > - Merge branch 'master' into topic.0131.c1-array-clone > - Merge branch 'master' into topic.0131.c1-array-clone > - Reserve necessary frame map space for clone use cases > - 8302850: C1 primitive array clone intrinsic in graph > > * Combine array length, new type array and arraycopy for clone in c1 graph. > * Add OmitCheckFlags to skip arraycopy checks. > * Instantiate ArrayCopyStub only if necessary. > * Avoid zeroing newly created arrays for clone. > * Add array null after c1 clone compilation test. > * Pass force reexecute to intrinsic via value stack. > This is needed to be able to deoptimize correctly this intrinsic. > * When new type array or array copy are used for the clone intrinsic, > their state needs to be based on the state before for deoptimization > to work as expected. > - Revert "8302850: Primitive array copy C1 intrinsic for aarch64 and x86" > > This reverts commit fe5d916724614391a685bbef58ea939c84197d07. > - 8302850: Link code emit infos for null check and alloc array > - 8302850: Null check array before getting its length > > * Added a jtreg test to verify the null check works. > Without the fix this test fails with a SEGV crash. > - 8302850: Force reexecuting clone in case of a deoptimization > > * Copy state including locals for clone > so that reexecution works as expected. > - 8302850: Avoid instantiating array copy stub for clone use cases > - 8302850: Primitive array copy C1 intrinsic for aarch64 and x86 > > * Clone calls that involve Phi nodes are not supported. > * Add unimplemented stubs for other platforms. src/hotspot/share/c1/c1_GraphBuilder.cpp line 2146: > 2144: ciType* receiver_type; > 2145: if (target->get_Method()->intrinsic_id() == vmIntrinsics::_clone && > 2146: ((receiver_type = state()->stack_at(state()->stack_size() - inline_target->arg_size())->exact_type()) == nullptr || // clone target is phi I don't think target-specific logic belongs here. 
And I don't understand the point about Phi nodes. Isn't the holder_known flag enough? For primitive arrays, isn't it true that inline_target->get_Method()->intrinsic_id() == vmIntrinsics::_clone? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17667#discussion_r1533190119 From amitkumar at openjdk.org Thu Mar 21 04:42:20 2024 From: amitkumar at openjdk.org (Amit Kumar) Date: Thu, 21 Mar 2024 04:42:20 GMT Subject: RFR: 8328633: s390x: Improve vectorization of Math.sqrt() on floats In-Reply-To: References: Message-ID: On Wed, 20 Mar 2024 17:39:09 GMT, Sidraya Jayagond wrote: > [JDK-8190800](https://bugs.openjdk.org/browse/JDK-8190800) added `VSqrtF` and `SqrtF` nodes to support the vectorization of Math.sqrt() on floats. For the s390x port, however, the scalar version of `sqrtF` still uses the old match rule that converts Float to Double first. It can be simplified to just use `SqrtF`. > > The old match rule also affects the vectorization of Math.sqrt() on float. LGTM. I have tested: "tier1 on fastdebug" results are clean. ------------- Marked as reviewed by amitkumar (Committer). PR Review: https://git.openjdk.org/jdk/pull/18406#pullrequestreview-1950745302 From duke at openjdk.org Thu Mar 21 05:35:43 2024 From: duke at openjdk.org (Joshua Cao) Date: Thu, 21 Mar 2024 05:35:43 GMT Subject: RFR: 8323220: Reassociate loop invariants involved in Cmps and Add/Subs [v12] In-Reply-To: References: Message-ID: <4mDyuFKAjy0QAzc9wlF0a-0C1dPvFFxDEW4CMId9s1U=.54bf4235-87c5-42ed-965b-a6332161f794@github.com> > // inv1 == (x + inv2) => ( inv1 - inv2 ) == x > // inv1 == (x - inv2) => ( inv1 + inv2 ) == x > // inv1 == (inv2 - x) => (-inv1 + inv2 ) == x > > > For example, > > > fn(inv1, inv2) > while(...) > x = foobar() > if inv1 == x + inv2 > blackhole() > > > We can transform this into > > > fn(inv1, inv2) > t = inv1 - inv2 > while(...) 
> x = foobar() > if t == x > blackhole() > > > Here is an example: https://github.com/openjdk/jdk/blob/b78896b9aafcb15f453eaed6e154a5461581407b/src/java.base/share/classes/java/lang/invoke/LambdaFormEditor.java#L910. LHS `1` and RHS `pos` are both loop invariant > > Passes tier1 locally on Linux machine. Passes GHA on my fork. Joshua Cao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 15 additional commits since the last revision: - Add tests for add/sub reassociation - Merge branch 'master' into licm - Make inputs deterministic. Make size an arg. Fix comments. Formatting. - Update test to utilize @setup method for arguments - Merge branch 'master' into licm - Add correctness test for some random tests with random inputs - Add some correctness tests where we do reassociate - Remove unused TestInfo parameter. Have some tests exit mid-loop. - Merge branch 'master' into licm - Small fixes and add check methods for tests - ... 
and 5 more: https://git.openjdk.org/jdk/compare/722ceb32...b151293d ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17375/files - new: https://git.openjdk.org/jdk/pull/17375/files/3e573b08..b151293d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17375&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17375&range=10-11 Stats: 413570 lines in 2936 files changed: 34155 ins; 88370 del; 291045 mod Patch: https://git.openjdk.org/jdk/pull/17375.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17375/head:pull/17375 PR: https://git.openjdk.org/jdk/pull/17375 From duke at openjdk.org Thu Mar 21 05:35:43 2024 From: duke at openjdk.org (Joshua Cao) Date: Thu, 21 Mar 2024 05:35:43 GMT Subject: RFR: 8323220: Reassociate loop invariants involved in Cmps and Add/Subs [v5] In-Reply-To: References: <0mSC33e8Dm1pwOo_xlx48AwfkB1C9ZNIVqD8UdSW07U=.866a7c2a-59cf-4bab-8bda-dcd8a3f337de@github.com> Message-ID: On Fri, 1 Mar 2024 05:41:14 GMT, Emanuel Peter wrote: >>> One more concern I just had: do we have tests for the pre-existing Add/Sub reassociations? >> >> Not that I know of. A bunch of reassociation was added in https://github.com/openjdk/jdk/commit/23ed3a9e91ac57295d274fefdf6c0a322b1e87b7, which does not have any tests. >> >> I ran `make CONF=linux-x86_64-server-fastdebug test TEST=all TEST_VM_OPTS=-XX:-TieredCompilation` on my Linux machine. I have 4 failures in `SctpChannel` and 3 failures in `CAInterop.java`, but they also fail on master branch so they should not be caused by this patch. Hopefully this adds a little more confidence. > > @caojoshua > > I also ran our internal testing and it looks ok (only unrelated failures). But of course that is only on tests that we have, and if the other reassociations are not tested, then that helps little ;) > >> Not that I know of. A bunch of reassociation was added in https://github.com/openjdk/jdk/commit/23ed3a9e91ac57295d274fefdf6c0a322b1e87b7, which does not have any tests. 
> > Could you please add a result verification test per case of pre-existing reassociation? Otherwise I'm afraid it is hard to be sure you did not break those cases. @eme64 I added a bunch of tests for add/sub reassociation. I made them IR tests since there are some cases where the IR matching is interesting. I added node matching to each for completeness. But I'd say more importantly they test correctness. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17375#issuecomment-2011252099 From dholmes at openjdk.org Thu Mar 21 06:05:19 2024 From: dholmes at openjdk.org (David Holmes) Date: Thu, 21 Mar 2024 06:05:19 GMT Subject: RFR: 8328614: hsdis: dlsym can't find decode symbol In-Reply-To: References: Message-ID: On Wed, 20 Mar 2024 16:17:36 GMT, Robbin Ehn wrote: > Hi, please consider. > > [8327045](https://bugs.openjdk.org/browse/JDK-8327045) hide these symbols. > Tested with gcc and clang, and llvm and binutils backend. > > I didn't find any use of the "DLL_ENTRY", so I removed it. > > Thanks, Robbin src/hotspot/cpu/x86/.nativeInst_x86.cpp.swp line 1: > 1: b0VIM 9.0E??e????rehnrehn-black~rehn/source/jdk/vanilla/src/hotspot/cpu/x86/nativeInst_x86.cpp This file should not have been committed :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18400#discussion_r1533288984 From thartmann at openjdk.org Thu Mar 21 06:48:19 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 21 Mar 2024 06:48:19 GMT Subject: RFR: 8320404: Double whitespace in SubTypeCheckNode::dump_spec output In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 07:41:29 GMT, Koichi Sakata wrote: > This is a trivial change to remove an extra whitespace. > > A double whitespace is printed because method->print_short_name already adds a whitespace before the name. > > ### Test > > For testing, I modified the ProfileAtTypeCheck class to fail a test case and display the message. Specifically, I changed the number of the count element in the IR annotation below. 
> > > @Test > @IR(phase = { CompilePhase.AFTER_PARSING }, counts = { IRNode.SUBTYPE_CHECK, "1" }) > @IR(phase = { CompilePhase.AFTER_MACRO_EXPANSION }, counts = { IRNode.CMP_P, "5", IRNode.LOAD_KLASS_OR_NKLASS, "2", IRNode.PARTIAL_SUBTYPE_CHECK, "1" }) > public static void test15(Object o) { > > > This change was only for testing, so I reverted back to the original code after the test. > > #### Execution Result > > Before the change: > > $ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java" > ... > Failed IR Rules (1) of Methods (1) > ---------------------------------- > 1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, ap > plyIfAnd={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\d+(\s){2}(SubTypeCheck.*)+(\s){2}===.*)" > - Failed comparison: [found] 1 = 11 [given] > - Matched node: > * 53 SubTypeCheck === _ 44 35 [[ 58 ]] profiled at: compiler.c2.irTests.ProfileAtTypeCheck::test15:5 !jvms: ProfileAtTypeCheck::test15 @ bci:5 (line 399) > > > After the change: > > $ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java" > ... 
> Failed IR Rules (1) of Methods (1) > ---------------------------------- > 1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, ap > plyIfAnd={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Cons... Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18181#pullrequestreview-1950870081 From rehn at openjdk.org Thu Mar 21 06:55:20 2024 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 21 Mar 2024 06:55:20 GMT Subject: RFR: 8328614: hsdis: dlsym can't find decode symbol In-Reply-To: References: Message-ID: On Thu, 21 Mar 2024 06:02:56 GMT, David Holmes wrote: >> Hi, please consider. >> >> [8327045](https://bugs.openjdk.org/browse/JDK-8327045) hide these symbols. >> Tested with gcc and clang, and llvm and binutils backend. >> >> I didn't find any use of the "DLL_ENTRY", so I removed it. >> >> Thanks, Robbin > > src/hotspot/cpu/x86/.nativeInst_x86.cpp.swp line 1: > >> 1: b0VIM 9.0E??e????rehnrehn-black~rehn/source/jdk/vanilla/src/hotspot/cpu/x86/nativeInst_x86.cpp > > This file should not have been committed :) Ops, thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18400#discussion_r1533324714 From rehn at openjdk.org Thu Mar 21 06:58:43 2024 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 21 Mar 2024 06:58:43 GMT Subject: RFR: 8328614: hsdis: dlsym can't find decode symbol [v2] In-Reply-To: References: Message-ID: > Hi, please consider. > > [8327045](https://bugs.openjdk.org/browse/JDK-8327045) hide these symbols. > Tested with gcc and clang, and llvm and binutils backend. 
> > I didn't find any use of the "DLL_ENTRY", so I removed it. > > Thanks, Robbin Robbin Ehn has updated the pull request incrementally with one additional commit since the last revision: remove swap file ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18400/files - new: https://git.openjdk.org/jdk/pull/18400/files/65e3cb17..28862745 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18400&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18400&range=00-01 Stats: 0 lines in 1 file changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18400.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18400/head:pull/18400 PR: https://git.openjdk.org/jdk/pull/18400 From epeter at openjdk.org Thu Mar 21 08:15:28 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Mar 2024 08:15:28 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset In-Reply-To: References: Message-ID: <74om94T5LWhErz63zxn_FIh-0Q_aPrSefP-o2FzpllA=.6395fed9-2098-4688-97f2-354fa53239cd@github.com> On Wed, 13 Mar 2024 14:25:57 GMT, Emanuel Peter wrote: > I'm refactoring the packset, separating the details of packset-manupulation from the SuperWord algorithm. > > Most importantly: I split it into two classes: `PairSet` and `PackSet`. > `combine_pairs_to_longer_packs` converts the first into the second. > > I was able to simplify the combining, and remove the pack-sorting. > I now walk "pair-chains" directly with `PairSetIterator`. One such pair-chain is equivalent to a pack. > > I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm. 
> > I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`:
> Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. we would pack `ac bd` instead of `ab cd`). Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed, if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)).
>
> But since I already spent some time on some of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I.
>
> More details are described in the annotations in the code.

src/hotspot/share/opto/superword.cpp line 1047:

> 1045: Node* s2 = _pairset.right_at(i);
> 1046: changed |= extend_pairset_with_more_pairs_by_following_def(s1, s2);
> 1047: changed |= extend_pairset_with_more_pairs_by_following_use(s1, s2);

Note: could not use the `PairSetIterator`; it changed the order of extension. Keeping the old insertion order leads to something closer to a BFS, which seems to be more successful on the IR tests we have. That does not mean that this is an optimal solution. After all, SuperWord is a greedy algorithm. I hope to implement something more optimal in the future.
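[Editor's note] The "pair-chain" idea from the quoted summary (`combine_pairs_to_longer_packs` turning a `PairSet` into a `PackSet`) can be sketched roughly as follows. This is a simplified, hypothetical Java model, not the C2 code: the real implementation works on node indices inside the loop body, while here strings stand in for nodes and a map records each pair's left-to-right link.

```java
import java.util.*;

public class PairChains {
    // Combine adjacent pairs (left -> right) into maximal chains ("packs").
    // A node that appears as some pair's right side cannot start a chain,
    // so every remaining left is the head of exactly one pair-chain.
    static List<List<String>> combinePairsToPacks(Map<String, String> rightOf) {
        Set<String> rights = new HashSet<>(rightOf.values());
        List<List<String>> packs = new ArrayList<>();
        for (String left : rightOf.keySet()) {
            if (rights.contains(left)) {
                continue; // interior node, not a chain head
            }
            List<String> pack = new ArrayList<>();
            String n = left;
            pack.add(n);
            while (rightOf.containsKey(n)) { // follow the chain to its end
                n = rightOf.get(n);
                pack.add(n);
            }
            packs.add(pack);
        }
        return packs;
    }

    public static void main(String[] args) {
        // Pairs (a,b), (b,c), (c,d) combine into the single pack [a, b, c, d].
        Map<String, String> pairs = Map.of("a", "b", "b", "c", "c", "d");
        System.out.println(combinePairsToPacks(pairs)); // [[a, b, c, d]]
    }
}
```

One walk per chain head visits each pair exactly once, which is why no pack-sorting step is needed in this formulation.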
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1533404492 From epeter at openjdk.org Thu Mar 21 08:19:33 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Mar 2024 08:19:33 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 14:25:57 GMT, Emanuel Peter wrote: > I'm refactoring the packset, separating the details of packset-manupulation from the SuperWord algorithm. > > Most importantly: I split it into two classes: `PairSet` and `PackSet`. > `combine_pairs_to_longer_packs` converts the first into the second. > > I was able to simplify the combining, and remove the pack-sorting. > I now walk "pair-chains" directly with `PairSetIterator`. One such pair-chain is equivalent to a pack. > > I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm. > > I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`: > Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. we would pack `ac bd` instead of `ab cd`). Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed, if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)). > > But since I already spent some time on some of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I. > > More details are described in the annotations in the code. 
src/hotspot/share/opto/superword.cpp line 1199:

> 1197: if (num_s1_uses > 1) {
> 1198: _race_possible = true;
> 1199: }

Note: I removed `_race_possible` and `num_s1_uses`. We checked here if there was any node in a pack that has multiple uses. If that happens, it is possible that `order_inputs_of_uses_to_match_def_pair` changes the order of inputs. This flag was set to fix everything again afterwards. But now I just call that algorithm always, implicitly always setting `_race_possible = true`. This does not cost so much, and makes things quite a bit simpler.

src/hotspot/share/opto/superword.cpp line 1309:

> 1307: auto adjacent_profit = [&] (Node* s1, Node* s2) { return 2; };
> 1308: auto pack_cost = [&] (int ct) { return ct; };
> 1309: auto unpack_cost = [&] (int ct) { return ct; };

Note: I moved the methods from the `SuperWord` class to lambdas in this method; they were not used anywhere else.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1533408297 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1533409251 From epeter at openjdk.org Thu Mar 21 08:29:33 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Mar 2024 08:29:33 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 14:25:57 GMT, Emanuel Peter wrote: > I'm refactoring the packset, separating the details of packset-manipulation from the SuperWord algorithm. > > Most importantly: I split it into two classes: `PairSet` and `PackSet`. > `combine_pairs_to_longer_packs` converts the first into the second. > > I was able to simplify the combining, and remove the pack-sorting. > I now walk "pair-chains" directly with `PairSetIterator`. One such pair-chain is equivalent to a pack. > > I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm.
> > I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`: > Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. we would pack `ac bd` instead of `ab cd`). Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed, if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)). > > But since I already spent some time on some of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I. > > More details are described in the annotations in the code. src/hotspot/share/opto/superword.hpp line 101: > 99: int length() const { return _lefts_in_insertion_order.length(); } > 100: Node* left_at(int i) const { return _body.body().at(_lefts_in_insertion_order.at(i)); } > 101: Node* right_at(int i) const { return _body.body().at(get_right_for(_lefts_in_insertion_order.at(i))); } Note: I hope to get rid of `_lefts_in_insertion_order` eventually, and then these accessors will disappear too. But for now I need them to iterate in insertion order, doing something similar to a DFS in the pair extension. src/hotspot/share/opto/superword.hpp line 250: > 248: Node_List* my_pack(const Node* n) const { return !in_bb(n) ? nullptr : _node_info.adr_at(bb_idx(n))->_my_pack; } > 249: private: > 250: void set_my_pack(Node* n, Node_List* p) { int i = bb_idx(n); grow_node_info(i); _node_info.adr_at(i)->_my_pack = p; } Note: replaced with `PackSet.pack` and `PackSet.set_pack`. 
src/hotspot/share/opto/superword.hpp line 271:

> 269: bool stmts_can_pack(Node* s1, Node* s2, int align);
> 270: // Does s exist in a pack at position pos?
> 271: bool exists_at(Node* s, uint pos);

Note: replaced with `PairSet.has_left/right`.

src/hotspot/share/opto/superword.hpp line 293:

> 291: void set_pack(const Node* n, Node_List* pack) { _node_to_pack.at_put(_body.bb_idx(n), pack); }
> 292: public:
> 293: Node_List* pack(const Node* n) const { return !_vloop.in_bb(n) ? nullptr : _node_to_pack.at(_body.bb_idx(n)); }

Note: replacement for `my_pack`.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1533416428 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1533420534 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1533421120 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1533418322 From mdoerr at openjdk.org Thu Mar 21 08:31:19 2024 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 21 Mar 2024 08:31:19 GMT Subject: RFR: 8328633: s390x: Improve vectorization of Math.sqrt() on floats In-Reply-To: References: Message-ID: On Wed, 20 Mar 2024 17:39:09 GMT, Sidraya Jayagond wrote: > [JDK-8190800](https://bugs.openjdk.org/browse/JDK-8190800) added `VSqrtF` and `SqrtF` nodes to support the vectorization of Math.sqrt() on floats. For s390x port, however, the scalar version of `sqrtF` still uses the old match rule that converts Float to Double first. It can be simplified to just use `SqrtF`. > > The old match rule also affects the vectorization of Math.sqrt() on float. LGTM. ------------- Marked as reviewed by mdoerr (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/18406#pullrequestreview-1951178078 From epeter at openjdk.org Thu Mar 21 08:32:26 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Mar 2024 08:32:26 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 14:25:57 GMT, Emanuel Peter wrote: > I'm refactoring the packset, separating the details of packset-manupulation from the SuperWord algorithm. > > Most importantly: I split it into two classes: `PairSet` and `PackSet`. > `combine_pairs_to_longer_packs` converts the first into the second. > > I was able to simplify the combining, and remove the pack-sorting. > I now walk "pair-chains" directly with `PairSetIterator`. One such pair-chain is equivalent to a pack. > > I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm. > > I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`: > Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. we would pack `ac bd` instead of `ab cd`). Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed, if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)). > > But since I already spent some time on some of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I. > > More details are described in the annotations in the code. 
test/hotspot/jtreg/compiler/loopopts/superword/TestMulAddS2I.java line 167:

> 165: // Unrolled, with the same structure.
> 166: out[i+0] += ((sArr1[2*i+0] * sArr2[2*i+0]) + (sArr1[2*i+1] * sArr2[2*i+1]));
> 167: out[i+1] += ((sArr1[2*i+2] * sArr2[2*i+2]) + (sArr1[2*i+3] * sArr2[2*i+3]));

Note: extra cases to test the `MulAddS2I` input ordering in `order_inputs_of_uses_to_match_def_pair`.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1533424403 From sjayagond at openjdk.org Thu Mar 21 08:51:20 2024 From: sjayagond at openjdk.org (Sidraya Jayagond) Date: Thu, 21 Mar 2024 08:51:20 GMT Subject: RFR: 8328633: s390x: Improve vectorization of Math.sqrt() on floats In-Reply-To: References: Message-ID: On Wed, 20 Mar 2024 17:39:09 GMT, Sidraya Jayagond wrote: > [JDK-8190800](https://bugs.openjdk.org/browse/JDK-8190800) added `VSqrtF` and `SqrtF` nodes to support the vectorization of Math.sqrt() on floats. For s390x port, however, the scalar version of `sqrtF` still uses the old match rule that converts Float to Double first. It can be simplified to just use `SqrtF`. > > The old match rule also affects the vectorization of Math.sqrt() on float. Thanks for the reviews. Test failures are not related to this change. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18406#issuecomment-2011658287 From sjayagond at openjdk.org Thu Mar 21 08:54:24 2024 From: sjayagond at openjdk.org (Sidraya Jayagond) Date: Thu, 21 Mar 2024 08:54:24 GMT Subject: Integrated: 8328633: s390x: Improve vectorization of Math.sqrt() on floats In-Reply-To: References: Message-ID: On Wed, 20 Mar 2024 17:39:09 GMT, Sidraya Jayagond wrote: > [JDK-8190800](https://bugs.openjdk.org/browse/JDK-8190800) added `VSqrtF` and `SqrtF` nodes to support the vectorization of Math.sqrt() on floats. For s390x port, however, the scalar version of `sqrtF` still uses the old match rule that converts Float to Double first. It can be simplified to just use `SqrtF`.
> > The old match rule also affects the vectorization of Math.sqrt() on float. This pull request has now been integrated. Changeset: 684678f9 Author: Sidraya Jayagond Committer: Amit Kumar URL: https://git.openjdk.org/jdk/commit/684678f9e83ed0a76541a31356894d170fd421db Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8328633: s390x: Improve vectorization of Match.sqrt() on floats Reviewed-by: amitkumar, mdoerr ------------- PR: https://git.openjdk.org/jdk/pull/18406 From epeter at openjdk.org Thu Mar 21 10:29:25 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Mar 2024 10:29:25 GMT Subject: RFR: 8320725: C2: Add "is_associative" flag for floating-point add-reduction [v3] In-Reply-To: <6KLRg7UgrEMNOU71aVTF1Pka972NReqA_wOynzEipHE=.f3e38aea-d0c0-4f09-8849-96398152ef6a@github.com> References: <6KLRg7UgrEMNOU71aVTF1Pka972NReqA_wOynzEipHE=.f3e38aea-d0c0-4f09-8849-96398152ef6a@github.com> Message-ID: On Mon, 18 Mar 2024 17:29:45 GMT, Andrew Haley wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Naming changes: replace strict/non-strict with more technical terms > > src/hotspot/share/opto/vectornode.hpp line 240: > >> 238: // >> 239: // Other reductions are associative (do not need strict ordering). >> 240: virtual bool is_associative() const { > > I think this flag may be badly named. The idea you want to express is not so much associativity, but whether such nodes should be treated as strictly ordered. It would be much less confusing to pick a name like ordered() because that describes what you want to the node to do. I think you are right. Maybe we can use `ordered/unordered`. I just want to make sure we don't have too many synonyms (ordered, unordered, associative, strictly ordered, etc). But I guess associative and ordered is not exactly a synonym. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1533600124 From epeter at openjdk.org Thu Mar 21 10:29:23 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Mar 2024 10:29:23 GMT Subject: RFR: 8320725: C2: Add "is_associative" flag for floating-point add-reduction [v3] In-Reply-To: References: Message-ID: On Tue, 19 Mar 2024 20:48:27 GMT, Bhavana Kilambi wrote:

>> I'm not super familiar with the Vector API, but I could not see that MUL is not associative.
>
> Yes, MUL is non-associative in VectorAPI just like ADD operation (according to the description here - https://docs.oracle.com/en/java/javase/19/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#fp_assoc).
>
> We found a significant perf difference between the SVE "fadda" instruction which is a strictly ordered instruction vs Neon instructions on a 128-bit SVE machine especially after this optimization - https://bugs.openjdk.org/browse/JDK-8298244 but there's no such performance difference for the MUL operation. MulReductionVF/VD do not have direct instructions for multiply reduction nor do they have separate ISA for strictly ordered or non-strictly ordered. So, currently we do not have any data that shows any benefit to add similar code for MUL and thus it's currently considered to be a non-associative operation (strictly ordered). I am not sure about other platforms.

Right. Ok, since your benchmarks are restricted to NEON/SVE, I can understand these results. But I would think that probably on x86 machines this would look different, it is just that we currently have no unordered float/double add/mul reductions. I think it would be nice if you made both Add and Mul capable of being unordered already, that would make future work in this area simpler. Or do you see a regression for unordered mul reductions on your benchmark machines?
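[Editor's note] For readers following the ordered/unordered discussion above: the whole distinction exists because floating-point addition is not associative, so a strictly ordered (left-to-right) reduction and a reordered one (as a vectorized reduction may perform) can produce different bits. A tiny, self-contained Java demonstration, not taken from the patch:

```java
public class FloatAssoc {
    public static void main(String[] args) {
        double left  = (0.1 + 0.2) + 0.3; // strict left-to-right order
        double right = 0.1 + (0.2 + 0.3); // reordered, as an unordered reduction might do

        System.out.println(left == right); // false
        System.out.println(left);          // 0.6000000000000001
        System.out.println(right);         // 0.6
    }
}
```

This is why a `MulReductionVF`-style node can only be implemented with reordered lane-wise instructions if the semantics explicitly permit it, which is exactly what the proposed flag is meant to express.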
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1533604068 From bkilambi at openjdk.org Thu Mar 21 10:59:22 2024 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 21 Mar 2024 10:59:22 GMT Subject: RFR: 8320725: C2: Add "is_associative" flag for floating-point add-reduction [v3] In-Reply-To: References: <6KLRg7UgrEMNOU71aVTF1Pka972NReqA_wOynzEipHE=.f3e38aea-d0c0-4f09-8849-96398152ef6a@github.com> Message-ID: On Thu, 21 Mar 2024 10:23:19 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/vectornode.hpp line 240: >> >>> 238: // >>> 239: // Other reductions are associative (do not need strict ordering). >>> 240: virtual bool is_associative() const { >> >> I think this flag may be badly named. The idea you want to express is not so much associativity, but whether such nodes should be treated as strictly ordered. It would be much less confusing to pick a name like ordered() because that describes what you want to the node to do. > > I think you are right. Maybe we can use `ordered/unordered`. I just want to make sure we don't have too many synonyms (ordered, unordered, associative, strictly ordered, etc). But I guess associative and ordered is not exactly a synonym. Hi, thank you for your comments on this. I personally also feel "is_associative" is a bit non-intuitive as in the reader might have to make the connection between "associativity" and "ordering" compared to the case where we directly use what we intend the variable to do, something like "is_ordered". Would it be okay if I revert this to "requires_strict_order" for the variable and method names that I used in my first commit? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1533656600 From epeter at openjdk.org Thu Mar 21 11:37:23 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Mar 2024 11:37:23 GMT Subject: RFR: 8320725: C2: Add "is_associative" flag for floating-point add-reduction [v3] In-Reply-To: References: <6KLRg7UgrEMNOU71aVTF1Pka972NReqA_wOynzEipHE=.f3e38aea-d0c0-4f09-8849-96398152ef6a@github.com> Message-ID: <5x-T1UNFkXz3aN0Zef-2h5NK2BrSuqO5NzT02-Hv3Vg=.3d6357d4-f525-4171-9641-40231f5431be@github.com> On Thu, 21 Mar 2024 10:56:40 GMT, Bhavana Kilambi wrote:

>> I think you are right. Maybe we can use `ordered/unordered`. I just want to make sure we don't have too many synonyms (ordered, unordered, associative, strictly ordered, etc). But I guess associative and ordered is not exactly a synonym.
>
> Hi, thank you for your comments on this. I personally also feel "is_associative" is a bit non-intuitive as in the reader might have to make the connection between "associativity" and "ordering" compared to the case where we directly use what we intend the variable to do, something like "is_ordered". Would it be okay if I revert this to "requires_strict_order" for the variable and method names that I used in my first commit?

Yes, I think that would be ok.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1533719899 From epeter at openjdk.org Thu Mar 21 11:38:26 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Mar 2024 11:38:26 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v8] In-Reply-To: References: Message-ID: On Wed, 20 Mar 2024 02:05:56 GMT, Jasmine Karthikeyan wrote: >> Hi all, I've created this patch which aims to convert common integer minimum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ?
a : b` and similar, as well as patterns such as `if (a > b) b = a;`. While this transform doesn't generally improve code generation it's own, it simplifies control flow and creates new opportunities for vectorization. >> >> I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> IfMinMax.testReductionInt avgt 15 500.307 ? 16.687 ns/op 509.383 ? 32.645 ns/op (no change)* >> IfMinMax.testReductionLong avgt 15 493.184 ? 17.596 ns/op 513.587 ? 28.339 ns/op (no change)* >> IfMinMax.testSingleInt avgt 15 3.588 ? 0.540 ns/op 2.965 ? 1.380 ns/op (no change) >> IfMinMax.testSingleLong avgt 15 3.673 ? 0.128 ns/op 3.506 ? 0.590 ns/op (no change) >> IfMinMax.testVectorInt avgt 15 340.425 ? 13.123 ns/op 59.689 ? 7.509 ns/op + 5.7x >> IfMinMax.testVectorLong avgt 15 326.420 ? 15.554 ns/op 117.190 ? 5.622 ns/op + 2.8x >> >> >> * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? >> >> The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Apply changes from code review and add IR test for vectorization and reduction Excellent, I think your patch now looks good! I'm running our internal testing once more. Feel free to ping me in 2 days if I don't report back until then. 
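[Editor's note] For readers unfamiliar with the shapes under discussion, these are the two if-statement forms from the quoted description that the patch turns into Min/Max nodes. This is a minimal, hypothetical Java sketch for illustration, not the actual test or benchmark code from the PR:

```java
public class IfMinMaxShapes {
    // The ternary form `a > b ? a : b`, recognizable as a maximum.
    static int maxTernary(int a, int b) {
        return a > b ? a : b;
    }

    // The statement form `if (a > b) b = a;`, also a maximum.
    static int maxIf(int a, int b) {
        if (a > b) {
            b = a;
        }
        return b;
    }

    public static void main(String[] args) {
        System.out.println(maxTernary(3, 7) == Math.max(3, 7)); // true
        System.out.println(maxIf(7, 3) == Math.max(7, 3));      // true
    }
}
```

Both bodies compute the same value as `Math.max`, which is what lets the compiler replace the control flow with a single Max node and then consider the loop for vectorization.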
------------- PR Review: https://git.openjdk.org/jdk/pull/17574#pullrequestreview-1951933489 From epeter at openjdk.org Thu Mar 21 12:02:27 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Mar 2024 12:02:27 GMT Subject: RFR: 8323220: Reassociate loop invariants involved in Cmps and Add/Subs [v12] In-Reply-To: <4mDyuFKAjy0QAzc9wlF0a-0C1dPvFFxDEW4CMId9s1U=.54bf4235-87c5-42ed-965b-a6332161f794@github.com> References: <4mDyuFKAjy0QAzc9wlF0a-0C1dPvFFxDEW4CMId9s1U=.54bf4235-87c5-42ed-965b-a6332161f794@github.com> Message-ID: On Thu, 21 Mar 2024 05:35:43 GMT, Joshua Cao wrote: >> // inv1 == (x + inv2) => ( inv1 - inv2 ) == x >> // inv1 == (x - inv2) => ( inv1 + inv2 ) == x >> // inv1 == (inv2 - x) => (-inv1 + inv2 ) == x >> >> >> For example, >> >> >> fn(inv1, inv2) >> while(...) >> x = foobar() >> if inv1 == x + inv2 >> blackhole() >> >> >> We can transform this into >> >> >> fn(inv1, inv2) >> t = inv1 - inv2 >> while(...) >> x = foobar() >> if t == x >> blackhole() >> >> >> Here is an example: https://github.com/openjdk/jdk/blob/b78896b9aafcb15f453eaed6e154a5461581407b/src/java.base/share/classes/java/lang/invoke/LambdaFormEditor.java#L910. LHS `1` and RHS `pos` are both loop invariant >> >> Passes tier1 locally on Linux machine. Passes GHA on my fork. > > Joshua Cao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 15 additional commits since the last revision: > > - Add tests for add/sub reassociation > - Merge branch 'master' into licm > - Make inputs deterministic. Make size an arg. Fix comments. Formatting. > - Update test to utilize @setup method for arguments > - Merge branch 'master' into licm > - Add correctness test for some random tests with random inputs > - Add some correctness tests where we do reassociate > - Remove unused TestInfo parameter. Have some tests exit mid-loop. 
> - Merge branch 'master' into licm > - Small fixes and add check methods for tests > - ... and 5 more: https://git.openjdk.org/jdk/compare/d6e69638...b151293d Just two little details, then we should be ready! Ping me once you updated it, and I will run our internal testing one more time before approving! test/hotspot/jtreg/compiler/loopopts/InvariantCodeMotionReassociateAddSub.java line 36: > 34: * @summary Test loop invariant code motion of add/sub through reassociation > 35: * @library /test/lib / > 36: * @run driver compiler.c2.loopopts.InvariantCodeMotionReassociateAddSub Suggestion: * @run main compiler.c2.loopopts.InvariantCodeMotionReassociateAddSub Otherwise external flags are not passed in. test/hotspot/jtreg/compiler/loopopts/InvariantCodeMotionReassociateCmp.java line 36: > 34: * @summary Test loop invariant code motion for cmp nodes through reassociation > 35: * @library /test/lib / > 36: * @run driver compiler.c2.loopopts.InvariantCodeMotionReassociateCmp Suggestion: * @run main compiler.c2.loopopts.InvariantCodeMotionReassociateCmp Just like above ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17375#pullrequestreview-1951942814 PR Review Comment: https://git.openjdk.org/jdk/pull/17375#discussion_r1533728671 PR Review Comment: https://git.openjdk.org/jdk/pull/17375#discussion_r1533760799 From epeter at openjdk.org Thu Mar 21 12:10:27 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Mar 2024 12:10:27 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> <0RKnLUgc6UBtyxSyezCMWsSbP50hu6fQ6UJPHpGlgSU=.9fafa10f-62ee-4ec8-9093-4e204fcbe504@github.com> Message-ID: On Thu, 7 Mar 2024 08:12:46 GMT, Roland Westrelin wrote: >> @rwestrel nice! I'll run our testing again, now that it is merged. 
>> >> FYI: you have some whitespace issues in: >> `src/hotspot/share/opto/callGenerator.cpp` > >> FYI: you have some whitespace issues in: `src/hotspot/share/opto/callGenerator.cpp` > > Thanks. I missed it. Fixed now. @rwestrel sorry for the extremely slow reaction here. I ran testing for Commit 16, and got this failure. It does not seem to depend on the platform, get failures with x84, aarch, linux, windows. I see these two extra flag combinations lead to failures: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` `-server -Xcomp` Test Failures (1) ----------------- Custom Run Test: @Run: testFastPath15Runner - @Test: testFastPath15: compiler.lib.ir_framework.shared.TestRunException: There was an error while invoking @Run method private void compiler.c2.irTests.TestScopedValue.testFastPath15Runner() throws java.lang.Exception at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:162) at compiler.lib.ir_framework.test.CustomRunTest.run(CustomRunTest.java:87) at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:861) at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:252) at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:165) Caused by: java.lang.reflect.InvocationTargetException at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:118) at java.base/java.lang.reflect.Method.invoke(Method.java:580) at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:159) ... 4 more Caused by: java.lang.RuntimeException: should have deoptimized at compiler.c2.irTests.TestScopedValue.testFastPath15Runner(TestScopedValue.java:490) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ... 
6 more at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:896) at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:252) at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:165) ############################################################# - To only run the failed tests use -DTest, -DExclude, and/or -DScenarios. - To also get the standard output of the test VM run with -DReportStdout=true or for even more fine-grained logging use -DVerbose=true. ############################################################# ------------- PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-2012103186 From chagedorn at openjdk.org Thu Mar 21 13:14:27 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 21 Mar 2024 13:14:27 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> <0RKnLUgc6UBtyxSyezCMWsSbP50hu6fQ6UJPHpGlgSU=.9fafa10f-62ee-4ec8-9093-4e204fcbe504@github.com> Message-ID: On Thu, 21 Mar 2024 12:07:37 GMT, Emanuel Peter wrote: > I see these two extra flag combinations lead to failures: > > `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > > `-server -Xcomp` You can try to use `TestFramework::assertDeoptimizedByC2()` which skips the assertion for some unstable setups like having `PerMethodTrapLimit == 0`: https://github.com/openjdk/jdk/blob/700d2b91defd421a2818f53830c24f70d11ba4f6/test/hotspot/jtreg/compiler/lib/ir_framework/test/TestVM.java#L943-L956 ------------- PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-2012258649 From epeter at openjdk.org Thu Mar 21 13:44:23 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Mar 2024 13:44:23 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If In-Reply-To: References: 
Message-ID: On Thu, 14 Mar 2024 07:10:30 GMT, Christian Hagedorn wrote: > This is a follow-up to the previous refactoring done in https://github.com/openjdk/jdk/pull/18080. The patch starts to replace the usages of `create_bool_from_template_assertion_predicate()` by providing a refactored and fixed cloning algorithm. > > #### How `create_bool_from_template_assertion_predicate()` Works > Currently, the algorithm in `create_bool_from_template_assertion_predicate()` uses an iterative DFS walk to find all nodes of a Template Assertion Predicate Expression in order to clone them. We do the following: > 1. Follow all inputs if they could be a node that's part of a Template Assertion Predicate (compares opcodes): > https://github.com/openjdk/jdk/blob/326c91e1a28ec70822ef927ee9ab17f79aa6d35c/src/hotspot/share/opto/loopTransform.cpp#L1513 > > 2. Once we find an `OpaqueLoopInit` or `OpaqueLoopStride` node, we start backtracking in the DFS. While doing so, we start to clone all nodes on the path from the `OpaqueLoop*Nodes` node to the start node and already update the graph. This logic is quite complex and difficult to understand since we do everything simultaneously. This was one of the reasons, I've originally tried to refactor this method in https://github.com/openjdk/jdk/pull/16877 because I needed to extend it for the full fix of Assertion Predicates in JDK-8288981. > > #### Missing Visited Set > The current implementation of `create_bool_from_template_assertion_predicate()` does not use a visited set. This means that whenever we find a diamond shape, we could visit a node twice and re-discover all paths above this diamond again: > > > ... > | > E > | > D > / \ > B C > \ / > A > > DFS walk: A -> B -> D -> E -> ... -> C -> D -> E -> ... > > With each diamond, the number of revisits of each node above doubles. 
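[Editor's note] The doubling described in the quoted section is easy to reproduce outside of C2. Below is a hypothetical, self-contained Java model (plain integers stand in for C2 nodes): a chain of stacked diamonds where a DFS following inputs without a visited set makes exponentially many calls, while the same walk with a visited set makes one call per edge.

```java
import java.util.*;

public class DiamondRevisits {
    // Count dfs() calls when following "inputs" upward from a node.
    // With visited == null there is no visited set, so each diamond
    // causes everything above it to be re-explored.
    static long dfs(int node, Map<Integer, List<Integer>> inputs, Set<Integer> visited) {
        long calls = 1;
        if (visited != null && !visited.add(node)) {
            return calls; // already seen: stop here
        }
        for (int in : inputs.getOrDefault(node, List.of())) {
            calls += dfs(in, inputs, visited);
        }
        return calls;
    }

    // Build a chain of diamonds: A has inputs B and C, both of which have
    // input D (as in the quoted ASCII graph); D is the A of the next diamond.
    static Map<Integer, List<Integer>> diamondChain(int diamonds) {
        Map<Integer, List<Integer>> inputs = new HashMap<>();
        int id = 0;
        int bottom = id++;
        for (int d = 0; d < diamonds; d++) {
            int left = id++, right = id++, top = id++;
            inputs.put(bottom, List.of(left, right));
            inputs.put(left, List.of(top));
            inputs.put(right, List.of(top));
            bottom = top;
        }
        return inputs;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> g = diamondChain(20);
        System.out.println(dfs(0, g, null));            // 4194301 = 2^(20+2) - 3 calls
        System.out.println(dfs(0, g, new HashSet<>())); // 81 = 1 + 4*20 calls (one per edge)
    }
}
```

With d diamonds the unguarded walk satisfies C(d) = 3 + 2*C(d-1), i.e. 2^(d+2) - 3 calls, which matches the "doubles with each diamond" observation; the guarded walk is linear in the number of edges.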
> > #### Endless DFS in Edge-Cases > In most cases, we would normally just stop quite quickly once we follow a data node that is not part of a Template Assertion Predicate Expression because the node opcode is different. However, in the test cases, we create a long chain of data nodes with many diamonds that could all be part of a Template Assertion Predicate Expression (i.e. `is_part_of_template_assertion_predicate_bool()` would return true to follow the inputs in a DFS walk). As a result, the DFS revisits a lot of nodes, especially higher up in the graph, exponentially many times and compilation is stuck for a long time (running the test cases result in a test timeout because background compilation is disabled). > > #### New DFS Implem... This looks fantastic, great work :) src/hotspot/share/opto/loopnode.hpp line 1937: > 1935: } > 1936: > 1937: // Create a copy of the provided data node collection by doing the following: By "collection" you mean the `_data_nodes`, right? maybe some renaming could be helpful? src/hotspot/share/opto/loopopts.cpp line 4542: > 4540: > 4541: void DataNodeGraph::transform_opaque_node(TransformStrategyForOpaqueLoopNodes& transform_strategy, Node* node) { > 4542: const uint next_idx = _phase->C->unique(); Does this have any use? src/hotspot/share/opto/predicates.cpp line 185: > 183: // The class takes a node filter to decide which input nodes to follow and a target node predicate to start backtracking > 184: // from. All nodes found on all paths from source->target(s) returned in a Unique_Node_List (without duplicates). > 185: class DataNodesOnPathToTargets : public StackObj { Suggestion: class DataNodesOnPathsToTargets : public StackObj { Just a nit: but you do want all paths. src/hotspot/share/opto/predicates.cpp line 189: > 187: > 188: NodeCheck _node_filter; // Node filter function to decide if we should process a node or not while searching for targets. 
> 189: NodeCheck _is_target_node; // Function to decide if a node is a target node (i.e. where we should start backtracking). There should be some remark that all target nodes must pass the filter. src/hotspot/share/opto/predicates.cpp line 218: > 216: pop_target_node_and_collect_predecessor(); > 217: } else if (!push_next_unvisited_input()) { > 218: // All inputs visited. Continue backtracking. `push_next_unvisited_input`: A `enum` with 2 values would make it more explicit than a bool: `HasMoreUnvisitedInputs` and `AllInputsAreVisited`. Then it makes more sense at the use site, because you can just check for `AllInputsAreVisited`, and it is immediately clear that you can continue with backtacking. src/hotspot/share/opto/predicates.cpp line 263: > 261: for (uint i = next_unvisited_input_index; i < current->req(); i++) { > 262: Node* input = current->in(i); > 263: if (_node_filter(input)) { you could check that if the filter does not pass, then also the target-node criterion fails. src/hotspot/share/opto/predicates.cpp line 287: > 285: // Push the next unvisited node in the DFS order with index 1 since this node needs to visit all its inputs. > 286: void push_unvisited_node(Node* next_to_visit) { > 287: _stack.push(next_to_visit, 1); Add assert that it was not visited yet! src/hotspot/share/opto/predicates.cpp line 301: > 299: } > 300: } > 301: }; I think this is now correct. But it is 100 lines to perform and explain this DFS with backtracking. On the other hand doing 2 BFS would just be 20+ lines. Unique_Node_List collected; Unique_Node_List input_traversal; input_traversal.push(start_node); for (int i = 0; i < input_traversal.length(); i++) { Node* n = input_traversal.at(i); for (int j = 1; j < n->req(); j++) { Node* input = n->in(j); if (_is_target_node(input)) { collected.push(input); // mark as target, where we start backtracking. } else if(_filter(input)) { input_traversal.push(input); // continue BFS. 
} } } assert(!collected.is_empty(), "must find some targets"); for (int i = 0; i < collected.length(); i++) { Node* n = collected.at(i); for (output : n->fastout()) { // pseudocode if (input_traversal.contains(output)) { collected.push(output); // backtrack through nodes of input traversal } } } assert(collected.contains(start_node), "must find start node again"); src/hotspot/share/opto/predicates.cpp line 306: > 304: Opaque4Node* TemplateAssertionPredicateExpression::clone(TransformStrategyForOpaqueLoopNodes& transform_strategy, > 305: Node* new_ctrl, PhaseIdealLoop* phase) { > 306: ResourceMark rm; The ResourceMark makes me a bit nervous, in combination of a non-constant `transform_strategy`. Could the `transform_strategy` be a constant reference, maybe by making its functions also const? ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18293#pullrequestreview-1952137313 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1533856371 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1533863020 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1533878355 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1533886598 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1533884251 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1533888249 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1533895635 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1533915387 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1533923221 From epeter at openjdk.org Thu Mar 21 13:44:24 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Mar 2024 13:44:24 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If In-Reply-To: References: 
Message-ID: On Thu, 21 Mar 2024 13:31:27 GMT, Emanuel Peter wrote: >> This is a follow-up to the previous refactoring done in https://github.com/openjdk/jdk/pull/18080. The patch starts to replace the usages of `create_bool_from_template_assertion_predicate()` by providing a refactored and fixed cloning algorithm. >> >> #### How `create_bool_from_template_assertion_predicate()` Works >> Currently, the algorithm in `create_bool_from_template_assertion_predicate()` uses an iterative DFS walk to find all nodes of a Template Assertion Predicate Expression in order to clone them. We do the following: >> 1. Follow all inputs if they could be a node that's part of a Template Assertion Predicate (compares opcodes): >> https://github.com/openjdk/jdk/blob/326c91e1a28ec70822ef927ee9ab17f79aa6d35c/src/hotspot/share/opto/loopTransform.cpp#L1513 >> >> 2. Once we find an `OpaqueLoopInit` or `OpaqueLoopStride` node, we start backtracking in the DFS. While doing so, we start to clone all nodes on the path from the `OpaqueLoop*Nodes` node to the start node and already update the graph. This logic is quite complex and difficult to understand since we do everything simultaneously. This was one of the reasons, I've originally tried to refactor this method in https://github.com/openjdk/jdk/pull/16877 because I needed to extend it for the full fix of Assertion Predicates in JDK-8288981. >> >> #### Missing Visited Set >> The current implementation of `create_bool_from_template_assertion_predicate()` does not use a visited set. This means that whenever we find a diamond shape, we could visit a node twice and re-discover all paths above this diamond again: >> >> >> ... >> | >> E >> | >> D >> / \ >> B C >> \ / >> A >> >> DFS walk: A -> B -> D -> E -> ... -> C -> D -> E -> ... >> >> With each diamond, the number of revisits of each node above doubles. 
>> >> #### Endless DFS in Edge-Cases >> In most cases, we would normally just stop quite quickly once we follow a data node that is not part of a Template Assertion Predicate Expression because the node opcode is different. However, in the test cases, we create a long chain of data nodes with many diamonds that could all be part of a Template Assertion Predicate Expression (i.e. `is_part_of_template_assertion_predicate_bool()` would return true to follow the inputs in a DFS walk). As a result, the DFS revisits a lot of nodes, especially higher up in the graph, exponentially many times and compilation is stuck for a long time (running the test cases result in a test timeout because... > > src/hotspot/share/opto/predicates.cpp line 301: > >> 299: } >> 300: } >> 301: }; > > I think this is now correct. But it is 100 lines to perform and explain this DFS with backtracking. > > On the other hand doing 2 BFS would just be 20+ lines. > > > Unique_Node_List collected; > > Unique_Node_List input_traversal; > input_traversal.push(start_node); > for (int i = 0; i < input_traversal.length(); i++) { > Node* n = input_traversal.at(i); > for (int j = 1; j < n->req(); j++) { > Node* input = n->in(j); > if (_is_target_node(input)) { > collected.push(input); // mark as target, where we start backtracking. > } else if(_filter(input)) { > input_traversal.push(input); // continue BFS. > } > } > } > assert(!collected.is_empty(), "must find some targets"); > for (int i = 0; i < collected.length(); i++) { > Node* n = collected.at(i); > for (output : n->fastout()) { // pseudocode > if (input_traversal.contains(output)) { > collected.push(output); // backtrack through nodes of input traversal > } > } > } > assert(collected.contains(start_node), "must find start node again"); But this is a matter of taste. The data structures probably have roughly the same size. And also runtime is probably basically the same. 
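If it helps to visualize the trade-off, here is a small self-contained sketch of the two-pass idea from the comment above (plain C++ on a toy adjacency-list graph — the `Graph` type and `collect_on_paths` name are made up for illustration, this is not HotSpot's `Node`/`Unique_Node_List` code):

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy stand-in for C2's node graph: node i has the input edges g.inputs[i].
// 'filter' plays the role of is_part_of_template_assertion_predicate_bool()
// and 'targets' plays the role of the OpaqueLoopInit/OpaqueLoopStride nodes.
struct Graph {
  std::vector<std::vector<int>> inputs;
};

// Pass 1: worklist BFS over inputs from 'start', restricted by 'filter',
// recording target nodes. Pass 2: backtrack from the targets through the
// visited set, collecting exactly the nodes on some start->target path.
std::set<int> collect_on_paths(const Graph& g, int start,
                               const std::set<int>& filter,
                               const std::set<int>& targets) {
  std::vector<int> traversal{start};
  std::set<int> visited{start};
  std::set<int> collected;
  for (std::size_t i = 0; i < traversal.size(); ++i) {
    for (int in : g.inputs[traversal[i]]) {
      if (targets.count(in)) {
        collected.insert(in);                 // backtracking will start here
      } else if (filter.count(in) && visited.insert(in).second) {
        traversal.push_back(in);              // each node is pushed at most once
      }
    }
  }
  // Pass 2: a visited node with a collected input lies on a path itself.
  std::vector<int> work(collected.begin(), collected.end());
  for (std::size_t i = 0; i < work.size(); ++i) {
    for (int v : visited) {
      for (int in : g.inputs[v]) {
        if (in == work[i] && collected.insert(v).second) {
          work.push_back(v);                  // continue backtracking from v
        }
      }
    }
  }
  return collected;
}
```

The first pass never revisits a node (the `visited` set), which is what avoids the exponential re-discovery on diamond shapes; the second pass only walks nodes the first pass already admitted, so nodes off any start-to-target path are never collected.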
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1533917234 From epeter at openjdk.org Thu Mar 21 14:04:27 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Mar 2024 14:04:27 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store In-Reply-To: References: Message-ID: On Wed, 7 Feb 2024 18:11:52 GMT, Claes Redestad wrote: >> There already are internal APIs and `VarHandles` to enable similar optimizations, see e.g. `jdk.internal.util.ByteArray/-LittleEndian::putInt/-Long`. >> >> The very point of this RFE was to opportunistically enable similar optimizations more automatically in idiomatic java code without the need to bring out the big guns. Of course such automatic transformations will have some level of fragility and you might accidentally disable the optimization in a variety of ways (since C2 needs to draw the line somewhere) - but that's the case for many other heuristic and opportunistic optimizations. Should we not optimize counted loops or do loop unrolling because it's easy to add something that makes C2 bail out on you? >> >> Having this optimization in C2 also allows us to avoid dependencies on `VarHandles` in bootstrap sensitive code and still enable the optimization. It might also have a benefit on startup/warmup characteristics. > >> @cl4es Ok, I guess there is good motivation to keep working on this, looks like this patch here even outperforms #15990 > > Yes! A bit surprising but a great proof of concept for this optimization. I think it might be useful to analyze what the JIT is doing w.r.t inlining etc in the three variants. @cl4es @rwestrel The question is what we should now go for, I did a little thinking and here are my thoughts: We have to avoid duplication of stores. There are a few scenarios: - No control flow between the stores: simple, we just replace the old stores with the new merged one. 
- Smeared RangeChecks, of the form `RC[0], store[0], RC[3], store[1], store[2], store[3]`: also simple, we can replace the last store with the merged one, and let IGVN remove `store[1], store[2]`, and `store[0]` will sink into the false-path of `RC[3]`. - No RangeCheck smearing, or other CFG between the stores: `RC[0], store[0], RC[1], store[1], RC[2], store[2], RC[3], store[3]`. Not so simple. We can merge the 4 stores on the normal path, where all RC's pass. But we have to remove all old stores from that path. But the `RC[1], RC[2], RC[3]` false paths need some of those stores. So the only way I see is to duplicate all stores for the branches, so that we are sure that they sink out into the trap-paths. So: do we care about the non-RC smearing case? More precisely: do we expect that there will ever be out-of-bounds exceptions in the relevant code-patterns for which we are trying to merge the stores? Because if we hit out-of-bounds, then we trap, disable RC smearing, and then it gets complicated. But still doable, I think. What are the opinions on this? Any other suggestions? ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-2012383079 From roland at openjdk.org Thu Mar 21 14:34:26 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 21 Mar 2024 14:34:26 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store In-Reply-To: References: Message-ID: On Thu, 21 Mar 2024 14:01:37 GMT, Emanuel Peter wrote: > * No RangeCheck smearing, or other CFG between the stores: `RC[0], store[0], RC[1], store[1], RC[2], store[2], RC[3], store[3]`. Not so simple. We can merge the 4 stores on the normal path, where all RC's pass. But we have to remove all old stores from that path. But the `RC[1], RC[2], RC[3]` false paths need some of those stores. So the only way I see is to duplicate all stores for the branches, so that we are sure that they sink out into the trap-paths. 
I also think you need to duplicate stores. My opinion is that we want to stick with the simpler cases (your first and second bullets) unless it's obvious it doesn't cover all use cases. It's always possible to revisit the optimization down the road if it's observed that there are cases that are not covered. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-2012457389 From dlunden at openjdk.org Thu Mar 21 15:36:27 2024 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Thu, 21 Mar 2024 15:36:27 GMT Subject: RFR: 8326438: C2: assert(ld->in(1)->Opcode() == Op_LoadN) failed: Assumption invalid: input to DecodeN is not LoadN Message-ID: The [`assert`](https://github.com/openjdk/jdk/blob/6f2676dc5f09d350c359f906b07f6f6d0d17f030/src/hotspot/share/opto/graphKit.cpp#L1567) added in [JDK-8310524](https://bugs.openjdk.org/browse/JDK-8310524) is too strong and may sometimes not hold due to the [GVN transformation in `LoadNode::make`](https://github.com/openjdk/jdk/blob/8cb9b479c529c058aee50f83920db650b0c18045/src/hotspot/share/opto/memnode.cpp#L973). ### Changeset Remove the `assert`. 
### Testing
N/A

-------------

Commit messages:
 - Remove assert

Changes: https://git.openjdk.org/jdk/pull/18434/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18434&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8326438
  Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/18434.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18434/head:pull/18434

PR: https://git.openjdk.org/jdk/pull/18434

From kvn at openjdk.org  Thu Mar 21 17:19:23 2024
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Thu, 21 Mar 2024 17:19:23 GMT
Subject: RFR: 8326438: C2: assert(ld->in(1)->Opcode() == Op_LoadN) failed:
 Assumption invalid: input to DecodeN is not LoadN
In-Reply-To: 
References: 
Message-ID: 

On Thu, 21 Mar 2024 15:31:58 GMT, Daniel Lundén wrote:

> The [`assert`](https://github.com/openjdk/jdk/blob/6f2676dc5f09d350c359f906b07f6f6d0d17f030/src/hotspot/share/opto/graphKit.cpp#L1567) added in [JDK-8310524](https://bugs.openjdk.org/browse/JDK-8310524) is too strong and may sometimes not hold due to the [GVN transformation in `LoadNode::make`](https://github.com/openjdk/jdk/blob/8cb9b479c529c058aee50f83920db650b0c18045/src/hotspot/share/opto/memnode.cpp#L973).
> 
> ### Changeset
> Remove the `assert`.
> 
> ### Testing
> N/A

Good.

> Add label to JBS something `test*hard`. I don't remember its exact name.

Never mind I see you did it already.

-------------

Marked as reviewed by kvn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/18434#pullrequestreview-1952890909

From duke at openjdk.org  Thu Mar 21 18:38:36 2024
From: duke at openjdk.org (Joshua Cao)
Date: Thu, 21 Mar 2024 18:38:36 GMT
Subject: RFR: 8323220: Reassociate loop invariants involved in Cmps and
 Add/Subs [v13]
In-Reply-To: 
References: 
Message-ID: 

> // inv1 == (x + inv2) => ( inv1 - inv2 ) == x
> // inv1 == (x - inv2) => ( inv1 + inv2 ) == x
> // inv1 == (inv2 - x) => (-inv1 + inv2 ) == x
> 
> 
> For example,
> 
> 
> fn(inv1, inv2)
> while(...)
> x = foobar() > if inv1 == x + inv2 > blackhole() > > > We can transform this into > > > fn(inv1, inv2) > t = inv1 - inv2 > while(...) > x = foobar() > if t == x > blackhole() > > > Here is an example: https://github.com/openjdk/jdk/blob/b78896b9aafcb15f453eaed6e154a5461581407b/src/java.base/share/classes/java/lang/invoke/LambdaFormEditor.java#L910. LHS `1` and RHS `pos` are both loop invariant > > Passes tier1 locally on Linux machine. Passes GHA on my fork. Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: @run driver -> @run main ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17375/files - new: https://git.openjdk.org/jdk/pull/17375/files/b151293d..33e34b03 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17375&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17375&range=11-12 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/17375.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17375/head:pull/17375 PR: https://git.openjdk.org/jdk/pull/17375 From duke at openjdk.org Thu Mar 21 18:38:36 2024 From: duke at openjdk.org (Joshua Cao) Date: Thu, 21 Mar 2024 18:38:36 GMT Subject: RFR: 8323220: Reassociate loop invariants involved in Cmps and Add/Subs [v5] In-Reply-To: References: <0mSC33e8Dm1pwOo_xlx48AwfkB1C9ZNIVqD8UdSW07U=.866a7c2a-59cf-4bab-8bda-dcd8a3f337de@github.com> Message-ID: On Fri, 1 Mar 2024 05:41:14 GMT, Emanuel Peter wrote: >>> One more concern I just had: do we have tests for the pre-existing Add/Sub reassociations? >> >> Not that I know of. A bunch of reassociation was added in https://github.com/openjdk/jdk/commit/23ed3a9e91ac57295d274fefdf6c0a322b1e87b7, which does not have any tests. >> >> I ran `make CONF=linux-x86_64-server-fastdebug test TEST=all TEST_VM_OPTS=-XX:-TieredCompilation` on my Linux machine. 
I have 4 failures in `SctpChannel` and 3 failures in `CAInterop.java`, but they also fail on master branch so they should not be caused by this patch. Hopefully this adds a little more confidence. > > @caojoshua > > I also ran our internal testing and it looks ok (only unrelated failures). But of course that is only on tests that we have, and if the other reassociations are not tested, then that helps little ;) > >> Not that I know of. A bunch of reassociation was added in https://github.com/openjdk/jdk/commit/23ed3a9e91ac57295d274fefdf6c0a322b1e87b7, which does not have any tests. > > Could you please add a result verification test per case of pre-existing reassociation? Otherwise I'm afraid it is hard to be sure you did not break those cases. @eme64 thanks for reviewing. changes are in. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17375#issuecomment-2013257650 From bkilambi at openjdk.org Thu Mar 21 20:41:23 2024 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 21 Mar 2024 20:41:23 GMT Subject: RFR: 8320725: C2: Add "is_associative" flag for floating-point add-reduction [v3] In-Reply-To: References: Message-ID: On Mon, 18 Mar 2024 12:54:29 GMT, Emanuel Peter wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Naming changes: replace strict/non-strict with more technical terms > > src/hotspot/share/opto/vectornode.hpp line 270: > >> 268: // when it is auto-vectorized as auto-vectorization mandates the operation to be >> 269: // non-associative (strictly ordered). >> 270: bool _is_associative; > > Could this be a `const`? Hi, what is the reason to declare it as a `const`? It is declared as a `private` member variable with no "setter" function either. It is not easy to modify this value from any other part of the code anyway. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1534645192 From kvn at openjdk.org Thu Mar 21 22:50:31 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 21 Mar 2024 22:50:31 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store In-Reply-To: References: Message-ID: On Thu, 21 Mar 2024 14:31:17 GMT, Roland Westrelin wrote: > > ``` > > * No RangeCheck smearing, or other CFG between the stores: `RC[0], store[0], RC[1], store[1], RC[2], store[2], RC[3], store[3]`. Not so simple. We can merge the 4 stores on the normal path, where all RC's pass. But we have to remove all old stores from that path. But the `RC[1], RC[2], RC[3]` false paths need some of those stores. So the only way I see is to duplicate all stores for the branches, so that we are sure that they sink out into the trap-paths. > > ``` > > I also think you need to duplicate stores. My opinion is that we want to stick with the simpler cases (your first and second bullets) unless it's obvious it doesn't cover all use cases. It's always possible to revisit the optimization down the road if it's observed that there are cases that are not covered. I completely agree with Roland. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-2013981048 From ksakata at openjdk.org Fri Mar 22 00:59:24 2024 From: ksakata at openjdk.org (Koichi Sakata) Date: Fri, 22 Mar 2024 00:59:24 GMT Subject: RFR: 8320404: Double whitespace in SubTypeCheckNode::dump_spec output In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 07:41:29 GMT, Koichi Sakata wrote: > This is a trivial change to remove an extra whitespace. > > A double whitespace is printed because method->print_short_name already adds a whitespace before the name. > > ### Test > > For testing, I modified the ProfileAtTypeCheck class to fail a test case and display the message. 
Specifically, I changed the number of the count element in the IR annotation below. > > > @Test > @IR(phase = { CompilePhase.AFTER_PARSING }, counts = { IRNode.SUBTYPE_CHECK, "1" }) > @IR(phase = { CompilePhase.AFTER_MACRO_EXPANSION }, counts = { IRNode.CMP_P, "5", IRNode.LOAD_KLASS_OR_NKLASS, "2", IRNode.PARTIAL_SUBTYPE_CHECK, "1" }) > public static void test15(Object o) { > > > This change was only for testing, so I reverted back to the original code after the test. > > #### Execution Result > > Before the change: > > $ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java" > ... > Failed IR Rules (1) of Methods (1) > ---------------------------------- > 1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, ap > plyIfAnd={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\d+(\s){2}(SubTypeCheck.*)+(\s){2}===.*)" > - Failed comparison: [found] 1 = 11 [given] > - Matched node: > * 53 SubTypeCheck === _ 44 35 [[ 58 ]] profiled at: compiler.c2.irTests.ProfileAtTypeCheck::test15:5 !jvms: ProfileAtTypeCheck::test15 @ bci:5 (line 399) > > > After the change: > > $ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java" > ... 
> Failed IR Rules (1) of Methods (1) > ---------------------------------- > 1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, ap > plyIfAnd={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Cons... Thanks to both of you for taking the time to review this pull request! ------------- PR Comment: https://git.openjdk.org/jdk/pull/18181#issuecomment-2014125560 From ksakata at openjdk.org Fri Mar 22 00:59:24 2024 From: ksakata at openjdk.org (Koichi Sakata) Date: Fri, 22 Mar 2024 00:59:24 GMT Subject: Integrated: 8320404: Double whitespace in SubTypeCheckNode::dump_spec output In-Reply-To: References: Message-ID: <5w7DsB41XtIcbT3YrN8cf_C802RbAjUrwxAU2dADUxA=.e2293e13-5ab9-48c9-94c7-f528ee7233cb@github.com> On Mon, 11 Mar 2024 07:41:29 GMT, Koichi Sakata wrote: > This is a trivial change to remove an extra whitespace. > > A double whitespace is printed because method->print_short_name already adds a whitespace before the name. > > ### Test > > For testing, I modified the ProfileAtTypeCheck class to fail a test case and display the message. Specifically, I changed the number of the count element in the IR annotation below. > > > @Test > @IR(phase = { CompilePhase.AFTER_PARSING }, counts = { IRNode.SUBTYPE_CHECK, "1" }) > @IR(phase = { CompilePhase.AFTER_MACRO_EXPANSION }, counts = { IRNode.CMP_P, "5", IRNode.LOAD_KLASS_OR_NKLASS, "2", IRNode.PARTIAL_SUBTYPE_CHECK, "1" }) > public static void test15(Object o) { > > > This change was only for testing, so I reverted back to the original code after the test. 
> > #### Execution Result > > Before the change: > > $ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java" > ... > Failed IR Rules (1) of Methods (1) > ---------------------------------- > 1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, ap > plyIfAnd={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\d+(\s){2}(SubTypeCheck.*)+(\s){2}===.*)" > - Failed comparison: [found] 1 = 11 [given] > - Matched node: > * 53 SubTypeCheck === _ 44 35 [[ 58 ]] profiled at: compiler.c2.irTests.ProfileAtTypeCheck::test15:5 !jvms: ProfileAtTypeCheck::test15 @ bci:5 (line 399) > > > After the change: > > $ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java" > ... > Failed IR Rules (1) of Methods (1) > ---------------------------------- > 1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, ap > plyIfAnd={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Cons... This pull request has now been integrated. 
Changeset: da009214
Author:    Koichi Sakata
URL:       https://git.openjdk.org/jdk/commit/da009214f19f73965495b8462c9dcff5db8ae7ae
Stats:     2 lines in 1 file changed: 0 ins; 0 del; 2 mod

8320404: Double whitespace in SubTypeCheckNode::dump_spec output

Reviewed-by: chagedorn, thartmann

-------------

PR: https://git.openjdk.org/jdk/pull/18181

From epeter at openjdk.org  Fri Mar 22 05:43:25 2024
From: epeter at openjdk.org (Emanuel Peter)
Date: Fri, 22 Mar 2024 05:43:25 GMT
Subject: RFR: 8320725: C2: Add "is_associative" flag for floating-point
 add-reduction [v3]
In-Reply-To: 
References: 
Message-ID: 

On Thu, 21 Mar 2024 20:38:17 GMT, Bhavana Kilambi wrote:

>> src/hotspot/share/opto/vectornode.hpp line 270:
>> 
>>> 268: // when it is auto-vectorized as auto-vectorization mandates the operation to be
>>> 269: // non-associative (strictly ordered).
>>> 270: bool _is_associative;
>> 
>> Could this be a `const`?
> 
> Hi, what is the reason to declare it as a `const`? It is declared as a `private` member variable with no "setter" function either. It is not easy to modify this value from any other part of the code anyway.

Generally, it is better to declare things `const` if they can be. It tells the reader of the code that the field will never be changed. Even if things are fine now, a future contributor might misunderstand how the field is to be used, and start modifying it in an `Ideal` method for example. But if it is not simple to make it `const` for some reason, then don't do it.
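For readers less familiar with C++ `const` data members, here is a minimal standalone sketch of what the suggestion amounts to (a hypothetical class for illustration only, not HotSpot's actual node code):

```cpp
#include <cassert>

// Hypothetical stand-in for a reduction node; it only illustrates the
// 'const member' idiom under discussion, not the real class layout.
class ReductionSketch {
 private:
  const bool _is_associative;  // per-instance value, immutable after construction

 public:
  explicit ReductionSketch(bool is_associative)
      : _is_associative(is_associative) {}  // const members are set in the
                                            // constructor initializer list

  bool is_associative() const { return _is_associative; }

  // Something like '_is_associative = false;' in a later method would now be
  // a compile-time error, which is exactly the protection against future
  // misuse described above.
};
```

A `const` member still receives its value per object through the initializer list; it only rules out reassignment afterwards, so there is normally no cost to adding it when no setter exists.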
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1535079275

From rcastanedalo at openjdk.org  Fri Mar 22 07:48:22 2024
From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano)
Date: Fri, 22 Mar 2024 07:48:22 GMT
Subject: RFR: 8326438: C2: assert(ld->in(1)->Opcode() == Op_LoadN) failed:
 Assumption invalid: input to DecodeN is not LoadN
In-Reply-To: 
References: 
Message-ID: <9jm3my5poRfSNrLfDONCL6FzCK4M6qJmzLKKj4fIVlE=.1878c422-f9f6-4498-81b3-0acb2e176c75@github.com>

On Thu, 21 Mar 2024 15:31:58 GMT, Daniel Lundén wrote:

> The [`assert`](https://github.com/openjdk/jdk/blob/6f2676dc5f09d350c359f906b07f6f6d0d17f030/src/hotspot/share/opto/graphKit.cpp#L1567) added in [JDK-8310524](https://bugs.openjdk.org/browse/JDK-8310524) is too strong and may sometimes not hold due to the [GVN transformation in `LoadNode::make`](https://github.com/openjdk/jdk/blob/8cb9b479c529c058aee50f83920db650b0c18045/src/hotspot/share/opto/memnode.cpp#L973).
> 
> ### Changeset
> Remove the `assert`.
> 
> ### Testing
> N/A

Changes requested by rcastanedalo (Reviewer).

src/hotspot/share/opto/graphKit.cpp line 1566:

> 1564:   record_for_igvn(ld);
> 1565:   if (ld->is_DecodeN()) {
> 1566:     // Also record the actual load (LoadN) in case ld is DecodeN

This comment still reads as if we expect that `ld->in(1)` is necessarily a `LoadN`, which might mislead the reader. Could you extend it with a note that clarifies that in some corner cases `ld->in(1)` might be something else, e.g. a `Phi` like in this issue, but that is OK because it only means we might do unnecessary work during IGVN?
-------------

PR Review: https://git.openjdk.org/jdk/pull/18434#pullrequestreview-1954126431
PR Review Comment: https://git.openjdk.org/jdk/pull/18434#discussion_r1535166821

From dlunden at openjdk.org  Fri Mar 22 08:19:28 2024
From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=)
Date: Fri, 22 Mar 2024 08:19:28 GMT
Subject: RFR: 8326438: C2: assert(ld->in(1)->Opcode() == Op_LoadN) failed:
 Assumption invalid: input to DecodeN is not LoadN [v2]
In-Reply-To: 
References: 
Message-ID: 

> The [`assert`](https://github.com/openjdk/jdk/blob/6f2676dc5f09d350c359f906b07f6f6d0d17f030/src/hotspot/share/opto/graphKit.cpp#L1567) added in [JDK-8310524](https://bugs.openjdk.org/browse/JDK-8310524) is too strong and may sometimes not hold due to the [GVN transformation in `LoadNode::make`](https://github.com/openjdk/jdk/blob/8cb9b479c529c058aee50f83920db650b0c18045/src/hotspot/share/opto/memnode.cpp#L973).
> 
> ### Changeset
> Remove the `assert`.
> 
> ### Testing
> N/A

Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision:

  Elaborate in LoadN comment

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/18434/files
  - new: https://git.openjdk.org/jdk/pull/18434/files/c74178e7..fee1fff0

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18434&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18434&range=00-01

  Stats: 4 lines in 1 file changed: 3 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/18434.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18434/head:pull/18434

PR: https://git.openjdk.org/jdk/pull/18434

From dlunden at openjdk.org  Fri Mar 22 08:19:29 2024
From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=)
Date: Fri, 22 Mar 2024 08:19:29 GMT
Subject: RFR: 8326438: C2: assert(ld->in(1)->Opcode() == Op_LoadN) failed:
 Assumption invalid: input to DecodeN is not LoadN [v2]
In-Reply-To: 
<9jm3my5poRfSNrLfDONCL6FzCK4M6qJmzLKKj4fIVlE=.1878c422-f9f6-4498-81b3-0acb2e176c75@github.com> References: <9jm3my5poRfSNrLfDONCL6FzCK4M6qJmzLKKj4fIVlE=.1878c422-f9f6-4498-81b3-0acb2e176c75@github.com> Message-ID: <1Cnrd0WihYRapLjvkJEsE5kVPwCicNuDJgJTp1XPGU0=.9659c2d0-a70f-42ed-95b7-e5208bcc6419@github.com> On Fri, 22 Mar 2024 07:45:31 GMT, Roberto Castañeda Lozano wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Elaborate in LoadN comment > > src/hotspot/share/opto/graphKit.cpp line 1566: > >> 1564: record_for_igvn(ld); >> 1565: if (ld->is_DecodeN()) { >> 1566: // Also record the actual load (LoadN) in case ld is DecodeN > > This comment still reads as if we expect that `ld->in(1)` is necessarily a `LoadN`, which might mislead the reader. Could you extend it with a note that clarifies that in some corner cases `ld->in(1)` might be something else, e.g. a `Phi` like in this issue, but that is OK because it only means we might do unnecessary work during IGVN? Good suggestion, updated now. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18434#discussion_r1535198041 From dlunden at openjdk.org Fri Mar 22 08:26:23 2024 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 22 Mar 2024 08:26:23 GMT Subject: RFR: 8326438: C2: assert(ld->in(1)->Opcode() == Op_LoadN) failed: Assumption invalid: input to DecodeN is not LoadN [v2] In-Reply-To: References: Message-ID: On Thu, 21 Mar 2024 17:16:19 GMT, Vladimir Kozlov wrote: > Good. > > > Add label to JBS something `test*hard`. I don't remember its exact name. > > Never mind I see you did it already. Thanks @vnkozlov. I've also documented my findings when trying to reproduce the issue in the issue description. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18434#issuecomment-2014595254 From rcastanedalo at openjdk.org Fri Mar 22 08:33:21 2024 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 22 Mar 2024 08:33:21 GMT Subject: RFR: 8326438: C2: assert(ld->in(1)->Opcode() == Op_LoadN) failed: Assumption invalid: input to DecodeN is not LoadN [v2] In-Reply-To: References: Message-ID: On Fri, 22 Mar 2024 08:19:28 GMT, Daniel Lundén wrote: >> The [`assert`](https://github.com/openjdk/jdk/blob/6f2676dc5f09d350c359f906b07f6f6d0d17f030/src/hotspot/share/opto/graphKit.cpp#L1567) added in [JDK-8310524](https://bugs.openjdk.org/browse/JDK-8310524) is too strong and may sometimes not hold due to the [GVN transformation in `LoadNode::make`](https://github.com/openjdk/jdk/blob/8cb9b479c529c058aee50f83920db650b0c18045/src/hotspot/share/opto/memnode.cpp#L973). >> >> ### Changeset >> Remove the `assert`. >> >> ### Testing >> N/A > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Elaborate in LoadN comment Looks good, thanks! ------------- Marked as reviewed by rcastanedalo (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18434#pullrequestreview-1954216138 From epeter at openjdk.org Fri Mar 22 09:18:26 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 22 Mar 2024 09:18:26 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v8] In-Reply-To: References: Message-ID: On Wed, 20 Mar 2024 02:05:56 GMT, Jasmine Karthikeyan wrote: >> Hi all, I've created this patch which aims to convert common integer minimum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`. 
While this transform doesn't generally improve code generation on its own, it simplifies control flow and creates new opportunities for vectorization. >> >> I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> IfMinMax.testReductionInt avgt 15 500.307 ± 16.687 ns/op 509.383 ± 32.645 ns/op (no change)* >> IfMinMax.testReductionLong avgt 15 493.184 ± 17.596 ns/op 513.587 ± 28.339 ns/op (no change)* >> IfMinMax.testSingleInt avgt 15 3.588 ± 0.540 ns/op 2.965 ± 1.380 ns/op (no change) >> IfMinMax.testSingleLong avgt 15 3.673 ± 0.128 ns/op 3.506 ± 0.590 ns/op (no change) >> IfMinMax.testVectorInt avgt 15 340.425 ± 13.123 ns/op 59.689 ± 7.509 ns/op + 5.7x >> IfMinMax.testVectorLong avgt 15 326.420 ± 15.554 ns/op 117.190 ± 5.622 ns/op + 2.8x >> >> >> * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? >> >> The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Apply changes from code review and add IR test for vectorization and reduction Tests pass -> Approved! Thanks for doing all the great work @jaskarth ! Looking forward to what you are doing next ;) ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17574#pullrequestreview-1954306867 From rehn at openjdk.org Fri Mar 22 10:28:24 2024 From: rehn at openjdk.org (Robbin Ehn) Date: Fri, 22 Mar 2024 10:28:24 GMT Subject: RFR: 8328614: hsdis: dlsym can't find decode symbol [v2] In-Reply-To: References: Message-ID: On Thu, 21 Mar 2024 06:58:43 GMT, Robbin Ehn wrote: >> Hi, please consider. 
>> >> [8327045](https://bugs.openjdk.org/browse/JDK-8327045) hide these symbols. >> Tested with gcc and clang, and llvm and binutils backend. >> >> I didn't find any use of the "DLL_ENTRY", so I removed it. >> >> Thanks, Robbin > > Robbin Ehn has updated the pull request incrementally with one additional commit since the last revision: > > remove swap file @magicus is this how you want us to export these symbols? ------------- PR Comment: https://git.openjdk.org/jdk/pull/18400#issuecomment-2014786824 From ihse at openjdk.org Fri Mar 22 11:48:23 2024 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Fri, 22 Mar 2024 11:48:23 GMT Subject: RFR: 8328614: hsdis: dlsym can't find decode symbol [v2] In-Reply-To: References: Message-ID: On Fri, 22 Mar 2024 10:25:33 GMT, Robbin Ehn wrote: > is this how you want us to export these symbols? Close but no cigar. :-) Use `JNIEXPORT` instead, that is properly defined for this purpose and works on all compilers. You will need to also add: #include "jni.h" If this is not picked up correctly, let me know and I'll help you get the include paths correctly in the build. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18400#issuecomment-2014907958 From ihse at openjdk.org Fri Mar 22 11:48:24 2024 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Fri, 22 Mar 2024 11:48:24 GMT Subject: RFR: 8328614: hsdis: dlsym can't find decode symbol [v2] In-Reply-To: References: Message-ID: On Thu, 21 Mar 2024 06:58:43 GMT, Robbin Ehn wrote: >> Hi, please consider. >> >> [8327045](https://bugs.openjdk.org/browse/JDK-8327045) hide these symbols. >> Tested with gcc and clang, and llvm and binutils backend. >> >> I didn't find any use of the "DLL_ENTRY", so I removed it. >> >> Thanks, Robbin > > Robbin Ehn has updated the pull request incrementally with one additional commit since the last revision: > > remove swap file Also, apologies for forgetting to check hsdis when I changed the visibility. 
:-( ------------- PR Comment: https://git.openjdk.org/jdk/pull/18400#issuecomment-2014910765 From rehn at openjdk.org Fri Mar 22 13:41:22 2024 From: rehn at openjdk.org (Robbin Ehn) Date: Fri, 22 Mar 2024 13:41:22 GMT Subject: RFR: 8328614: hsdis: dlsym can't find decode symbol [v2] In-Reply-To: References: Message-ID: On Fri, 22 Mar 2024 11:43:34 GMT, Magnus Ihse Bursie wrote: > > is this how you want us to export these symbols? > > Close but no cigar. :-) > > Use `JNIEXPORT` instead, that is properly defined for this purpose and works on all compilers. You will need to also add: > > ``` > #include "jni.h" > ``` > > If this is not picked up correctly, let me know and I'll help you get the include paths correctly in the build. It's stand alone library, should we really make it dependent on the JDK? ------------- PR Comment: https://git.openjdk.org/jdk/pull/18400#issuecomment-2015129906 From ihse at openjdk.org Fri Mar 22 14:13:22 2024 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Fri, 22 Mar 2024 14:13:22 GMT Subject: RFR: 8328614: hsdis: dlsym can't find decode symbol [v2] In-Reply-To: References: Message-ID: On Thu, 21 Mar 2024 06:58:43 GMT, Robbin Ehn wrote: >> Hi, please consider. >> >> [8327045](https://bugs.openjdk.org/browse/JDK-8327045) hide these symbols. >> Tested with gcc and clang, and llvm and binutils backend. >> >> I didn't find any use of the "DLL_ENTRY", so I removed it. >> >> Thanks, Robbin > > Robbin Ehn has updated the pull request incrementally with one additional commit since the last revision: > > remove swap file Unless you start calling JDK functions, you will not make a program less stand-alone by including jni.h. In this case, you will only use a compile-time definition. Ideally, we should have had more general EXPORT definitions separate from the rest of the JNI code, but someone started doing things that way, and well, here we are, 25 years later and now JNIEXPORT is everywhere in the JDK source base. 
:( I'd say that access to JNIEXPORT is about 50% of the reason jni.h is included... ------------- PR Comment: https://git.openjdk.org/jdk/pull/18400#issuecomment-2015191729 From chagedorn at openjdk.org Fri Mar 22 16:09:23 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 22 Mar 2024 16:09:23 GMT Subject: RFR: 8321278: C2: Partial peeling fails with assert "last_peel <- first_not_peeled" In-Reply-To: References: Message-ID: On Mon, 18 Mar 2024 17:15:38 GMT, Roland Westrelin wrote: > The assert fails because peeling happens at a single entry > `Region`. That `Region` only has a single input because other inputs > were found unreachable and removed by > `PhaseIdealLoop::Dominators()`. The fix I propose is to have > `PhaseIdealLoop::Dominators()` remove the `Region` and its `Phi`s > entirely in this case. src/hotspot/share/opto/domgraph.cpp line 512: > 510: remove_single_entry_region(t, tdom, dom, _igvn); > 511: } > 512: _idom[t->_control->_idx] = dom; // Set immediate dominator Removing the regions during `Dominators()` seems reasonable. I guess doing a pass of IGVN after `Dominators()` is probably too much to get these regions removed? Could you also remove the region and the phi where the unreachable loops are cleaned up and the region and phis become single-entry nodes? I.e. here: https://github.com/openjdk/jdk/blob/ce7ebaa606f96fdfee66d300b56022d9903b5ae3/src/hotspot/share/opto/domgraph.cpp#L453-L463 test/hotspot/jtreg/compiler/loopopts/TestPartialPeelingAtSingleInputRegion.java line 2: > 1: /* > 2: * Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved. Suggestion: * Copyright (c) 2024, Oracle and/or its affiliates. All rights reserved. test/hotspot/jtreg/compiler/loopopts/TestPartialPeelingAtSingleInputRegion.java line 51: > 49: > 50: public static void main(String[] args) { > 51: for (int i = 0; i < 50_000; ++i) { Just a minor thing: Do you really need 50000 iterations or would fewer be sufficient to trigger the bug? 
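[Archive note: for readers unfamiliar with these regression tests, they typically just run the affected code in a counted loop so the method crosses C2's compilation thresholds, which is why the iteration count is discussed here. The sketch below is a generic, illustrative shape with made-up names, not the actual TestPartialPeelingAtSingleInputRegion test.]

```java
public class WarmupShape {
    static int sink;

    // Stand-in for a method exercising the compiler optimization under test.
    static int test(int i) {
        return (i % 7 == 0) ? i * 2 : i - 1;
    }

    public static void main(String[] args) {
        // A few thousand iterations are usually enough to reach the JIT
        // compilation thresholds; larger counts mostly add safety margin.
        for (int i = 0; i < 10_000; ++i) {
            sink = test(i);
        }
        System.out.println(sink);
    }
}
```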
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18353#discussion_r1535828422 PR Review Comment: https://git.openjdk.org/jdk/pull/18353#discussion_r1535822221 PR Review Comment: https://git.openjdk.org/jdk/pull/18353#discussion_r1535823273 From chagedorn at openjdk.org Fri Mar 22 16:28:25 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 22 Mar 2024 16:28:25 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If In-Reply-To: References: Message-ID: On Thu, 21 Mar 2024 13:41:49 GMT, Emanuel Peter wrote: >> This is a follow-up to the previous refactoring done in https://github.com/openjdk/jdk/pull/18080. The patch starts to replace the usages of `create_bool_from_template_assertion_predicate()` by providing a refactored and fixed cloning algorithm. >> >> #### How `create_bool_from_template_assertion_predicate()` Works >> Currently, the algorithm in `create_bool_from_template_assertion_predicate()` uses an iterative DFS walk to find all nodes of a Template Assertion Predicate Expression in order to clone them. We do the following: >> 1. Follow all inputs if they could be a node that's part of a Template Assertion Predicate (compares opcodes): >> https://github.com/openjdk/jdk/blob/326c91e1a28ec70822ef927ee9ab17f79aa6d35c/src/hotspot/share/opto/loopTransform.cpp#L1513 >> >> 2. Once we find an `OpaqueLoopInit` or `OpaqueLoopStride` node, we start backtracking in the DFS. While doing so, we start to clone all nodes on the path from the `OpaqueLoop*Nodes` node to the start node and already update the graph. This logic is quite complex and difficult to understand since we do everything simultaneously. This was one of the reasons, I've originally tried to refactor this method in https://github.com/openjdk/jdk/pull/16877 because I needed to extend it for the full fix of Assertion Predicates in JDK-8288981. 
>> >> #### Missing Visited Set >> The current implementation of `create_bool_from_template_assertion_predicate()` does not use a visited set. This means that whenever we find a diamond shape, we could visit a node twice and re-discover all paths above this diamond again: >> >> >> ... >> | >> E >> | >> D >> / \ >> B C >> \ / >> A >> >> DFS walk: A -> B -> D -> E -> ... -> C -> D -> E -> ... >> >> With each diamond, the number of revisits of each node above doubles. >> >> #### Endless DFS in Edge-Cases >> In most cases, we would normally just stop quite quickly once we follow a data node that is not part of a Template Assertion Predicate Expression because the node opcode is different. However, in the test cases, we create a long chain of data nodes with many diamonds that could all be part of a Template Assertion Predicate Expression (i.e. `is_part_of_template_assertion_predicate_bool()` would return true to follow the inputs in a DFS walk). As a result, the DFS revisits a lot of nodes, especially higher up in the graph, exponentially many times and compilation is stuck for a long time (running the test cases result in a test timeout because... > > This looks fantastic, great work :) Thanks a lot for your review and comments @eme64! I will get back to them next week :-) ------------- PR Comment: https://git.openjdk.org/jdk/pull/18293#issuecomment-2015454964 From roland at openjdk.org Fri Mar 22 16:33:30 2024 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 22 Mar 2024 16:33:30 GMT Subject: RFR: 8324121: SIGFPE in PhaseIdealLoop::extract_long_range_checks [v2] In-Reply-To: References: Message-ID: > Both failures occur because `ABS(scale * stride_con)` overflows (scale > a really large long number). I reworked the test so overflow is no > longer an issue. 
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18397/files - new: https://git.openjdk.org/jdk/pull/18397/files/6890e385..d4cddf82 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18397&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18397&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18397.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18397/head:pull/18397 PR: https://git.openjdk.org/jdk/pull/18397 From roland at openjdk.org Fri Mar 22 16:37:22 2024 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 22 Mar 2024 16:37:22 GMT Subject: RFR: 8324121: SIGFPE in PhaseIdealLoop::extract_long_range_checks [v2] In-Reply-To: References: Message-ID: On Wed, 20 Mar 2024 18:49:33 GMT, Dean Long wrote: >> src/hotspot/share/opto/loopnode.cpp line 1110: >> >>> 1108: if (loop->is_range_check_if(if_proj, this, T_LONG, phi, range, offset, scale) && >>> 1109: loop->is_invariant(range) && loop->is_invariant(offset) && >>> 1110: original_iters_limit / ABS(scale) >= min_iters * ABS(stride_con)) { >> >> I assume there is check somewhere that `stride_con` is not `MIN_INT`. > > In my opinion ABS() should assert that it has legal input (not MIN_INT) and output (non-negative value) in debug builds. Thanks for reviewing this. We can't get to this code if `stride_con` is `MIN_INT` because some other condition (that doesn't explicitly check that `stride_con` is not `MIN_INT`) causes a bail out from the transformation. I added an explicit bail out in that case in a new commit anyway to make the code more robust. 
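[Archive note: the overflow hazard discussed above is easy to demonstrate at the Java level, where `Math.abs` has the same two's-complement wrap-around for the minimum value. This is an analogy only, not the C2 code.]

```java
public class AbsOverflowDemo {
    public static void main(String[] args) {
        // -Long.MIN_VALUE is not representable as a positive long,
        // so abs returns the value unchanged (still negative):
        System.out.println(Math.abs(Long.MIN_VALUE)); // prints -9223372036854775808
        // A product like scale * stride can also wrap before abs is even
        // applied, which is why guarding with a division
        // (limit / abs(scale) >= minIters * abs(stride)) sidesteps the overflow.
        long scale = Long.MAX_VALUE / 2 + 1; // 2^62
        long stride = 4;
        System.out.println(scale * stride);  // wraps around to 0
    }
}
```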
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18397#discussion_r1535866603 From roland at openjdk.org Fri Mar 22 16:37:49 2024 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 22 Mar 2024 16:37:49 GMT Subject: RFR: 8321278: C2: Partial peeling fails with assert "last_peel <- first_not_peeled" [v2] In-Reply-To: References: Message-ID: > The assert fails because peeling happens at a single entry > `Region`. That `Region` only has a single input because other inputs > were found unreachable and removed by > `PhaseIdealLoop::Dominators()`. The fix I propose is to have > `PhaseIdealLoop::Dominators()` remove the `Region` and its `Phi`s > entirely in this case. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/loopopts/TestPartialPeelingAtSingleInputRegion.java Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18353/files - new: https://git.openjdk.org/jdk/pull/18353/files/5cbab303..bb2cb9ea Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18353&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18353&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18353.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18353/head:pull/18353 PR: https://git.openjdk.org/jdk/pull/18353 From roland at openjdk.org Fri Mar 22 16:43:22 2024 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 22 Mar 2024 16:43:22 GMT Subject: RFR: 8321278: C2: Partial peeling fails with assert "last_peel <- first_not_peeled" [v2] In-Reply-To: References: Message-ID: On Fri, 22 Mar 2024 16:06:00 GMT, Christian Hagedorn wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/hotspot/jtreg/compiler/loopopts/TestPartialPeelingAtSingleInputRegion.java >> >> Co-authored-by: 
Christian Hagedorn > > src/hotspot/share/opto/domgraph.cpp line 512: > >> 510: remove_single_entry_region(t, tdom, dom, _igvn); >> 511: } >> 512: _idom[t->_control->_idx] = dom; // Set immediate dominator > > Removing the regions during `Dominators()` seems reasonable. I guess doing a pass of IGVN after `Dominators()` is probably too much to get these regions removed? > > Could you also remove the region and the phi where the unreachable loops are cleaned up and the region and phis become single-entry nodes? I.e. here: > https://github.com/openjdk/jdk/blob/ce7ebaa606f96fdfee66d300b56022d9903b5ae3/src/hotspot/share/opto/domgraph.cpp#L453-L463 Thanks for reviewing this. > Removing the regions during `Dominators()` seems reasonable. I guess doing a pass of IGVN after `Dominators()` is probably too much to get these regions removed? It does feel like a lot of overhead for such a simple corner case. > Could you also remove the region and the phi where the unreachable loops are cleaned up and the region and phis become single-entry nodes? I.e. here: I considered it but it felt harder. The algorithm collects cfg nodes in dfs and then iterate over them several times. If we remove a region at the point you mention, it would need to be removed from the dfs node list. Or we would need to make later iterations over the dfs node list handle dead region nodes. It's not a problem when it's done later as I propose here. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18353#discussion_r1535873143 From roland at openjdk.org Fri Mar 22 16:48:35 2024 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 22 Mar 2024 16:48:35 GMT Subject: RFR: 8321278: C2: Partial peeling fails with assert "last_peel <- first_not_peeled" [v3] In-Reply-To: References: Message-ID: > The assert fails because peeling happens at a single entry > `Region`. That `Region` only has a single input because other inputs > were found unreachable and removed by > `PhaseIdealLoop::Dominators()`. 
The fix I propose is to have > `PhaseIdealLoop::Dominators()` remove the `Region` and its `Phi`s > entirely in this case. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18353/files - new: https://git.openjdk.org/jdk/pull/18353/files/bb2cb9ea..d201aee2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18353&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18353&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18353.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18353/head:pull/18353 PR: https://git.openjdk.org/jdk/pull/18353 From roland at openjdk.org Fri Mar 22 16:48:35 2024 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 22 Mar 2024 16:48:35 GMT Subject: RFR: 8321278: C2: Partial peeling fails with assert "last_peel <- first_not_peeled" [v3] In-Reply-To: References: Message-ID: <1rP6BIno3l-v6feJH0bbvwUGkizZ91CiNU0uyGngeME=.df2654b9-1ba2-421a-9712-c02425c642af@github.com> On Fri, 22 Mar 2024 16:01:46 GMT, Christian Hagedorn wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> review > > test/hotspot/jtreg/compiler/loopopts/TestPartialPeelingAtSingleInputRegion.java line 51: > >> 49: >> 50: public static void main(String[] args) { >> 51: for (int i = 0; i < 50_000; ++i) { > > Just a minor thing: Do you really need 50000 iterations or would fewer be sufficient to trigger the bug? Good catch. It reproduces with 10000. I pushed a commit that makes that change. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18353#discussion_r1535880857 From kvn at openjdk.org Fri Mar 22 16:56:27 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 22 Mar 2024 16:56:27 GMT Subject: RFR: 8324121: SIGFPE in PhaseIdealLoop::extract_long_range_checks [v2] In-Reply-To: References: Message-ID: On Fri, 22 Mar 2024 16:33:30 GMT, Roland Westrelin wrote: >> Both failures occur because `ABS(scale * stride_con)` overflows (scale >> a really large long number). I reworked the test so overflow is no >> longer an issue. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18397#pullrequestreview-1955306690 From jkarthikeyan at openjdk.org Fri Mar 22 18:46:31 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 22 Mar 2024 18:46:31 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v8] In-Reply-To: References: Message-ID: On Wed, 20 Mar 2024 02:05:56 GMT, Jasmine Karthikeyan wrote: >> Hi all, I've created this patch which aims to convert common integer mininum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`. While this transform doesn't generally improve code generation it's own, it simplifies control flow and creates new opportunities for vectorization. >> >> I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> IfMinMax.testReductionInt avgt 15 500.307 ? 16.687 ns/op 509.383 ? 32.645 ns/op (no change)* >> IfMinMax.testReductionLong avgt 15 493.184 ? 17.596 ns/op 513.587 ? 
28.339 ns/op (no change)* >> IfMinMax.testSingleInt avgt 15 3.588 ? 0.540 ns/op 2.965 ? 1.380 ns/op (no change) >> IfMinMax.testSingleLong avgt 15 3.673 ? 0.128 ns/op 3.506 ? 0.590 ns/op (no change) >> IfMinMax.testVectorInt avgt 15 340.425 ? 13.123 ns/op 59.689 ? 7.509 ns/op + 5.7x >> IfMinMax.testVectorLong avgt 15 326.420 ? 15.554 ns/op 117.190 ? 5.622 ns/op + 2.8x >> >> >> * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? >> >> The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Apply changes from code review and add IR test for vectorization and reduction Great, thanks again for the thorough review and the very interesting discussion! It's motivated me to understand the CMove heuristic better, so I may look at that next :) I'll integrate this patch after a second review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-2015702469 From duke at openjdk.org Fri Mar 22 18:48:56 2024 From: duke at openjdk.org (Joshua Cao) Date: Fri, 22 Mar 2024 18:48:56 GMT Subject: RFR: 8323220: Reassociate loop invariants involved in Cmps and Add/Subs [v14] In-Reply-To: References: Message-ID: > // inv1 == (x + inv2) => ( inv1 - inv2 ) == x > // inv1 == (x - inv2) => ( inv1 + inv2 ) == x > // inv1 == (inv2 - x) => (-inv1 + inv2 ) == x > > > For example, > > > fn(inv1, inv2) > while(...) > x = foobar() > if inv1 == x + inv2 > blackhole() > > > We can transform this into > > > fn(inv1, inv2) > t = inv1 - inv2 > while(...) 
> x = foobar() > if t == x > blackhole() > > > Here is an example: https://github.com/openjdk/jdk/blob/b78896b9aafcb15f453eaed6e154a5461581407b/src/java.base/share/classes/java/lang/invoke/LambdaFormEditor.java#L910. LHS `1` and RHS `pos` are both loop invariant > > Passes tier1 locally on Linux machine. Passes GHA on my fork. Joshua Cao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 17 additional commits since the last revision: - Merge branch 'master' into licm - @run driver -> @run main - Add tests for add/sub reassociation - Merge branch 'master' into licm - Make inputs deterministic. Make size an arg. Fix comments. Formatting. - Update test to utilize @setup method for arguments - Merge branch 'master' into licm - Add correctness test for some random tests with random inputs - Add some correctness tests where we do reassociate - Remove unused TestInfo parameter. Have some tests exit mid-loop. - ... 
and 7 more: https://git.openjdk.org/jdk/compare/aac43a2d...32cb9c0d ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17375/files - new: https://git.openjdk.org/jdk/pull/17375/files/33e34b03..32cb9c0d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17375&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17375&range=12-13 Stats: 55614 lines in 2299 files changed: 8712 ins; 5522 del; 41380 mod Patch: https://git.openjdk.org/jdk/pull/17375.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17375/head:pull/17375 PR: https://git.openjdk.org/jdk/pull/17375 From qamai at openjdk.org Fri Mar 22 20:17:26 2024 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 22 Mar 2024 20:17:26 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v8] In-Reply-To: References: Message-ID: On Wed, 20 Mar 2024 02:05:56 GMT, Jasmine Karthikeyan wrote: >> Hi all, I've created this patch which aims to convert common integer mininum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`. While this transform doesn't generally improve code generation it's own, it simplifies control flow and creates new opportunities for vectorization. >> >> I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> IfMinMax.testReductionInt avgt 15 500.307 ? 16.687 ns/op 509.383 ? 32.645 ns/op (no change)* >> IfMinMax.testReductionLong avgt 15 493.184 ? 17.596 ns/op 513.587 ? 28.339 ns/op (no change)* >> IfMinMax.testSingleInt avgt 15 3.588 ? 0.540 ns/op 2.965 ? 1.380 ns/op (no change) >> IfMinMax.testSingleLong avgt 15 3.673 ? 0.128 ns/op 3.506 ? 0.590 ns/op (no change) >> IfMinMax.testVectorInt avgt 15 340.425 ? 13.123 ns/op 59.689 ? 
7.509 ns/op + 5.7x >> IfMinMax.testVectorLong avgt 15 326.420 ? 15.554 ns/op 117.190 ? 5.622 ns/op + 2.8x >> >> >> * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? >> >> The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Apply changes from code review and add IR test for vectorization and reduction LGTM, thanks a lot! ------------- Marked as reviewed by qamai (Committer). PR Review: https://git.openjdk.org/jdk/pull/17574#pullrequestreview-1955715130 From jkarthikeyan at openjdk.org Fri Mar 22 21:05:27 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 22 Mar 2024 21:05:27 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: <_hXaYI6ApXK95GMVGEMub-NFOQ7ifLh702Je3R5-FT8=.ae3a94b9-e58c-4f33-9651-8ba4a45c781e@github.com> References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> <_hXaYI6ApXK95GMVGEMub-NFOQ7ifLh702Je3R5-FT8=.ae3a94b9-e58c-4f33-9651-8ba4a45c781e@github.com> Message-ID: <45wVCE4Y2XOB3iyuyDK8Vv0v1Ll1sZMxdUYTHOPEzCk=.6315deeb-f826-4f50-b595-0f4f65b347ef@github.com> On Tue, 27 Feb 2024 17:31:44 GMT, Quan Anh Mai wrote: >> @merykitty I am thinking of a case like this: >> >> a = >> b = >> x = (a < b) ? a : b; >> >> If in most cases we take `b` (because it is larger), then we might speculatively assign `x = b`, before we have finished computing `a`. That way we can already continue (speculatively) with `x = b`, while `a` is still computing. If the speculation is wrong, then the CPU flushes the pipeline. >> >> If this is converted to `max/min`, then we need to wait for `a` to be computed. 
>> >> @merykitty @jaskarth does this make sense as an example for a potential performance regression? > > @eme64 Thanks for making that clear. I'm asking @jaskarth if it is easier to transform a `CMove` into a `Min`/`Max` instead of trying to look for matching `Phi`s. Thanks for the review, @merykitty! ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-2015910313 From sgibbons at openjdk.org Sat Mar 23 02:16:57 2024 From: sgibbons at openjdk.org (Scott Gibbons) Date: Sat, 23 Mar 2024 02:16:57 GMT Subject: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v14] In-Reply-To: References: Message-ID: > Re-write the IndexOf code without the use of the pcmpestri instruction, only using AVX2 instructions. This change accelerates String.IndexOf on average 1.3x for AVX2. The benchmark numbers: > > > Benchmark Score Latest > StringIndexOf.advancedWithMediumSub 343.573 317.934 0.925375393x > StringIndexOf.advancedWithShortSub1 1039.081 1053.96 1.014319384x > StringIndexOf.advancedWithShortSub2 55.828 110.541 1.980027943x > StringIndexOf.constantPattern 9.361 11.906 1.271872663x > StringIndexOf.searchCharLongSuccess 4.216 4.218 1.000474383x > StringIndexOf.searchCharMediumSuccess 3.133 3.216 1.02649218x > StringIndexOf.searchCharShortSuccess 3.76 3.761 1.000265957x > StringIndexOf.success 9.186 9.713 1.057369911x > StringIndexOf.successBig 14.341 46.343 3.231504079x > StringIndexOfChar.latin1_AVX2_String 6220.918 12154.52 1.953814533x > StringIndexOfChar.latin1_AVX2_char 5503.556 5540.044 1.006629895x > StringIndexOfChar.latin1_SSE4_String 6978.854 6818.689 0.977049957x > StringIndexOfChar.latin1_SSE4_char 5657.499 5474.624 0.967675646x > StringIndexOfChar.latin1_Short_String 7132.541 6863.359 0.962260014x > StringIndexOfChar.latin1_Short_char 16013.389 16162.437 1.009307711x > StringIndexOfChar.latin1_mixed_String 7386.123 14771.622 1.999915517x > StringIndexOfChar.latin1_mixed_char 9901.671 9782.245 0.987938803 Scott Gibbons has updated the pull 
request with a new target base due to a merge or a rebase. The pull request now contains 46 commits: - Merge branch 'openjdk:master' into indexof - Cleaned up, ready for review - Pre-cleanup code - Add JMH. Add 16-byte compares to arrays_equals - Better method for mask creation - Merge branch 'openjdk:master' into indexof - Most cleanup done. - Remove header dependency - Works - needs cleanup - Passes tests. - ... and 36 more: https://git.openjdk.org/jdk/compare/bc739639...e079fc12 ------------- Changes: https://git.openjdk.org/jdk/pull/16753/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16753&range=13 Stats: 4905 lines in 19 files changed: 4551 ins; 241 del; 113 mod Patch: https://git.openjdk.org/jdk/pull/16753.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16753/head:pull/16753 PR: https://git.openjdk.org/jdk/pull/16753 From sgibbons at openjdk.org Sat Mar 23 02:16:58 2024 From: sgibbons at openjdk.org (Scott Gibbons) Date: Sat, 23 Mar 2024 02:16:58 GMT Subject: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v13] In-Reply-To: References: Message-ID: On Thu, 22 Feb 2024 03:15:10 GMT, Scott Gibbons wrote: >> Re-write the IndexOf code without the use of the pcmpestri instruction, only using AVX2 instructions. This change accelerates String.IndexOf on average 1.3x for AVX2. 
The benchmark numbers: >> >> >> Benchmark Score Latest >> StringIndexOf.advancedWithMediumSub 343.573 317.934 0.925375393x >> StringIndexOf.advancedWithShortSub1 1039.081 1053.96 1.014319384x >> StringIndexOf.advancedWithShortSub2 55.828 110.541 1.980027943x >> StringIndexOf.constantPattern 9.361 11.906 1.271872663x >> StringIndexOf.searchCharLongSuccess 4.216 4.218 1.000474383x >> StringIndexOf.searchCharMediumSuccess 3.133 3.216 1.02649218x >> StringIndexOf.searchCharShortSuccess 3.76 3.761 1.000265957x >> StringIndexOf.success 9.186 9.713 1.057369911x >> StringIndexOf.successBig 14.341 46.343 3.231504079x >> StringIndexOfChar.latin1_AVX2_String 6220.918 12154.52 1.953814533x >> StringIndexOfChar.latin1_AVX2_char 5503.556 5540.044 1.006629895x >> StringIndexOfChar.latin1_SSE4_String 6978.854 6818.689 0.977049957x >> StringIndexOfChar.latin1_SSE4_char 5657.499 5474.624 0.967675646x >> StringIndexOfChar.latin1_Short_String 7132.541 6863.359 0.962260014x >> StringIndexOfChar.latin1_Short_char 16013.389 16162.437 1.009307711x >> StringIndexOfChar.latin1_mixed_String 7386.123 14771.622 1.999915517x >> StringIndexOfChar.latin1_mixed_char 9901.671 9782.245 0.987938803 > > Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision: > > Addressed some review comments; replaced hard-coded registers with descriptive names. Re-opening this PR after some insightful comments, which resulted in a re-design. Now ready for review. Now showing a ~1.6x performance gain on the original StringIndexOf benchmark. I added a benchmark (StringIndexOfHuge) that measures performance on large-ish strings (e.g., a 34-byte substring within a 2052-byte string). This benchmark performed on average ~14x the original. Sorry for the large change, but it couldn't be done piecemeal.
------------- PR Comment: https://git.openjdk.org/jdk/pull/16753#issuecomment-2016308399 From gcao at openjdk.org Sat Mar 23 08:44:53 2024 From: gcao at openjdk.org (Gui Cao) Date: Sat, 23 Mar 2024 08:44:53 GMT Subject: RFR: 8328404: RISC-V: Fix potential crash in C2_MacroAssembler::arrays_equals [v2] In-Reply-To: References: Message-ID: > Hi, the current behavior of C2_MacroAssembler::arrays_equals is to always load a longword before comparison. > When array[0] is aligned to 32 bits (especially after JDK-8139457, which tries to relax the alignment > of array elements), the last longword load will exceed the array limit and may touch the next > word beyond the object layout in heap memory. So this should bear a similar problem to JDK-8328138. > > The proposed fix changes this behavior and aligns with the handling in C2_MacroAssembler::string_equals, > which will check the number of remaining array elements before loading the next longword. > No obvious changes witnessed from the JMH numbers or benchmarks like SPECjbb2015. > > The patch also removed the AvoidUnalignedAccesses check in C2_MacroAssembler::string_equals, as we > don't see extra performance gain when setting AvoidUnalignedAccesses to false when testing the > JMH tests or benchmarks like SPECjbb2015 on three popular RISC-V hardware platforms. We can > consider adding it back if it turns out to be useful on future new hardware. > > > ### Correctness test: > - [x] Run tier1-3, hotspot:tier4 tests on LicheePi 4A (release) > - [x] Run tier1-3, hotspot:tier4 tests on SOPHON SG2042 (release) > > > ### JMH test: > > #### 1. test/micro/org/openjdk/bench/java/util/ArraysEquals.java > 1. SiFive unmatched > > Before: > Benchmark Mode Cnt Score Error Units > ArraysEquals.testByteFalseBeginning avgt 12 37.804 ± 7.292 ns/op > ArraysEquals.testByteFalseEnd avgt 12 77.972 ± 3.208 ns/op > ArraysEquals.testByteFalseMid avgt 12 54.427 ± 6.436 ns/op > ArraysEquals.testByteTrue avgt 12 75.121 ± 5.172 ns/op > ArraysEquals.testCharFalseBeginning avgt 12 42.486 ± 6.526 ns/op > ArraysEquals.testCharFalseEnd avgt 12 122.208 ± 2.533 ns/op > ArraysEquals.testCharFalseMid avgt 12 83.891 ± 3.680 ns/op > ArraysEquals.testCharTrue avgt 12 122.096 ± 5.519 ns/op > > After: > Benchmark Mode Cnt Score Error Units > ArraysEquals.testByteFalseBeginning avgt 12 32.638 ± 7.279 ns/op > ArraysEquals.testByteFalseEnd avgt 12 73.013 ± 8.081 ns/op > ArraysEquals.testByteFalseMid avgt 12 43.619 ± 6.104 ns/op > ArraysEquals.testByteTrue avgt 12 83.044 ± 8.207 ns/op > ArraysEquals.testCharFalseBeginning avgt 12 39.154 ± 5.233 ns/op > ArraysEquals.testCharFalseEnd avgt 12 122.072 ± 7.784 ns/op > ArraysEquals.testCharFalseMid avgt 12 67.831 ± 9.218 ns/op > Ar... Gui Cao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge remote-tracking branch 'upstream/master' into JDK-8328404 - 8328404: RISC-V: Fix potential crash in C2_MacroAssembler::arrays_equals ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18370/files - new: https://git.openjdk.org/jdk/pull/18370/files/8824e1c6..baaae2cc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18370&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18370&range=00-01 Stats: 60167 lines in 2421 files changed: 10903 ins; 7502 del; 41762 mod Patch: https://git.openjdk.org/jdk/pull/18370.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18370/head:pull/18370 PR: https://git.openjdk.org/jdk/pull/18370 From aph at openjdk.org Sat Mar 23 09:45:26 2024 From: aph at openjdk.org (Andrew Haley) Date: Sat, 23 Mar 2024 09:45:26 GMT Subject: RFR: 8328614: hsdis: dlsym can't find decode symbol [v2] In-Reply-To: References: Message-ID: On Fri, 22 Mar 2024 13:38:54 GMT, Robbin Ehn wrote: > > > is this how you want us to export these
symbols? > > > > > > Close but no cigar. :-) > > Use `JNIEXPORT` instead; it is properly defined for this purpose and works on all compilers. You will need to also add: > > ``` > > #include "jni.h" > > ``` > > > > If this is not picked up correctly, let me know and I'll help you get the include paths correct in the build. > > It's a standalone library, should we really make it dependent on the JDK? No. And neither should we compile or link it with "-fvisibility=hidden". That is the root of this problem. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18400#issuecomment-2016427942 From duke at openjdk.org Sat Mar 23 10:01:32 2024 From: duke at openjdk.org (Francesco Nigro) Date: Sat, 23 Mar 2024 10:01:32 GMT Subject: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v14] In-Reply-To: References: Message-ID: On Sat, 23 Mar 2024 02:16:57 GMT, Scott Gibbons wrote: >> Re-write the IndexOf code without the use of the pcmpestri instruction, only using AVX2 instructions. This change accelerates String.IndexOf on average 1.3x for AVX2.
The benchmark numbers: >> >> >> Benchmark Score Latest >> StringIndexOf.advancedWithMediumSub 343.573 317.934 0.925375393x >> StringIndexOf.advancedWithShortSub1 1039.081 1053.96 1.014319384x >> StringIndexOf.advancedWithShortSub2 55.828 110.541 1.980027943x >> StringIndexOf.constantPattern 9.361 11.906 1.271872663x >> StringIndexOf.searchCharLongSuccess 4.216 4.218 1.000474383x >> StringIndexOf.searchCharMediumSuccess 3.133 3.216 1.02649218x >> StringIndexOf.searchCharShortSuccess 3.76 3.761 1.000265957x >> StringIndexOf.success 9.186 9.713 1.057369911x >> StringIndexOf.successBig 14.341 46.343 3.231504079x >> StringIndexOfChar.latin1_AVX2_String 6220.918 12154.52 1.953814533x >> StringIndexOfChar.latin1_AVX2_char 5503.556 5540.044 1.006629895x >> StringIndexOfChar.latin1_SSE4_String 6978.854 6818.689 0.977049957x >> StringIndexOfChar.latin1_SSE4_char 5657.499 5474.624 0.967675646x >> StringIndexOfChar.latin1_Short_String 7132.541 6863.359 0.962260014x >> StringIndexOfChar.latin1_Short_char 16013.389 16162.437 1.009307711x >> StringIndexOfChar.latin1_mixed_String 7386.123 14771.622 1.999915517x >> StringIndexOfChar.latin1_mixed_char 9901.671 9782.245 0.987938803 > > Scott Gibbons has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 46 commits: > > - Merge branch 'openjdk:master' into indexof > - Cleaned up, ready for review > - Pre-cleanup code > - Add JMH. Add 16-byte compares to arrays_equals > - Better method for mask creation > - Merge branch 'openjdk:master' into indexof > - Most cleanup done. > - Remove header dependency > - Works - needs cleanup > - Passes tests. > - ... and 36 more: https://git.openjdk.org/jdk/compare/bc739639...e079fc12 Hi, in Netty, we have our own AsciiString::indexOf based on SWAR techniques, which is manually loop unrolling the head processing (first < 8 bytes) to artificially make sure the branch predictor got different branches to care AND JIT won't make it wrong. 
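For context, the SWAR idea referred to above looks roughly like this in C. This is a generic sketch of the technique, not Netty's code, and all names are invented:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Returns nonzero iff some byte of `word` equals `byte`. Classic SWAR
 * "haszero" trick: XOR turns matching bytes into 0x00, and the
 * (x - 0x01..01) & ~x & 0x80..80 expression flags exactly the zero bytes. */
uint64_t contains_byte(uint64_t word, uint8_t byte) {
    uint64_t x = word ^ (0x0101010101010101ULL * byte);
    return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}

/* First index of `byte` in `s`, scanning 8 bytes per iteration; the scalar
 * loop re-scans the hit word and handles the final < 8 bytes. */
ptrdiff_t index_of_byte(const uint8_t *s, size_t len, uint8_t byte) {
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);
        if (contains_byte(w, byte)) break;  /* hit is somewhere in this word */
    }
    for (; i < len; i++)
        if (s[i] == byte) return (ptrdiff_t)i;
    return -1;
}
```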
We have measured (I can provide a link to the benchmark and results, if you are interested) that it delivers much better performance on tiny strings and degrades more smoothly versus perfectly aligned string lengths as well. Clearly this tends to be much more visible if the input strings have shuffled delimiter positions, to make the branch prediction misses more relevant. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16753#issuecomment-2016432345 From sgibbons at openjdk.org Sat Mar 23 16:50:28 2024 From: sgibbons at openjdk.org (Scott Gibbons) Date: Sat, 23 Mar 2024 16:50:28 GMT Subject: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v14] In-Reply-To: References: Message-ID: On Sat, 23 Mar 2024 09:57:49 GMT, Francesco Nigro wrote: >> Scott Gibbons has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 46 commits: >> >> - Merge branch 'openjdk:master' into indexof >> - Cleaned up, ready for review >> - Pre-cleanup code >> - Add JMH. Add 16-byte compares to arrays_equals >> - Better method for mask creation >> - Merge branch 'openjdk:master' into indexof >> - Most cleanup done. >> - Remove header dependency >> - Works - needs cleanup >> - Passes tests. >> - ... and 36 more: https://git.openjdk.org/jdk/compare/bc739639...e079fc12 > > Hi, in Netty, we have our own AsciiString::indexOf based on SWAR techniques, which is manually loop unrolling the head processing (first < 8 bytes) to artificially make sure the branch predictor got different branches to care AND JIT won't make it wrong. We have measured (I can provide a link to the benchmark and results, if you are interested) that it delivers much better performance on tiny strings and degrades more smoothly versus perfectly aligned string lengths as well. Clearly this tends to be much more visible if the input strings have shuffled delimiter positions, to make the branch prediction misses more relevant. Hi @franz1981.
I'd be interested in seeing the code. I looked [here](https://github.com/netty/netty/blob/3a3f9d13b129555802de5652667ca0af662f554e/common/src/main/java/io/netty/util/AsciiString.java#L696), but that just looks like a naïve implementation, so I must be missing something. This code uses vector compares to search for the first byte of the substring (needle) in 32-byte chunks of the input string (haystack). However, once a character is found, it also checks for a match corresponding to the last byte of the needle within the haystack before doing the full needle comparison. This is also done in 32-byte chunks. That is, we load a vector register with 32 (or 16 for wide chars) copies of the first byte of the needle, and another with copies of the last byte of the needle. The first comparison is done at the start of the haystack, giving us an indication of the presence and index of the first byte. We then compare the last byte of the needle at the haystack indexed at needle length - 1 (i.e., the last byte). This tells us if the last byte of the needle appears in the correct relative position within the haystack. ANDing these results tells us whether or not we have a candidate needle within the haystack, as well as the position of the needle. Only then do we do a full char-by-char comparison of the needle to the haystack (this is also done with vector instructions when possible). A lot of this code is there to handle the cases of small-ish needles within small-ish haystacks, where 32-byte reads are not possible (due to reading past the end of the strings, possibly generating a page fault). I handle less-than-32-byte haystacks by copying the haystack to the stack, where I can be assured that 32-byte reads will be possible. So there are special cases for haystacks < 32 bytes with needle sizes <= 10 (arbitrary) and one for haystacks > 32 bytes and needle size <= 10. I also added a section for haystacks <= 32 bytes and needle sizes < 5, which seem to be the most common cases.
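A scalar C sketch of the candidate-filtering scheme described above (an illustration only; the real code performs both byte tests 32 positions at a time with vector compares and ANDs the resulting masks):

```c
#include <stdint.h>
#include <stddef.h>

/* Position i is a candidate only if haystack[i] matches the needle's first
 * byte AND haystack[i + nee_len - 1] matches its last byte; only candidates
 * get the full comparison. */
ptrdiff_t index_of(const uint8_t *hay, size_t hay_len,
                   const uint8_t *needle, size_t nee_len) {
    if (nee_len == 0) return 0;
    uint8_t first = needle[0], last = needle[nee_len - 1];
    for (size_t i = 0; i + nee_len <= hay_len; i++) {
        if (hay[i] != first || hay[i + nee_len - 1] != last)
            continue;                      /* cheap reject, no full compare */
        size_t k = 1;
        while (k + 1 < nee_len && hay[i + k] == needle[k]) k++;
        if (k + 1 >= nee_len) return (ptrdiff_t)i;
    }
    return -1;
}
```

The cheap first/last-byte reject is what makes the full comparison rare; the vectorized version gets the same effect for 32 candidate positions per iteration.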
This path copies the haystack to the stack (a single vector read & write) and up to 5 vector comparisons, one for each byte of the needle, with no branching or looping. I'd be very interested in seeing the Netty SWAR implementation. Thanks for the comment. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16753#issuecomment-2016544706 From duke at openjdk.org Sat Mar 23 19:02:28 2024 From: duke at openjdk.org (Francesco Nigro) Date: Sat, 23 Mar 2024 19:02:28 GMT Subject: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v14] In-Reply-To: References: Message-ID: On Sat, 23 Mar 2024 02:16:57 GMT, Scott Gibbons wrote: >> Re-write the IndexOf code without the use of the pcmpestri instruction, only using AVX2 instructions. This change accelerates String.IndexOf on average 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark Score Latest >> StringIndexOf.advancedWithMediumSub 343.573 317.934 0.925375393x >> StringIndexOf.advancedWithShortSub1 1039.081 1053.96 1.014319384x >> StringIndexOf.advancedWithShortSub2 55.828 110.541 1.980027943x >> StringIndexOf.constantPattern 9.361 11.906 1.271872663x >> StringIndexOf.searchCharLongSuccess 4.216 4.218 1.000474383x >> StringIndexOf.searchCharMediumSuccess 3.133 3.216 1.02649218x >> StringIndexOf.searchCharShortSuccess 3.76 3.761 1.000265957x >> StringIndexOf.success 9.186 9.713 1.057369911x >> StringIndexOf.successBig 14.341 46.343 3.231504079x >> StringIndexOfChar.latin1_AVX2_String 6220.918 12154.52 1.953814533x >> StringIndexOfChar.latin1_AVX2_char 5503.556 5540.044 1.006629895x >> StringIndexOfChar.latin1_SSE4_String 6978.854 6818.689 0.977049957x >> StringIndexOfChar.latin1_SSE4_char 5657.499 5474.624 0.967675646x >> StringIndexOfChar.latin1_Short_String 7132.541 6863.359 0.962260014x >> StringIndexOfChar.latin1_Short_char 16013.389 16162.437 1.009307711x >> StringIndexOfChar.latin1_mixed_String 7386.123 14771.622 1.999915517x >> StringIndexOfChar.latin1_mixed_char 9901.671 9782.245 0.987938803 > > Scott Gibbons 
has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 46 commits: > > - Merge branch 'openjdk:master' into indexof > - Cleaned up, ready for review > - Pre-cleanup code > - Add JMH. Add 16-byte compares to arrays_equals > - Better method for mask creation > - Merge branch 'openjdk:master' into indexof > - Most cleanup done. > - Remove header dependency > - Works - needs cleanup > - Passes tests. > - ... and 36 more: https://git.openjdk.org/jdk/compare/bc739639...e079fc12 Sure thing: https://github.com/netty/netty/pull/13534#issuecomment-1685247165 It's the comparison with String::indexOf, while the impl is: https://github.com/netty/netty/blob/3a3f9d13b129555802de5652667ca0af662f554e/buffer/src/main/java/io/netty/buffer/ByteBufUtil.java#L590 ------------- PR Comment: https://git.openjdk.org/jdk/pull/16753#issuecomment-2016576139 From jbhateja at openjdk.org Sun Mar 24 10:04:48 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sun, 24 Mar 2024 10:04:48 GMT Subject: RFR: 8328181: C2: assert(MaxVectorSize >= 32) failed: vector length should be >= 32 Message-ID: This bug fix patch tightens the predication check for the small constant-length clear-array pattern and relaxes the associated feature checks. Modified a few comments for clarity. Kindly review and approve.
Best Regards, Jatin ------------- Commit messages: - Adding Testpoint - Some comments modifications - 8328181: C2: assert(MaxVectorSize >= 32) failed: vector length should be >= 32 Changes: https://git.openjdk.org/jdk/pull/18464/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18464&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8328181 Stats: 21 lines in 5 files changed: 6 ins; 0 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/18464.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18464/head:pull/18464 PR: https://git.openjdk.org/jdk/pull/18464 From gcao at openjdk.org Mon Mar 25 01:17:27 2024 From: gcao at openjdk.org (Gui Cao) Date: Mon, 25 Mar 2024 01:17:27 GMT Subject: RFR: 8328404: RISC-V: Fix potential crash in C2_MacroAssembler::arrays_equals [v2] In-Reply-To: References: Message-ID: On Wed, 20 Mar 2024 02:47:07 GMT, Fei Yang wrote: >> Gui Cao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: >> >> - Merge remote-tracking branch 'upstream/master' into JDK-8328404 >> - 8328404: RISC-V: Fix potential crash in C2_MacroAssembler::arrays_equals > > Looks fine. In fact, array[0] could have an alignment of 32 bits after JDK-8139457 when running with -XX:-UseCompressedClassPointers. In this case, we have base_offset = 20 (bytes). It will also be an issue when we add support for lilliput on riscv some day, in which case we will have base_offset = 12 (bytes). @RealFYang : Thanks for your review.
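The shape of the fix under discussion, namely to only issue a full longword load while at least a longword's worth of elements remains, can be sketched in portable C (an illustration of the idea, not the RISC-V assembly):

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>
#include <stddef.h>

/* Compare 8-byte chunks only while at least 8 bytes remain, then finish
 * with a byte loop, so no load ever touches memory past the arrays. */
bool arrays_equal(const uint8_t *a, const uint8_t *b, size_t len) {
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t wa, wb;
        memcpy(&wa, a + i, 8);
        memcpy(&wb, b + i, 8);
        if (wa != wb) return false;
    }
    for (; i < len; i++)        /* remaining 0..7 bytes, loaded one by one */
        if (a[i] != b[i]) return false;
    return true;
}
```

The unsafe variant would round the tail up to a full longword load, which is exactly what can cross the object boundary when the array payload is not 64-bit aligned.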
------------- PR Comment: https://git.openjdk.org/jdk/pull/18370#issuecomment-2017041489 From gcao at openjdk.org Mon Mar 25 01:21:30 2024 From: gcao at openjdk.org (Gui Cao) Date: Mon, 25 Mar 2024 01:21:30 GMT Subject: Integrated: 8328404: RISC-V: Fix potential crash in C2_MacroAssembler::arrays_equals In-Reply-To: References: Message-ID: On Tue, 19 Mar 2024 03:44:10 GMT, Gui Cao wrote: > Hi, the current behavior of C2_MacroAssembler::arrays_equals is to always load a longword before comparison. > When array[0] is aligned to 32 bits (especially after JDK-8139457, which tries to relax the alignment > of array elements), the last longword load will exceed the array limit and may touch the next > word beyond the object layout in heap memory. So this should bear a similar problem to JDK-8328138. > > The proposed fix changes this behavior and aligns with the handling in C2_MacroAssembler::string_equals, > which will check the number of remaining array elements before loading the next longword. > No obvious changes witnessed from the JMH numbers or benchmarks like SPECjbb2015. > > The patch also removed the AvoidUnalignedAccesses check in C2_MacroAssembler::string_equals, as we > don't see extra performance gain when setting AvoidUnalignedAccesses to false when testing the > JMH tests or benchmarks like SPECjbb2015 on three popular RISC-V hardware platforms. We can > consider adding it back if it turns out to be useful on future new hardware. > > > ### Correctness test: > - [x] Run tier1-3, hotspot:tier4 tests on LicheePi 4A (release) > - [x] Run tier1-3, hotspot:tier4 tests on SOPHON SG2042 (release) > > > ### JMH test: > > #### 1. test/micro/org/openjdk/bench/java/util/ArraysEquals.java > 1. SiFive unmatched > > Before: > Benchmark Mode Cnt Score Error Units > ArraysEquals.testByteFalseBeginning avgt 12 37.804 ± 7.292 ns/op > ArraysEquals.testByteFalseEnd avgt 12 77.972 ± 3.208 ns/op > ArraysEquals.testByteFalseMid avgt 12 54.427 ± 6.436 ns/op > ArraysEquals.testByteTrue avgt 12 75.121 ± 5.172 ns/op > ArraysEquals.testCharFalseBeginning avgt 12 42.486 ± 6.526 ns/op > ArraysEquals.testCharFalseEnd avgt 12 122.208 ± 2.533 ns/op > ArraysEquals.testCharFalseMid avgt 12 83.891 ± 3.680 ns/op > ArraysEquals.testCharTrue avgt 12 122.096 ± 5.519 ns/op > > After: > Benchmark Mode Cnt Score Error Units > ArraysEquals.testByteFalseBeginning avgt 12 32.638 ± 7.279 ns/op > ArraysEquals.testByteFalseEnd avgt 12 73.013 ± 8.081 ns/op > ArraysEquals.testByteFalseMid avgt 12 43.619 ± 6.104 ns/op > ArraysEquals.testByteTrue avgt 12 83.044 ± 8.207 ns/op > ArraysEquals.testCharFalseBeginning avgt 12 39.154 ± 5.233 ns/op > ArraysEquals.testCharFalseEnd avgt 12 122.072 ± 7.784 ns/op > ArraysEquals.testCharFalseMid avgt 12 67.831 ± 9.218 ns/op > Ar... This pull request has now been integrated. Changeset: c7b9dc46 Author: Gui Cao Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/c7b9dc463a7e0347fc2e2ce5578e8fb39ea0b733 Stats: 159 lines in 3 files changed: 42 ins; 59 del; 58 mod 8328404: RISC-V: Fix potential crash in C2_MacroAssembler::arrays_equals Reviewed-by: fyang ------------- PR: https://git.openjdk.org/jdk/pull/18370 From epeter at openjdk.org Mon Mar 25 06:22:27 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 25 Mar 2024 06:22:27 GMT Subject: RFR: 8323220: Reassociate loop invariants involved in Cmps and Add/Subs [v14] In-Reply-To: References: Message-ID: <0RAuJpGYev-zLd52TE7PCDkxPoXrRT0RNEzzepwVMhc=.46718148-27a3-4741-a09a-021caae72f9f@github.com> On Fri, 22 Mar 2024 18:48:56 GMT, Joshua Cao wrote: >> // inv1 == (x + inv2) => ( inv1 - inv2 ) == x >> // inv1 == (x - inv2) => ( inv1 + inv2 ) == x >> // inv1 == (inv2 - x) => (-inv1 + inv2 ) == x >> >> >> For example, >> >> >> fn(inv1, inv2) >> while(...) >> x = foobar() >> if inv1 == x + inv2 >> blackhole() >> >> >> We can transform this into >> >> >> fn(inv1, inv2) >> t = inv1 - inv2 >> while(...)
>> x = foobar() >> if t == x >> blackhole() >> >> Here is an example: https://github.com/openjdk/jdk/blob/b78896b9aafcb15f453eaed6e154a5461581407b/src/java.base/share/classes/java/lang/invoke/LambdaFormEditor.java#L910. LHS `1` and RHS `pos` are both loop invariant >> >> Passes tier1 locally on my Linux machine. Passes GHA on my fork. > Joshua Cao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 17 additional commits since the last revision: > > - Merge branch 'master' into licm > - @run driver -> @run main > - Add tests for add/sub reassociation > - Merge branch 'master' into licm > - Make inputs deterministic. Make size an arg. Fix comments. Formatting. > - Update test to utilize @setup method for arguments > - Merge branch 'master' into licm > - Add correctness test for some random tests with random inputs > - Add some correctness tests where we do reassociate > - Remove unused TestInfo parameter. Have some tests exit mid-loop. > - ... and 7 more: https://git.openjdk.org/jdk/compare/20189628...32cb9c0d Code looks good, running testing now... Ping me again in 2 days if I don't report back by then ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/17375#issuecomment-2017306067 From jkarthikeyan at openjdk.org Mon Mar 25 06:26:32 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 25 Mar 2024 06:26:32 GMT Subject: Integrated: 8324655: Identify integer minimum and maximum patterns created with if statements In-Reply-To: References: Message-ID: On Thu, 25 Jan 2024 18:15:21 GMT, Jasmine Karthikeyan wrote: > Hi all, I've created this patch which aims to convert common integer minimum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`.
While this transform doesn't generally improve code generation on its own, it simplifies control flow and creates new opportunities for vectorization. > > I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: > > Baseline Patch Improvement > Benchmark Mode Cnt Score Error Units Score Error Units > IfMinMax.testReductionInt avgt 15 500.307 ± 16.687 ns/op 509.383 ± 32.645 ns/op (no change)* > IfMinMax.testReductionLong avgt 15 493.184 ± 17.596 ns/op 513.587 ± 28.339 ns/op (no change)* > IfMinMax.testSingleInt avgt 15 3.588 ± 0.540 ns/op 2.965 ± 1.380 ns/op (no change) > IfMinMax.testSingleLong avgt 15 3.673 ± 0.128 ns/op 3.506 ± 0.590 ns/op (no change) > IfMinMax.testVectorInt avgt 15 340.425 ± 13.123 ns/op 59.689 ± 7.509 ns/op + 5.7x > IfMinMax.testVectorLong avgt 15 326.420 ± 15.554 ns/op 117.190 ± 5.622 ns/op + 2.8x > > > * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? > > The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated! This pull request has now been integrated.
Changeset: 9f920b9b Author: Jasmine Karthikeyan Committer: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/9f920b9bbf8a64e2c2db085cf3da30db37c0d1bc Stats: 708 lines in 7 files changed: 704 ins; 0 del; 4 mod 8324655: Identify integer minimum and maximum patterns created with if statements Reviewed-by: epeter, qamai ------------- PR: https://git.openjdk.org/jdk/pull/17574 From chagedorn at openjdk.org Mon Mar 25 07:43:23 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 25 Mar 2024 07:43:23 GMT Subject: RFR: 8321278: C2: Partial peeling fails with assert "last_peel <- first_not_peeled" [v3] In-Reply-To: References: Message-ID: On Fri, 22 Mar 2024 16:48:35 GMT, Roland Westrelin wrote: >> The assert fails because peeling happens at a single entry >> `Region`. That `Region` only has a single input because other inputs >> were found unreachable and removed by >> `PhaseIdealLoop::Dominators()`. The fix I propose is to have >> `PhaseIdealLoop::Dominators()` remove the `Region` and its `Phi`s >> entirely in this case. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/18353#pullrequestreview-1957034480 From chagedorn at openjdk.org Mon Mar 25 07:43:23 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 25 Mar 2024 07:43:23 GMT Subject: RFR: 8321278: C2: Partial peeling fails with assert "last_peel <- first_not_peeled" [v3] In-Reply-To: References: Message-ID: On Fri, 22 Mar 2024 16:40:32 GMT, Roland Westrelin wrote: > Thanks for reviewing this. > > > Removing the regions during `Dominators()` seems reasonable. I guess doing a pass of IGVN after `Dominators()` is probably too much to get these regions removed? > > It does feel like a lot of overhead for such a simple corner case. Indeed, I don't think it's worth it.
> > Could you also remove the region and the phi where the unreachable loops are cleaned up and the region and phis become single-entry nodes? I.e. here: > > I considered it but it felt harder. The algorithm collects cfg nodes in dfs and then iterates over them several times. If we remove a region at the point you mention, it would need to be removed from the dfs node list. Or we would need to make later iterations over the dfs node list handle dead region nodes. It's not a problem when it's done later as I propose here. I see, thanks for the explanation. Then it makes sense to handle this edge-case like you proposed to keep things simple. Maybe you can add a comment explaining why we remove the region at this point. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18353#discussion_r1537156282 From chagedorn at openjdk.org Mon Mar 25 08:02:23 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 25 Mar 2024 08:02:23 GMT Subject: RFR: 8324121: SIGFPE in PhaseIdealLoop::extract_long_range_checks [v2] In-Reply-To: References: Message-ID: On Fri, 22 Mar 2024 16:33:30 GMT, Roland Westrelin wrote: >> Both failures occur because `ABS(scale * stride_con)` overflows (scale >> a really large long number). I reworked the test so overflow is no >> longer an issue. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18397#pullrequestreview-1957061118 From ihse at openjdk.org Mon Mar 25 09:14:25 2024 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Mon, 25 Mar 2024 09:14:25 GMT Subject: RFR: 8328614: hsdis: dlsym can't find decode symbol [v2] In-Reply-To: References: Message-ID: On Sat, 23 Mar 2024 09:42:33 GMT, Andrew Haley wrote: > And neither should we compile or link it with "-fvisibility=hidden". That is the root of this problem.
If you suggest that we should not compile hsdis with hidden visibility, I disagree. I have been working hard on unifying the build of native libraries across the entire product, to fix holes where we have not used a consistent way of compiling and/or linking. There is no reason to treat hsdis differently. If I restore using hidden visibility as an option that all native libraries, except hsdis, must opt in to, then we are just back to square one, and suddenly someone will forget about it. Instead, now we set -fvisibility=hidden in configure so nobody can forget about it. That was the more general argument for aligning compilation and linking flags and behavior across the product. Regarding the symbol visibility more specifically, there is also the question of consistency across platforms. On Windows, the behavior corresponding to "-fvisibility=hidden" is always turned on. Functions can only be exported if they are explicitly annotated in the source code (or specified otherwise to the linker). So we are in any case forced to export functions on Windows. Let's have a look at this patch. Currently we have code like:

#ifdef _WIN32
__declspec(dllexport)
#endif

Robbin proposes to change this to

#if defined(_WIN32)
__declspec(dllexport)
#elif defined(_GNU_SOURCE)
__attribute__ ((visibility ("default")))
#endif

My counter-proposal was to replace it with just `JNIEXPORT`. Surely you can't say that is a worse solution? ------------- PR Comment: https://git.openjdk.org/jdk/pull/18400#issuecomment-2017526903 From ihse at openjdk.org Mon Mar 25 09:14:26 2024 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Mon, 25 Mar 2024 09:14:26 GMT Subject: RFR: 8328614: hsdis: dlsym can't find decode symbol [v2] In-Reply-To: References: Message-ID: On Thu, 21 Mar 2024 06:58:43 GMT, Robbin Ehn wrote: >> Hi, please consider. >> >> [8327045](https://bugs.openjdk.org/browse/JDK-8327045) hide these symbols. >> Tested with gcc and clang, and llvm and binutils backend.
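For readers following along, the export annotations being debated boil down to a macro along these lines (a simplified sketch; `EXPORTED` and `decode_marker` are invented names used only for illustration, not the hsdis API):

```c
/* One macro covers both cases: on Windows export is opt-in via
 * __declspec(dllexport); on GCC/Clang with -fvisibility=hidden it is the
 * "default" visibility attribute that re-exposes a symbol. JNIEXPORT in
 * jni.h expands to essentially the same thing. */
#ifdef _WIN32
  #define EXPORTED __declspec(dllexport)
#else
  #define EXPORTED __attribute__((visibility("default")))
#endif

/* An hsdis-style entry point that stays visible to dlsym() even when the
 * library is built with -fvisibility=hidden. The body is a stand-in. */
EXPORTED int decode_marker(void) {
    return 42;
}
```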
>> >> I didn't find any use of the "DLL_ENTRY", so I removed it. >> >> Thanks, Robbin > > Robbin Ehn has updated the pull request incrementally with one additional commit since the last revision: > > remove swap file (In fact, I think we have a problem everywhere in the code base where someone is using `__declspec(dllexport)` directly) ------------- PR Comment: https://git.openjdk.org/jdk/pull/18400#issuecomment-2017527991 From galder at openjdk.org Mon Mar 25 09:18:24 2024 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 25 Mar 2024 09:18:24 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v7] In-Reply-To: <-xxO4c_DQN6d1OPXISLhO-nAQ9-rNKKv8F7XMDXlZes=.6af80818-d0c7-4957-82f0-7498187d64cd@github.com> References: <-xxO4c_DQN6d1OPXISLhO-nAQ9-rNKKv8F7XMDXlZes=.6af80818-d0c7-4957-82f0-7498187d64cd@github.com> Message-ID: On Thu, 21 Mar 2024 02:39:35 GMT, Dean Long wrote: > I only wanted the comments around the boilerplate force_reexecute() logic, but if you are happy with my idea to move that logic into LIRGenerator::state_for then the comment could go there. If not, I may look at it in a follow-up RFE, because I would like to get rid of the force_reexecute() hack that I added and see if I can instead tie it to the use of state_before() or ValueStack::StateBefore. It might be better handled as a follow-up. The reason for the boilerplate code to look the way it does is because the original code didn't set `set_force_reexecute`. So by shaping the code in that way, I was trying to limit the impact of my changes to the specific case of my clone intrinsic, where both state before and force reexecute are set for the new type array and array copy intrinsics. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/17667#issuecomment-2017534977 From galder at openjdk.org Mon Mar 25 09:51:26 2024 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 25 Mar 2024 09:51:26 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v7] In-Reply-To: References: Message-ID: On Thu, 21 Mar 2024 03:01:24 GMT, Dean Long wrote: > I don't think target-specific logic belongs here. And I don't understand the point about Phi nodes. Isn't the holder_known flag enough? In my testing `holder_known` was not enough to detect objects that are not Phi. For example:

    static int[] test(int[] ints) {
        return ints.clone();
    }

`holder_known` is false when it tries to C1 compile `ints.clone()`, am I missing something here? > For primitive arrays, isn't it true that inline_target->get_Method()->intrinsic_id() == vmIntrinsics::_clone? Possibly, but in this part of the logic I'm trying to find situations in which I don't want to apply the `clone` intrinsic. And those situations are non-array objects, and for arrays, those whose elements are not primitives. I don't see how I can craft such a condition with only `inline_target->get_Method()->intrinsic_id() == vmIntrinsics::_clone`? IOW, that condition might be true for primitive arrays, but is it false for non-array objects and non-primitive arrays? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17667#discussion_r1537319131 From chagedorn at openjdk.org Mon Mar 25 10:26:36 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 25 Mar 2024 10:26:36 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 14:25:57 GMT, Emanuel Peter wrote: > I'm refactoring the packset, separating the details of packset-manipulation from the SuperWord algorithm. > > Most importantly: I split it into two classes: `PairSet` and `PackSet`.
> `combine_pairs_to_longer_packs` converts the first into the second. > > I was able to simplify the combining, and remove the pack-sorting. > I now walk "pair-chains" directly with `PairSetIterator`. One such pair-chain is equivalent to a pack. > > I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm. > > I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`: > Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. we would pack `ac bd` instead of `ab cd`). Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed, if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)). > > But since I already spent some time on some of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I. > > More details are described in the annotations in the code. Nice refactoring. Here are some first comments about the pair set. Will continue later. Just a side note: Would it have been possible to split this RFE into a pair set and a pack set refactoring separately without big efforts due to dependencies between them? It might simplify the review. src/hotspot/share/opto/superword.cpp line 1038: > 1036: > 1037: // Extend pairset by following use->def and def->use links from pair members. > 1038: void SuperWord::extend_pairset_with_more_pairs_by_following_use_and_def() { Could this method (and possibly other methods only called in the context of this method) also be part of the new `PairSet` class? 
src/hotspot/share/opto/superword.cpp line 1144: > 1142: } > 1143: #endif > 1144: bool changed = false; I think you could directly return true or false at the very end of the method where you set the flag. src/hotspot/share/opto/superword.cpp line 1215: > 1213: } > 1214: > 1215: // For a def-pair (def1. def2), and their use-nodes (use1, use2): Suggestion: // For a def-pair (def1, def2), and their use-nodes (use1, use2): src/hotspot/share/opto/superword.cpp line 1216: > 1214: > 1215: // For a def-pair (def1. def2), and their use-nodes (use1, use2): > 1216: // ensure that the input order of (use1, use2) matches the order of (def1, def2). Suggestion: // Ensure that the input order of (use1, use2) matches the order of (def1, def2). src/hotspot/share/opto/superword.cpp line 1230: > 1228: // 2: Inputs of (use1, use2) already match (def1, def2), i.e. for all input indices i: > 1229: // > 1230: // use1->in(i) == def1 || use2->in(def2) -> use1->in(i) == def1 && use2->in(def2) Should this be `use2->in(i) == def2`? src/hotspot/share/opto/superword.cpp line 1232: > 1230: // use1->in(i) == def1 || use2->in(def2) -> use1->in(i) == def1 && use2->in(def2) > 1231: // > 1232: // 3: Add/Mul (use1, use2): we can try to swap edges: Is it not required for other nodes? src/hotspot/share/opto/superword.cpp line 1248: > 1246: // Therefore, extend_pairset_with_more_pairs_by_following_use cannot extend to MulAddS2I, > 1247: // but there is a chance that extend_pairset_with_more_pairs_by_following_def can do it. > 1248: // Nice summary :-) Just a suggestion, should we move the second case (already ordered) last to match the order in the code below? Then you could say that in all other cases, we're good, i.e. src/hotspot/share/opto/superword.cpp line 1254: > 1252: // 1. Reduction > 1253: if (is_marked_reduction(use1) && is_marked_reduction(use2)) { > 1254: Node* first = use1->in(2); Was like that before but shouldn't this be named `second` or `second_input` instead of `first`? 
It's gonna be the first input only after the swap. src/hotspot/share/opto/superword.cpp line 1282: > 1280: use2->swap_edges(3, 4); > 1281: } > 1282: if (i1 == 3 - i2 || i1 == 7 - i2) { // ((i1 == 1 && i2 == 2) || (i1 == 2 && i2 == 1) || (i1 == 3 && i2 == 4) || (i1 == 4 && i2 == 3)) Both comments are a bit long, should we wrap them? Maybe like that: // (i1 == 3 && i2 == 2) // (i1 == 2 && i2 == 3) or // (i1 == 1 && i2 == 4) or // (i1 == 4 && i2 == 1) or if (i1 == 5 - i2) { ... src/hotspot/share/opto/superword.cpp line 1288: > 1286: return PairOrderStatus::Unknown; > 1287: } else { > 1288: // The inputs are not ordered, and we can not do anything about it. Suggestion: // The inputs are not ordered, and we cannot do anything about it. src/hotspot/share/opto/superword.cpp line 1303: > 1301: } > 1302: > 1303: // Estimate the savings from executing s1 and s2 as a pack Suggestion: // Estimate the savings from executing s1 and s2 as a pair. src/hotspot/share/opto/superword.cpp line 1304: > 1302: > 1303: // Estimate the savings from executing s1 and s2 as a pack > 1304: int SuperWord::estimate_cost_savings_when_packing_pair(const Node* s1, const Node* s2) const { Nit: Should we add a `as` or `to`? Suggestion: int SuperWord::estimate_cost_savings_when_packing_as_pair(const Node* s1, const Node* s2) const { src/hotspot/share/opto/superword.cpp line 1307: > 1305: int save_in = 2 - 1; // 2 operations per instruction in packed form > 1306: > 1307: auto adjacent_profit = [&] (Node* s1, Node* s2) { return 2; }; You can remove `s1` and `s2` since you always return 2. src/hotspot/share/opto/superword.cpp line 1338: > 1336: for (DUIterator_Fast kmax, k = s2->fast_outs(kmax); k < kmax; k++) { > 1337: if (use2 == s2->fast_out(k)) { > 1338: ct++; Maybe add the following as a visual aid? 
     s1    s2
      |     |
    [use1, use2]

src/hotspot/share/opto/superword.cpp line 1363: > 1361: for (PairSetIterator pair(_pairset); !pair.done(); pair.next()) { > 1362: Node* s1 = pair.left(); > 1363: Node* s2 = pair.right(); Maybe you also want to name them `left` and `right` instead of `s1` and `s2`. src/hotspot/share/opto/superword.cpp line 1364: > 1362: Node* s1 = pair.left(); > 1363: Node* s2 = pair.right(); > 1364: if (_pairset.is_left_in_a_left_most_pair(s1)) { Should we also assert here that `pack == nullptr` before creating the new list? src/hotspot/share/opto/superword.cpp line 1368: > 1366: pack->push(s1); > 1367: } > 1368: pack->push(s2); We should also assert that at this point, `pack != nullptr`. ------------- PR Review: https://git.openjdk.org/jdk/pull/18276#pullrequestreview-1957111113 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537203711 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537268401 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537243897 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537244514 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537332755 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537338269 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537324987 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537255606 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537323963 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537318162 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537304354 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537305714 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537287795 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537315499 PR Review Comment:
https://git.openjdk.org/jdk/pull/18276#discussion_r1537351581 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537352946 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537353730 From dlunden at openjdk.org Mon Mar 25 10:57:22 2024 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Mon, 25 Mar 2024 10:57:22 GMT Subject: RFR: 8326438: C2: assert(ld->in(1)->Opcode() == Op_LoadN) failed: Assumption invalid: input to DecodeN is not LoadN [v2] In-Reply-To: References: Message-ID: On Thu, 21 Mar 2024 17:16:19 GMT, Vladimir Kozlov wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Elaborate in LoadN comment > > Good. > >> Add label to JBS something `test*hard`. I don't remember its exact name. > > Never mind, I see you did it already. Thanks for the reviews @vnkozlov and @robcasloz. Please sponsor! ------------- PR Comment: https://git.openjdk.org/jdk/pull/18434#issuecomment-2017723598 From epeter at openjdk.org Mon Mar 25 11:14:32 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 25 Mar 2024 11:14:32 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 08:30:07 GMT, Christian Hagedorn wrote: >> I'm refactoring the packset, separating the details of packset-manipulation from the SuperWord algorithm. >> >> Most importantly: I split it into two classes: `PairSet` and `PackSet`. >> `combine_pairs_to_longer_packs` converts the first into the second. >> >> I was able to simplify the combining, and remove the pack-sorting. >> I now walk "pair-chains" directly with `PairSetIterator`. One such pair-chain is equivalent to a pack. >> >> I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm.
>> >> I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`: >> Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. we would pack `ac bd` instead of `ab cd`). Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed, if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)). >> >> But since I already spent some time on some of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I. >> >> More details are described in the annotations in the code. > > src/hotspot/share/opto/superword.cpp line 1038: > >> 1036: >> 1037: // Extend pairset by following use->def and def->use links from pair members. >> 1038: void SuperWord::extend_pairset_with_more_pairs_by_following_use_and_def() { > > Could this method (and possibly other methods only called in the context of this method) also be part of the new `PairSet` class? Not really. Inside we need lots of access to the SuperWord components (check `alignment`, `stmts_can_pack`, etc). You could always pass in a `SuperWord` reference, but that is not really nicer. Maybe it would be easier in the future, once I make some other changes. But we will always need to have other information available (dependency graph, types, etc - these are all VLoopAnalyzer submodules), but for now we also rely on `alignment`, which I plan to remove, and lives in SuperWord. 
> src/hotspot/share/opto/superword.cpp line 1144: > >> 1142: } >> 1143: #endif >> 1144: bool changed = false; > > I think you could directly return true or false at the very end of the method where you set the flag. Good point! > src/hotspot/share/opto/superword.cpp line 1230: > >> 1228: // 2: Inputs of (use1, use2) already match (def1, def2), i.e. for all input indices i: >> 1229: // >> 1230: // use1->in(i) == def1 || use2->in(def2) -> use1->in(i) == def1 && use2->in(def2) > > Should this be `use2->in(i) == def2`? Yes! > src/hotspot/share/opto/superword.cpp line 1248: > >> 1246: // Therefore, extend_pairset_with_more_pairs_by_following_use cannot extend to MulAddS2I, >> 1247: // but there is a chance that extend_pairset_with_more_pairs_by_following_def can do it. >> 1248: // > > Nice summary :-) Just a suggestion, should we move the second case (already ordered) last to match the order in the code below? Then you could say that in all other cases, we're good, i.e. Sounds good :) > src/hotspot/share/opto/superword.cpp line 1254: > >> 1252: // 1. Reduction >> 1253: if (is_marked_reduction(use1) && is_marked_reduction(use2)) { >> 1254: Node* first = use1->in(2); > > Was like that before but shouldn't this be named `second` or `second_input` instead of `first`? It's gonna be the first input only after the swap. I agree, it already was like that, and it disturbed me too. I thought I'd leave it since I'll probably overhaul this whole code soon anyway. But I'll fix it already since it disturbs you too. > src/hotspot/share/opto/superword.cpp line 1282: > >> 1280: use2->swap_edges(3, 4); >> 1281: } >> 1282: if (i1 == 3 - i2 || i1 == 7 - i2) { // ((i1 == 1 && i2 == 2) || (i1 == 2 && i2 == 1) || (i1 == 3 && i2 == 4) || (i1 == 4 && i2 == 3)) > > Both comments are a bit long, should we wrap them? Maybe like that: > > // (i1 == 3 && i2 == 2) > // (i1 == 2 && i2 == 3) or > // (i1 == 1 && i2 == 4) or > // (i1 == 4 && i2 == 1) or > if (i1 == 5 - i2) { > ...
I'd rather not spend too much time on this, since I'll overhaul it soon even more. I'm not touching the comments anyway ;) > src/hotspot/share/opto/superword.cpp line 1307: > >> 1305: int save_in = 2 - 1; // 2 operations per instruction in packed form >> 1306: >> 1307: auto adjacent_profit = [&] (Node* s1, Node* s2) { return 2; }; > > You can remove `s1` and `s2` since you always return 2. Sure ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537412950 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537416655 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537419417 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537418777 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537415805 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537418452 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537416752 From epeter at openjdk.org Mon Mar 25 11:22:30 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 25 Mar 2024 11:22:30 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 10:04:13 GMT, Christian Hagedorn wrote: >> I'm refactoring the packset, separating the details of packset-manipulation from the SuperWord algorithm. >> >> Most importantly: I split it into two classes: `PairSet` and `PackSet`. >> `combine_pairs_to_longer_packs` converts the first into the second. >> >> I was able to simplify the combining, and remove the pack-sorting. >> I now walk "pair-chains" directly with `PairSetIterator`. One such pair-chain is equivalent to a pack. >> >> I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm.
>> >> I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`: >> Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. we would pack `ac bd` instead of `ab cd`). Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed, if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)). >> >> But since I already spent some time on some of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I. >> >> More details are described in the annotations in the code. > > src/hotspot/share/opto/superword.cpp line 1232: > >> 1230: // use1->in(i) == def1 || use2->in(def2) -> use1->in(i) == def1 && use2->in(def2) >> 1231: // >> 1232: // 3: Add/Mul (use1, use2): we can try to swap edges: > > Is it not required for other nodes? We can only do it for associative nodes like `Mul / Add`, and of course all the nodes that are subclasses of those (e.g. Max, Min, Or, etc). I can improve the comment, since that may not be immediately clear to the reader. > src/hotspot/share/opto/superword.cpp line 1363: > >> 1361: for (PairSetIterator pair(_pairset); !pair.done(); pair.next()) { >> 1362: Node* s1 = pair.left(); >> 1363: Node* s2 = pair.right(); > > Maybe you also want to name them `left` and `right` instead of `s1` and `s2`. Just specifically here, or everywhere in the code? 
> src/hotspot/share/opto/superword.cpp line 1364: > >> 1362: Node* s1 = pair.left(); >> 1363: Node* s2 = pair.right(); >> 1364: if (_pairset.is_left_in_a_left_most_pair(s1)) { > > Should we also assert here that `pack == nullptr` before creating the new list? Good idea! > src/hotspot/share/opto/superword.cpp line 1368: > >> 1366: pack->push(s1); >> 1367: } >> 1368: pack->push(s2); > > We should also assert that at this point, `pack != nullptr`. Can do that, but I think it would just result in a nullptr exception anyway. I always wonder if an assert makes the code more cluttered, so less readable, or if it states expectations more explicitly, which makes the code more readable ? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537423998 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537425184 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537425986 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537427794 From rehn at openjdk.org Mon Mar 25 11:30:22 2024 From: rehn at openjdk.org (Robbin Ehn) Date: Mon, 25 Mar 2024 11:30:22 GMT Subject: RFR: 8328614: hsdis: dlsym can't find decode symbol [v2] In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 09:12:19 GMT, Magnus Ihse Bursie wrote: > (In fact, I think we have a problem everywhere in the code base where someone is using `__declspec(dllexport)` directly) src/java.base/share/native/libzip/zlib/zconf.h:# define ZEXPORT __declspec(dllexport) ZEXPORT is only defined on win32, so we must build with default on linux. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18400#issuecomment-2017790938 From ihse at openjdk.org Mon Mar 25 11:44:22 2024 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Mon, 25 Mar 2024 11:44:22 GMT Subject: RFR: 8328614: hsdis: dlsym can't find decode symbol [v2] In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 11:27:17 GMT, Robbin Ehn wrote: > src/java.base/share/native/libzip/zlib/zconf.h:# define ZEXPORT __declspec(dllexport) zlib is third party source that is copied into our repo. I did not mean that what I said for that applies to it. (Also, we should not really export *any* functions from zlib -- we encapsulate it in our own native shared library. I might have to look into if we are doing this wrong on Windows. But that is a separate question from this PR.) ------------- PR Comment: https://git.openjdk.org/jdk/pull/18400#issuecomment-2017814752 From rehn at openjdk.org Mon Mar 25 11:53:24 2024 From: rehn at openjdk.org (Robbin Ehn) Date: Mon, 25 Mar 2024 11:53:24 GMT Subject: RFR: 8328614: hsdis: dlsym can't find decode symbol [v2] In-Reply-To: References: Message-ID: On Thu, 21 Mar 2024 06:58:43 GMT, Robbin Ehn wrote: >> Hi, please consider. >> >> [8327045](https://bugs.openjdk.org/browse/JDK-8327045) hide these symbols. >> Tested with gcc and clang, and llvm and binutils backend. >> >> I didn't find any use of the "DLL_ENTRY", so I removed it. >> >> Thanks, Robbin > > Robbin Ehn has updated the pull request incrementally with one additional commit since the last revision: > > remove swap file > > src/java.base/share/native/libzip/zlib/zconf.h:# define ZEXPORT __declspec(dllexport) > > zlib is third party source that is copied into our repo. I did not mean that what I said for that applies to it. (Also, we should not really export _any_ functions from zlib -- we encapsulate it in our own native shared library. I might have to look into if we are doing this wrong on Windows. But that is a separate question from this PR.) 
Okay, so the difference is that @theRealAph and I consider hsdis a third party library, although the JDK folks maintain it. (correct me if I'm wrong) ------------- PR Comment: https://git.openjdk.org/jdk/pull/18400#issuecomment-2017828723 From chagedorn at openjdk.org Mon Mar 25 11:59:33 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 25 Mar 2024 11:59:33 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 11:06:24 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/superword.cpp line 1038: >> >>> 1036: >>> 1037: // Extend pairset by following use->def and def->use links from pair members. >>> 1038: void SuperWord::extend_pairset_with_more_pairs_by_following_use_and_def() { >> >> Could this method (and possibly other methods only called in the context of this method) also be part of the new `PairSet` class? > > Not really. Inside we need lots of access to the SuperWord components (check `alignment`, `stmts_can_pack`, etc). You could always pass in a `SuperWord` reference, but that is not really nicer. > Maybe it would be easier in the future, once I make some other changes. But we will always need to have other information available (dependency graph, types, etc - these are all VLoopAnalyzer submodules), but for now we also rely on `alignment`, which I plan to remove, and lives in SuperWord. Passing a `SuperWord` reference is not wrong per se as `PairSet` is only used together with `SuperWord` and needs to have some information about it. A follow-up question arises if `stmts_can_pack()` and `find_adjacent_refs()` should then not also be part of `PairSet` together with this method. But anyway, since you plan to remove `alignment`, we could get back to this question there. Here, you only move/rename the code, so it's fine to leave it as it is now.
>> src/hotspot/share/opto/superword.cpp line 1232: >> >>> 1230: // use1->in(i) == def1 || use2->in(def2) -> use1->in(i) == def1 && use2->in(def2) >>> 1231: // >>> 1232: // 3: Add/Mul (use1, use2): we can try to swap edges: >> >> Is it not required for other nodes? > > We can only do it for associative nodes like `Mul / Add`, and of course all the nodes that are subclasses of those (e.g. Max, Min, Or, etc). I can improve the comment, since that may not be immediately clear to the reader. Thanks for the explanation. Yes, maybe you can clarify the comment :-) >> src/hotspot/share/opto/superword.cpp line 1363: >> >>> 1361: for (PairSetIterator pair(_pairset); !pair.done(); pair.next()) { >>> 1362: Node* s1 = pair.left(); >>> 1363: Node* s2 = pair.right(); >> >> Maybe you also want to name them `left` and `right` instead of `s1` and `s2`. > > Just specifically here, or everywhere in the code? Good question, we could probably also try to apply this renaming at other places. If it's not too much work, you can try it. >> src/hotspot/share/opto/superword.cpp line 1368: >> >>> 1366: pack->push(s1); >>> 1367: } >>> 1368: pack->push(s2); >> >> We should also assert that at this point, `pack != nullptr`. > > Can do that, but I think it would just result in a nullptr exception anyway. > I always wonder if an assert makes the code more cluttered, so less readable, or if it states expectations more explicitly, which makes the code more readable ? Right, well then you can leave it as is. 
One sometimes tends to be paranoid when it comes to using asserts in C2 :-) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537466361 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537467522 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537468809 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537470307 From thartmann at openjdk.org Mon Mar 25 12:07:27 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 25 Mar 2024 12:07:27 GMT Subject: RFR: 8326438: C2: assert(ld->in(1)->Opcode() == Op_LoadN) failed: Assumption invalid: input to DecodeN is not LoadN [v2] In-Reply-To: References: Message-ID: On Fri, 22 Mar 2024 08:19:28 GMT, Daniel Lundén wrote: >> The [`assert`](https://github.com/openjdk/jdk/blob/6f2676dc5f09d350c359f906b07f6f6d0d17f030/src/hotspot/share/opto/graphKit.cpp#L1567) added in [JDK-8310524](https://bugs.openjdk.org/browse/JDK-8310524) is too strong and may sometimes not hold due to the [GVN transformation in `LoadNode::make`](https://github.com/openjdk/jdk/blob/8cb9b479c529c058aee50f83920db650b0c18045/src/hotspot/share/opto/memnode.cpp#L973). >> >> ### Changeset >> Remove the `assert`. >> >> ### Testing >> N/A > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Elaborate in LoadN comment ------------- Marked as reviewed by thartmann (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/18434#pullrequestreview-1957552109 From dlunden at openjdk.org Mon Mar 25 12:07:28 2024 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Mon, 25 Mar 2024 12:07:28 GMT Subject: Integrated: 8326438: C2: assert(ld->in(1)->Opcode() == Op_LoadN) failed: Assumption invalid: input to DecodeN is not LoadN In-Reply-To: References: Message-ID: On Thu, 21 Mar 2024 15:31:58 GMT, Daniel Lundén wrote: > The [`assert`](https://github.com/openjdk/jdk/blob/6f2676dc5f09d350c359f906b07f6f6d0d17f030/src/hotspot/share/opto/graphKit.cpp#L1567) added in [JDK-8310524](https://bugs.openjdk.org/browse/JDK-8310524) is too strong and may sometimes not hold due to the [GVN transformation in `LoadNode::make`](https://github.com/openjdk/jdk/blob/8cb9b479c529c058aee50f83920db650b0c18045/src/hotspot/share/opto/memnode.cpp#L973). > > ### Changeset > Remove the `assert`. > > ### Testing > N/A This pull request has now been integrated. Changeset: 0c1b254b Author: Daniel Lundén Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/0c1b254be9ddd3883313f80b61229eacf09aa862 Stats: 4 lines in 1 file changed: 2 ins; 0 del; 2 mod 8326438: C2: assert(ld->in(1)->Opcode() == Op_LoadN) failed: Assumption invalid: input to DecodeN is not LoadN Reviewed-by: kvn, rcastanedalo, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/18434 From epeter at openjdk.org Mon Mar 25 12:14:42 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 25 Mar 2024 12:14:42 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v2] In-Reply-To: References: Message-ID: > I'm refactoring the packset, separating the details of packset-manipulation from the SuperWord algorithm. > > Most importantly: I split it into two classes: `PairSet` and `PackSet`. > `combine_pairs_to_longer_packs` converts the first into the second. > > I was able to simplify the combining, and remove the pack-sorting.
> I now walk "pair-chains" directly with `PairSetIterator`. One such pair-chain is equivalent to a pack. > > I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm. > > I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`: > Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. we would pack `ac bd` instead of `ab cd`). Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed, if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)). > > But since I already spent some time on some of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I. > > More details are described in the annotations in the code. 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Apply Christian's suggestions Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18276/files - new: https://git.openjdk.org/jdk/pull/18276/files/bdc57434..e19b112e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18276&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18276&range=00-01 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/18276.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18276/head:pull/18276 PR: https://git.openjdk.org/jdk/pull/18276 From thartmann at openjdk.org Mon Mar 25 12:17:22 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 25 Mar 2024 12:17:22 GMT Subject: RFR: 8321278: C2: Partial peeling fails with assert "last_peel <- first_not_peeled" [v3] In-Reply-To: References: Message-ID: <6RPWTlBusBky6DMC9U0BhbG6doPziznq8RsBHOwSypg=.e5c027ab-3c0c-420e-ad47-801a71523713@github.com> On Fri, 22 Mar 2024 16:48:35 GMT, Roland Westrelin wrote: >> The assert fails because peeling happens at a single entry >> `Region`. That `Region` only has a single input because other inputs >> were found unreachable and removed by >> `PhaseIdealLoop::Dominators()`. The fix I propose is to have >> `PhaseIdealLoop::Dominators()` remove the `Region` and its `Phi`s >> entirely in this case. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/18353#pullrequestreview-1957586830 From epeter at openjdk.org Mon Mar 25 12:26:33 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 25 Mar 2024 12:26:33 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v2] In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 09:39:08 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Apply Christian's suggestions >> >> Co-authored-by: Christian Hagedorn > > src/hotspot/share/opto/superword.cpp line 1304: > >> 1302: >> 1303: // Estimate the savings from executing s1 and s2 as a pack >> 1304: int SuperWord::estimate_cost_savings_when_packing_pair(const Node* s1, const Node* s2) const { > > Nit: Should we add a `as` or `to`? > Suggestion: > > int SuperWord::estimate_cost_savings_when_packing_as_pair(const Node* s1, const Node* s2) const { done > src/hotspot/share/opto/superword.cpp line 1338: > >> 1336: for (DUIterator_Fast kmax, k = s2->fast_outs(kmax); k < kmax; k++) { >> 1337: if (use2 == s2->fast_out(k)) { >> 1338: ct++; > > Maybe add the following as a visual aid? > > s1 s2 > | | > [use1, use2] Sure ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537498286 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537502548 From thartmann at openjdk.org Mon Mar 25 12:30:23 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 25 Mar 2024 12:30:23 GMT Subject: RFR: 8324121: SIGFPE in PhaseIdealLoop::extract_long_range_checks [v2] In-Reply-To: References: Message-ID: On Fri, 22 Mar 2024 16:34:48 GMT, Roland Westrelin wrote: >> In my opinion ABS() should assert that it has legal input (not MIN_INT) and output (non-negative value) in debug builds. > > Thanks for reviewing this. 
> We can't get to this code if `stride_con` is `MIN_INT` because some other condition (that doesn't explicitly check that `stride_con` is not `MIN_INT`) causes a bail out from the transformation. I added an explicit bail out in that case in a new commit anyway to make the code more robust. > In my opinion ABS() should assert that it has legal input (not MIN_INT) and output (non-negative value) in debug builds. I agree and filed [JDK-8328934](https://bugs.openjdk.org/browse/JDK-8328934) for that. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18397#discussion_r1537509551 From duke at openjdk.org Mon Mar 25 12:37:28 2024 From: duke at openjdk.org (RacerZ) Date: Mon, 25 Mar 2024 12:37:28 GMT Subject: RFR: 8324241: Always record evol_method deps to avoid excessive method flushing [v5] In-Reply-To: References: Message-ID: On Fri, 26 Jan 2024 13:14:57 GMT, Volker Simonis wrote: >> Currently we don't record dependencies on redefined methods (i.e. `evol_method` dependencies) in JIT compiled methods if none of the `can_redefine_classes`, `can_retransform_classes` or `can_generate_breakpoint_events` JVMTI capabilities is set. This means that if a JVMTI agent which requests one of these capabilities is dynamically attached, all the methods which have been JIT compiled until that point will be marked for deoptimization and flushed from the code cache. For large, warmed-up applications this means deoptimization and instant recompilation of thousands if not tens of thousands of methods, which can lead to dramatic performance/latency drops for several minutes. 
>> >> One could argue that dynamic agent attach is now deprecated anyway (see [JEP 451: Prepare to Disallow the Dynamic Loading of Agents](https://openjdk.org/jeps/451)) and this problem could be solved by making the recording of `evol_method` dependencies dependent on the new `-XX:+EnableDynamicAgentLoading` flag instead of the concrete JVMTI capabilities (because the presence of the flag indicates that an agent will be loaded eventually). >> >> But there is a single, however important, exception to this rule and that's JFR. JFR is advertised as a low overhead profiler which can be enabled in production at any time. However, when JFR is started dynamically (e.g. through JCMD or JMX) it will silently load a HotSpot internal JVMTI agent which requests the `can_retransform_classes` capability and retransforms some classes. This will inevitably trigger the deoptimization of all compiled methods as described above. >> >> I'd therefore like to propose to *always* and unconditionally record `evol_method` dependencies in JIT compiled code by exporting the relevant properties right at startup in `init_globals()`: >> ```c++ >> jint init_globals() { >> management_init(); >> JvmtiExport::initialize_oop_storage(); >> +#if INCLUDE_JVMTI >> + JvmtiExport::set_can_hotswap_or_post_breakpoint(true); >> + JvmtiExport::set_all_dependencies_are_recorded(true); >> +#endif >> >> >> My measurements indicate that the overhead of doing so is minimal (around 1% increase of nmethod size) and justifies the benefit. E.g. a Spring Petclinic application started with `-Xbatch -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation` compiles about ~11500 methods (~9000 with C1 and ~2500 with C2) resulting in an aggregated nmethod size of around ~40mb. Additionally recording `evol_method` dependencies only increases this size by about 400kb.... 
> > Volker Simonis has updated the pull request incrementally with one additional commit since the last revision: > > Fixed whitepspace in flag documentation I've noticed that the evol_method dependency is closely related to inline optimization. I'm curious if there is a way to extract evol_method information from the results of inline optimization (perhaps nmethod)? ------------- PR Comment: https://git.openjdk.org/jdk/pull/17509#issuecomment-2017904506 From epeter at openjdk.org Mon Mar 25 12:47:46 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 25 Mar 2024 12:47:46 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v3] In-Reply-To: References: Message-ID: > I'm refactoring the packset, separating the details of packset-manipulation from the SuperWord algorithm. > > Most importantly: I split it into two classes: `PairSet` and `PackSet`. > `combine_pairs_to_longer_packs` converts the first into the second. > > I was able to simplify the combining, and remove the pack-sorting. > I now walk "pair-chains" directly with `PairSetIterator`. One such pair-chain is equivalent to a pack. > > I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm. > > I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`: > Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. we would pack `ac bd` instead of `ab cd`). Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed, if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)). 
> > But since I already spent some time on some of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I. > > More details are described in the annotations in the code. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: fixes for Christian ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18276/files - new: https://git.openjdk.org/jdk/pull/18276/files/e19b112e..dcb90a61 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18276&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18276&range=01-02 Stats: 41 lines in 2 files changed: 12 ins; 5 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/18276.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18276/head:pull/18276 PR: https://git.openjdk.org/jdk/pull/18276 From roland at openjdk.org Mon Mar 25 12:52:40 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 25 Mar 2024 12:52:40 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v12] In-Reply-To: References: Message-ID: > This change implements C2 optimizations for calls to > ScopedValue.get(). Indeed, in: > > > v1 = scopedValue.get(); > ... > v2 = scopedValue.get(); > > > `v2` can be replaced by `v1` and the second call to `get()` can be > optimized out. That's true whatever is between the 2 calls unless a > new mapping for `scopedValue` is created in between (when that happens > no optimization is performed for the method being compiled). Hoisting > a `get()` call out of loop for a loop invariant `scopedValue` should > also be legal in most cases. > > `ScopedValue.get()` is implemented in Java code as a 2 step process. A > cache is attached to the current thread object. If the `ScopedValue` > object is in the cache then the result from `get()` is read from > there. Otherwise a slow call is performed that also inserts the > mapping in the cache. The cache itself is lazily allocated. 
One > `ScopedValue` can be hashed to 2 different indexes in the cache. On a > cache probe, both indexes are checked. As a consequence, the process > of probing the cache is a multi step process (check if the cache is > present, check first index, check second index if first index > failed). If the cache is populated early on, then when the method that > calls `ScopedValue.get()` is compiled, profile reports the slow path > as never taken and only the read from the cache is compiled. > > To perform the optimizations, I added 3 new node types to C2: > > - the pair > ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for > the cache probe > > - a cfg node ScopedValueGetResultNode to help locate the result of the > `get()` call in the IR graph. > > In pseudo code, once the nodes are inserted, the code of a `get()` is: > > > hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue) > if (hits_in_the_cache) { > res = ScopedValueGetLoadFromCache(hits_in_the_cache); > } else { > res = ..; //slow call possibly inlined. Subgraph can be arbitrarily complex > } > res = ScopedValueGetResult(res) > > > In the snippet: > > > v1 = scopedValue.get(); > ... > v2 = scopedValue.get(); > > > Replacing `v2` by `v1` is then done by starting from the > `ScopedValueGetResult` node for the second `get()` and looking for a > dominating `ScopedValueGetResult` for the same `ScopedValue` > object. When one is found, it is used as a replacement. Eliminating > the second `get()` call is achieved by making > `ScopedValueGetHitsInCache` always successful if there's a dominating > `ScopedValueGetResult` and replacing its companion > `ScopedValueGetLoadFromCache` by the dominating > `ScopedValueGetResult`. > > Hoisting a `g... Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. 
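The two-index cache probe described above can be sketched in plain Java. This is a hypothetical illustration only: the names (`TwoProbeCache`, `primaryIndex`, `secondaryIndex`) are invented, and in the JDK the cache is a per-thread `Object[]` keyed on the `ScopedValue` identity rather than a standalone class. The fast paths are what C2 compiles when profiling reports the slow path as never taken.

```java
// Hypothetical sketch of the two-probe cache that ScopedValue.get()
// uses: a key can live at one of two candidate slots, and only a miss
// at both slots takes the slow path (which also inserts the mapping).
final class TwoProbeCache {
    private final Object[] cache; // even slots: keys, odd slots: values

    TwoProbeCache(int slots) { // slots must be a power of two
        cache = new Object[slots * 2];
    }

    private int primaryIndex(Object key) {
        return (key.hashCode() & (cache.length / 2 - 1)) * 2;
    }

    private int secondaryIndex(Object key) {
        return ((key.hashCode() >>> 16) & (cache.length / 2 - 1)) * 2;
    }

    Object get(Object key, java.util.function.Function<Object, Object> slowPath) {
        int i1 = primaryIndex(key);
        if (cache[i1] == key) {
            return cache[i1 + 1];           // fast path: first probe hits
        }
        int i2 = secondaryIndex(key);
        if (cache[i2] == key) {
            return cache[i2 + 1];           // fast path: second probe hits
        }
        Object value = slowPath.apply(key); // slow path: compute and insert
        cache[i2] = key;
        cache[i2 + 1] = value;
        return value;
    }
}
```

In this picture, the `ScopedValueGetHitsInCache`/`ScopedValueGetLoadFromCache` pair models the two `if` probes, and a dominating `ScopedValueGetResult` lets C2 treat a later `get()` of the same key as an unconditional hit.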
The pull request now contains 19 commits: - test fix - Merge branch 'master' into JDK-8320649 - whitespaces - review - Merge branch 'master' into JDK-8320649 - review - 32 bit build fix - fix & test - Merge branch 'master' into JDK-8320649 - review - ... and 9 more: https://git.openjdk.org/jdk/compare/784f11c3...3f312f8f ------------- Changes: https://git.openjdk.org/jdk/pull/16966/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16966&range=11 Stats: 2648 lines in 39 files changed: 2579 ins; 29 del; 40 mod Patch: https://git.openjdk.org/jdk/pull/16966.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16966/head:pull/16966 PR: https://git.openjdk.org/jdk/pull/16966 From roland at openjdk.org Mon Mar 25 12:52:40 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 25 Mar 2024 12:52:40 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> <0RKnLUgc6UBtyxSyezCMWsSbP50hu6fQ6UJPHpGlgSU=.9fafa10f-62ee-4ec8-9093-4e204fcbe504@github.com> Message-ID: On Thu, 21 Mar 2024 13:11:20 GMT, Christian Hagedorn wrote: > You can try to use `TestFramework::assertDeoptimizedByC2()` which skips the assertion for some unstable setups like having `PerMethodTrapLimit == 0`: Thanks for the suggestion! I used it to fix the test. @eme64 would you mind re-running tests? ------------- PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-2017930933 From ihse at openjdk.org Mon Mar 25 12:57:30 2024 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Mon, 25 Mar 2024 12:57:30 GMT Subject: RFR: 8328614: hsdis: dlsym can't find decode symbol [v2] In-Reply-To: References: Message-ID: <6ICXEnUGz2AY7gH0nCQq7WrV8wrKnFHyVLfrtzjcz3Y=.25891693-6822-4af5-b6ed-ee7034371898@github.com> On Thu, 21 Mar 2024 06:58:43 GMT, Robbin Ehn wrote: >> Hi, please consider. >> >> [8327045](https://bugs.openjdk.org/browse/JDK-8327045) hides these symbols. 
>> Tested with gcc and clang, and llvm and binutils backend. >> >> I didn't find any use of the "DLL_ENTRY", so I removed it. >> >> Thanks, Robbin > Robbin Ehn has updated the pull request incrementally with one additional commit since the last revision: > > remove swap file It is original code in the JDK library. zlib is copied in from elsewhere. That is what I mean by third party code. The only reason hsdis is specially treated is due to the fact that the original implementation was based on binutils, which could not -- for legal reasons -- be distributed as a compiled library alongside the JDK. That forced us to erect an artificial barrier. There are multiple JBS issues trying to rectify this, with different approaches, to get a non-binutils hsdis library built and shipped as any other native library in the JDK. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18400#issuecomment-2017939559 From roland at openjdk.org Mon Mar 25 12:57:31 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 25 Mar 2024 12:57:31 GMT Subject: RFR: 8324121: SIGFPE in PhaseIdealLoop::extract_long_range_checks [v2] In-Reply-To: References: Message-ID: On Fri, 22 Mar 2024 16:53:57 GMT, Vladimir Kozlov wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> review > > Good. 
Thanks @vnkozlov @chhagedorn for the reviews ------------- PR Comment: https://git.openjdk.org/jdk/pull/18397#issuecomment-2017936214 From roland at openjdk.org Mon Mar 25 12:57:33 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 25 Mar 2024 12:57:33 GMT Subject: Integrated: 8324121: SIGFPE in PhaseIdealLoop::extract_long_range_checks In-Reply-To: References: Message-ID: <0doXLyh5uNY0Ke6-TCwvZ2BAvZI-FAips5uflvXDBUA=.830194a3-8796-453d-9d7d-95ad61a8925d@github.com> On Wed, 20 Mar 2024 12:17:03 GMT, Roland Westrelin wrote: > Both failures occur because `ABS(scale * stride_con)` overflows (scale > a really large long number). I reworked the test so overflow is no > longer an issue. This pull request has now been integrated. Changeset: cb2a6713 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/cb2a6713596548d76c03912709656172b0bbcc76 Stats: 70 lines in 3 files changed: 61 ins; 0 del; 9 mod 8324121: SIGFPE in PhaseIdealLoop::extract_long_range_checks Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/18397 From epeter at openjdk.org Mon Mar 25 12:58:55 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 25 Mar 2024 12:58:55 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v4] In-Reply-To: References: Message-ID: > I'm refactoring the packset, separating the details of packset-manipulation from the SuperWord algorithm. > > Most importantly: I split it into two classes: `PairSet` and `PackSet`. > `combine_pairs_to_longer_packs` converts the first into the second. > > I was able to simplify the combining, and remove the pack-sorting. > I now walk "pair-chains" directly with `PairSetIterator`. One such pair-chain is equivalent to a pack. > > I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm. 
> > I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`: > Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. we would pack `ac bd` instead of `ab cd`). Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed, if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)). > > But since I already spent some time on some of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I. > > More details are described in the annotations in the code. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: use left/right instead of s1/s2 in some obvious simple places ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18276/files - new: https://git.openjdk.org/jdk/pull/18276/files/dcb90a61..d4136bba Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18276&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18276&range=02-03 Stats: 18 lines in 1 file changed: 0 ins; 0 del; 18 mod Patch: https://git.openjdk.org/jdk/pull/18276.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18276/head:pull/18276 PR: https://git.openjdk.org/jdk/pull/18276 From epeter at openjdk.org Mon Mar 25 12:58:55 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 25 Mar 2024 12:58:55 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v4] In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 11:55:07 GMT, Christian Hagedorn wrote: >> Just specifically here, or everywhere in the code? 
> > Good question, we could probably also try to apply this renaming at other places. If it's not too much work, you can try it. I did it in a few places, but not everywhere. Sometimes it is just easier to talk about `s1, s2` and `use1, use2` and `def1, def2` with numbers, rather than the much longer names using `left/right`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1537552173 From epeter at openjdk.org Mon Mar 25 13:03:33 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 25 Mar 2024 13:03:33 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v4] In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 10:23:52 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> use left/right instead of s1/s2 in some obvious simple places > > Nice refactoring. Here are some first comments about the pair set. Will continue later. > > Just a side note: Would it have been possible to split this RFE into a pair set and a pack set refactoring separately without big efforts due to dependencies between them? It might simplify the review. @chhagedorn I think I have addressed all your remarks. I agree I could have probably split the `PairSet` and `PackSet` into separate RFE's, that only became apparent once I did the work, and that there is not really any shared code between the two. Though I'm not sure it is worth splitting this up now, after all they are really about the `_packset`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18276#issuecomment-2017954018 From roland at openjdk.org Mon Mar 25 13:32:38 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 25 Mar 2024 13:32:38 GMT Subject: RFR: 8321278: C2: Partial peeling fails with assert "last_peel <- first_not_peeled" [v4] In-Reply-To: References: Message-ID: > The assert fails because peeling happens at a single entry > `Region`. 
That `Region` only has a single input because other inputs > were found unreachable and removed by > `PhaseIdealLoop::Dominators()`. The fix I propose is to have > `PhaseIdealLoop::Dominators()` remove the `Region` and its `Phi`s > entirely in this case. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18353/files - new: https://git.openjdk.org/jdk/pull/18353/files/d201aee2..8fbb6f96 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18353&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18353&range=02-03 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18353.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18353/head:pull/18353 PR: https://git.openjdk.org/jdk/pull/18353 From roland at openjdk.org Mon Mar 25 13:32:38 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 25 Mar 2024 13:32:38 GMT Subject: RFR: 8321278: C2: Partial peeling fails with assert "last_peel <- first_not_peeled" [v4] In-Reply-To: References: Message-ID: <6mhAMF5bEe_RKjjCUDCtsGVp-6auKj1oq3ivN0GOUcQ=.c60c64df-d8df-4b8a-9301-da73d0f791f1@github.com> On Mon, 25 Mar 2024 07:40:21 GMT, Christian Hagedorn wrote: > I see, thanks for the explanation. Then it makes sense to handle this edge-case like you proposed to keep things simple. Maybe you can add a comment accordingly why we remove the region at this point. Sure. I added a comment. Let me know if it looks ok to you. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18353#discussion_r1537602672 From dlunden at openjdk.org Mon Mar 25 13:33:34 2024 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Mon, 25 Mar 2024 13:33:34 GMT Subject: RFR: 8323682: C2: guard check is not generated in Arrays.copyOfRange intrinsic when allocation is eliminated by EA Message-ID: The library intrinsic `_copyOfRange` does not add a guard for start indices that are larger than the length of the source array. Macro expansion of `ArrayCopy` nodes later adds such a guard, but in certain situations escape analysis may result in removing the `ArrayCopy` node before it is expanded. The result is incorrect behavior of the compiled program (as the missing guard may have relevant side effects, such as throwing an exception). ### Changeset - Add the missing guard (start index <= source array length). - Remove an unnecessary guard (end index >= 0) that holds as a result of the other guards. The updated set of guards then more closely follows the `copyOfRange` [Java API documentation](https://docs.oracle.com/en/java/javase/22/docs/api/java.base/java/util/Arrays.html#copyOfRange(U[],int,int,java.lang.Class)). - Add a regression test. ### Testing - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/8388044152) - tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. 
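The guard semantics being restored above follow directly from the `Arrays.copyOfRange` specification: a start index greater than the source length must throw `ArrayIndexOutOfBoundsException` even when the copied array is otherwise unused. A small stand-alone illustration (independent of this patch; the class name is invented):

```java
import java.util.Arrays;

public class CopyOfRangeGuards {
    public static void main(String[] args) {
        int[] src = {1, 2, 3};

        // 'to' past the end is legal: the result is zero-padded.
        int[] padded = Arrays.copyOfRange(src, 1, 5);
        if (!Arrays.equals(padded, new int[] {2, 3, 0, 0})) {
            throw new AssertionError("unexpected padding");
        }

        // 'from' > src.length must throw ArrayIndexOutOfBoundsException.
        // This exception is the observable side effect that was lost when
        // escape analysis removed the ArrayCopy node before macro expansion.
        try {
            Arrays.copyOfRange(src, 4, 6);
            throw new AssertionError("expected ArrayIndexOutOfBoundsException");
        } catch (ArrayIndexOutOfBoundsException expected) {
            // guard behavior preserved
        }
    }
}
```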
------------- Commit messages: - Grammar fix - Fix Changes: https://git.openjdk.org/jdk/pull/18472/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18472&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8323682 Stats: 73 lines in 3 files changed: 65 ins; 2 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/18472.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18472/head:pull/18472 PR: https://git.openjdk.org/jdk/pull/18472 From epeter at openjdk.org Mon Mar 25 13:36:26 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 25 Mar 2024 13:36:26 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> <0RKnLUgc6UBtyxSyezCMWsSbP50hu6fQ6UJPHpGlgSU=.9fafa10f-62ee-4ec8-9093-4e204fcbe504@github.com> Message-ID: On Mon, 25 Mar 2024 12:49:47 GMT, Roland Westrelin wrote: >>> I see these two extra flag combinations lead to failures: >>> >>> `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` >>> >>> `-server -Xcomp` >> >> You can try to use `TestFramework::assertDeoptimizedByC2()` which skips the assertion for some unstable setups like having `PerMethodTrapLimit == 0`: >> https://github.com/openjdk/jdk/blob/700d2b91defd421a2818f53830c24f70d11ba4f6/test/hotspot/jtreg/compiler/lib/ir_framework/test/TestVM.java#L943-L956 > >> You can try to use `TestFramework::assertDeoptimizedByC2()` which skips the assertion for some unstable setups like having `PerMethodTrapLimit == 0`: > > Thanks for the suggestion! I used it to fix the test. > @eme64 would you mind re-running tests? @rwestrel Great, yes just launched it. Feel free to ask in a day or 2 if I don't report back by then! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-2018019461 From chagedorn at openjdk.org Mon Mar 25 13:48:27 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 25 Mar 2024 13:48:27 GMT Subject: RFR: 8321278: C2: Partial peeling fails with assert "last_peel <- first_not_peeled" [v4] In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 13:32:38 GMT, Roland Westrelin wrote: >> The assert fails because peeling happens at a single entry >> `Region`. That `Region` only has a single input because other inputs >> were found unreachable and removed by >> `PhaseIdealLoop::Dominators()`. The fix I propose is to have >> `PhaseIdealLoop::Dominators()` remove the `Region` and its `Phi`s >> entirely in this case. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > comment Looks good, thanks for the update! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18353#pullrequestreview-1957796046 From roland at openjdk.org Mon Mar 25 13:48:27 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 25 Mar 2024 13:48:27 GMT Subject: RFR: 8321278: C2: Partial peeling fails with assert "last_peel <- first_not_peeled" [v4] In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 13:43:20 GMT, Christian Hagedorn wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> comment > > Looks good, thanks for the update! @chhagedorn @TobiHartmann thanks for the reviews! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18353#issuecomment-2018039394 From roland at openjdk.org Mon Mar 25 13:48:29 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 25 Mar 2024 13:48:29 GMT Subject: Integrated: 8321278: C2: Partial peeling fails with assert "last_peel <- first_not_peeled" In-Reply-To: References: Message-ID: On Mon, 18 Mar 2024 17:15:38 GMT, Roland Westrelin wrote: > The assert fails because peeling happens at a single entry > `Region`. That `Region` only has a single input because other inputs > were found unreachable and removed by > `PhaseIdealLoop::Dominators()`. The fix I propose is to have > `PhaseIdealLoop::Dominators()` remove the `Region` and its `Phi`s > entirely in this case. This pull request has now been integrated. Changeset: af15c68f Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/af15c68f3ccb72537b0a60d942f12d600f13ebb6 Stats: 82 lines in 2 files changed: 81 ins; 0 del; 1 mod 8321278: C2: Partial peeling fails with assert "last_peel <- first_not_peeled" Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/18353 From chagedorn at openjdk.org Mon Mar 25 13:52:29 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 25 Mar 2024 13:52:29 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> <0RKnLUgc6UBtyxSyezCMWsSbP50hu6fQ6UJPHpGlgSU=.9fafa10f-62ee-4ec8-9093-4e204fcbe504@github.com> Message-ID: On Mon, 25 Mar 2024 12:49:47 GMT, Roland Westrelin wrote: > > You can try to use `TestFramework::assertDeoptimizedByC2()` which skips the assertion for some unstable setups like having `PerMethodTrapLimit == 0`: > > Thanks for the suggestion! I used it to fix the test. @eme64 would you mind re-running tests? Minor detail: You should use `TestFramework::assertDeoptimizedByC2()` instead of `TestVM::assertDeoptimizedByC2()`. 
`TestVM` should only be called internally by the framework. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-2018050858 From roland at openjdk.org Mon Mar 25 14:19:57 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 25 Mar 2024 14:19:57 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v13] In-Reply-To: References: Message-ID: > This change implements C2 optimizations for calls to > ScopedValue.get(). Indeed, in: > > > v1 = scopedValue.get(); > ... > v2 = scopedValue.get(); > > > `v2` can be replaced by `v1` and the second call to `get()` can be > optimized out. That's true whatever is between the 2 calls unless a > new mapping for `scopedValue` is created in between (when that happens > no optimization is performed for the method being compiled). Hoisting > a `get()` call out of loop for a loop invariant `scopedValue` should > also be legal in most cases. > > `ScopedValue.get()` is implemented in Java code as a 2 step process. A > cache is attached to the current thread object. If the `ScopedValue` > object is in the cache then the result from `get()` is read from > there. Otherwise a slow call is performed that also inserts the > mapping in the cache. The cache itself is lazily allocated. One > `ScopedValue` can be hashed to 2 different indexes in the cache. On a > cache probe, both indexes are checked. As a consequence, the process > of probing the cache is a multi step process (check if the cache is > present, check first index, check second index if first index > failed). If the cache is populated early on, then when the method that > calls `ScopedValue.get()` is compiled, profile reports the slow path > as never taken and only the read from the cache is compiled. 
>
> To perform the optimizations, I added 3 new node types to C2:
>
> - the pair
> ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for
> the cache probe
>
> - a cfg node ScopedValueGetResultNode to help locate the result of the
> `get()` call in the IR graph.
>
> In pseudo code, once the nodes are inserted, the code of a `get()` is:
>
>
> hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
> if (hits_in_the_cache) {
>     res = ScopedValueGetLoadFromCache(hits_in_the_cache);
> } else {
>     res = ..; // slow call possibly inlined. Subgraph can be arbitrarily complex
> }
> res = ScopedValueGetResult(res)
>
>
> In the snippet:
>
>
> v1 = scopedValue.get();
> ...
> v2 = scopedValue.get();
>
>
> Replacing `v2` by `v1` is then done by starting from the
> `ScopedValueGetResult` node for the second `get()` and looking for a
> dominating `ScopedValueGetResult` for the same `ScopedValue`
> object. When one is found, it is used as a replacement. Eliminating
> the second `get()` call is achieved by making
> `ScopedValueGetHitsInCache` always successful if there's a dominating
> `ScopedValueGetResult` and replacing its companion
> `ScopedValueGetLoadFromCache` by the dominating
> `ScopedValueGetResult`.
>
> Hoisting a `g...
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: test fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16966/files - new: https://git.openjdk.org/jdk/pull/16966/files/3f312f8f..142ca630 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16966&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16966&range=11-12 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/16966.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16966/head:pull/16966 PR: https://git.openjdk.org/jdk/pull/16966 From roland at openjdk.org Mon Mar 25 14:19:57 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 25 Mar 2024 14:19:57 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> <0RKnLUgc6UBtyxSyezCMWsSbP50hu6fQ6UJPHpGlgSU=.9fafa10f-62ee-4ec8-9093-4e204fcbe504@github.com> Message-ID: On Mon, 25 Mar 2024 13:49:57 GMT, Christian Hagedorn wrote: > Minor detail: You should use `TestFramework::assertDeoptimizedByC2()` instead of `TestVM::assertDeoptimizedByC2()`. `TestVM` should only be called internally by the framework. Thanks for checking the change. I fixed it in the new commit. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-2018106839 From sgibbons at openjdk.org Mon Mar 25 14:39:59 2024 From: sgibbons at openjdk.org (Scott Gibbons) Date: Mon, 25 Mar 2024 14:39:59 GMT Subject: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v15] In-Reply-To: References: Message-ID: > Re-write the IndexOf code without the use of the pcmpestri instruction, only using AVX2 instructions. This change accelerates String.IndexOf on average 1.3x for AVX2. 
The benchmark numbers:
>
>
> Benchmark                               Score      Latest
> StringIndexOf.advancedWithMediumSub     343.573    317.934    0.925375393x
> StringIndexOf.advancedWithShortSub1     1039.081   1053.96    1.014319384x
> StringIndexOf.advancedWithShortSub2     55.828     110.541    1.980027943x
> StringIndexOf.constantPattern           9.361      11.906     1.271872663x
> StringIndexOf.searchCharLongSuccess     4.216      4.218      1.000474383x
> StringIndexOf.searchCharMediumSuccess   3.133      3.216      1.02649218x
> StringIndexOf.searchCharShortSuccess    3.76       3.761      1.000265957x
> StringIndexOf.success                   9.186      9.713      1.057369911x
> StringIndexOf.successBig                14.341     46.343     3.231504079x
> StringIndexOfChar.latin1_AVX2_String    6220.918   12154.52   1.953814533x
> StringIndexOfChar.latin1_AVX2_char      5503.556   5540.044   1.006629895x
> StringIndexOfChar.latin1_SSE4_String    6978.854   6818.689   0.977049957x
> StringIndexOfChar.latin1_SSE4_char      5657.499   5474.624   0.967675646x
> StringIndexOfChar.latin1_Short_String   7132.541   6863.359   0.962260014x
> StringIndexOfChar.latin1_Short_char     16013.389  16162.437  1.009307711x
> StringIndexOfChar.latin1_mixed_String   7386.123   14771.622  1.999915517x
> StringIndexOfChar.latin1_mixed_char     9901.671   9782.245   0.987938803x

Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision: Remove infinite loop (used for debugging) ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16753/files - new: https://git.openjdk.org/jdk/pull/16753/files/e079fc12..1cd1b501 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16753&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16753&range=13-14 Stats: 12 lines in 1 file changed: 0 ins; 2 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/16753.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16753/head:pull/16753 PR: https://git.openjdk.org/jdk/pull/16753 From thartmann at openjdk.org Mon Mar 25 14:43:22 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 25 Mar 2024 14:43:22 GMT Subject: RFR: 8323682: C2: guard
check is not generated in Arrays.copyOfRange intrinsic when allocation is eliminated by EA In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 13:28:32 GMT, Daniel Lundén wrote: > The library intrinsic `_copyOfRange` does not add a guard for start indices that are larger than the length of the source arrays. Macro expansion of `ArrayCopy` nodes later adds such a guard, but in certain situations escape analysis may result in removing the `ArrayCopy` node before it is expanded. The result is incorrect behavior of the compiled program (as the missing guard may have relevant side effects, such as throwing an exception). > > ### Changeset > > - Add the missing guard (start index <= source array length). > - Remove an unnecessary guard (end index >= 0) that holds as a result of the other guards. The updated set of guards then more closely follows the `copyOfRange` [Java API documentation](https://docs.oracle.com/en/java/javase/22/docs/api/java.base/java/util/Arrays.html#copyOfRange(U[],int,int,java.lang.Class)). > - Add a regression test. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/8388044152) > - tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. That looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18472#pullrequestreview-1957933626 From dlunden at openjdk.org Mon Mar 25 14:49:23 2024 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Mon, 25 Mar 2024 14:49:23 GMT Subject: RFR: 8323682: C2: guard check is not generated in Arrays.copyOfRange intrinsic when allocation is eliminated by EA In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 13:28:32 GMT, Daniel Lundén wrote: > The library intrinsic `_copyOfRange` does not add a guard for start indices that are larger than the length of the source arrays.
Macro expansion of `ArrayCopy` nodes later adds such a guard, but in certain situations escape analysis may result in removing the `ArrayCopy` node before it is expanded. The result is incorrect behavior of the compiled program (as the missing guard may have relevant side effects, such as throwing an exception). > > ### Changeset > > - Add the missing guard (start index <= source array length). > - Remove an unnecessary guard (end index >= 0) that holds as a result of the other guards. The updated set of guards then more closely follows the `copyOfRange` [Java API documentation](https://docs.oracle.com/en/java/javase/22/docs/api/java.base/java/util/Arrays.html#copyOfRange(U[],int,int,java.lang.Class)). > - Add a regression test. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/8388044152) > - tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. src/hotspot/share/opto/library_call.cpp line 4396: > 4394: newcopy = new_array(klass_node, length, 0); // no arguments to push > 4395: > 4396: ArrayCopyNode* ac = ArrayCopyNode::make(this, true, original, start, newcopy, intcon(0), moved, true, true, Note: we can now specify `true` for the argument `has_negative_length_guard`. src/hotspot/share/opto/macroArrayCopy.cpp line 1269: > 1267: adr_type, T_OBJECT, > 1268: src, src_offset, dest, dest_offset, length, > 1269: true, !ac->is_copyofrange()); We cannot use this anymore, since we then ignore the `has_negative_length_guard` for `copyOfRange` (and generate the negative length guard twice). 
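For reference, the guard being added mirrors behavior that `Arrays.copyOfRange` specifies at the Java level: a start index greater than the source length must throw `ArrayIndexOutOfBoundsException`, while an end index past the source length is legal and zero-pads. A quick check of those documented edge cases (this illustrates the library specification, not the C2 intrinsic itself):

```java
import java.util.Arrays;

public class CopyOfRangeEdgeCases {
    public static void main(String[] args) {
        int[] src = {1, 2, 3};

        // from > original.length: must throw ArrayIndexOutOfBoundsException
        try {
            Arrays.copyOfRange(src, 5, 7);
            throw new AssertionError("expected ArrayIndexOutOfBoundsException");
        } catch (ArrayIndexOutOfBoundsException expected) { }

        // from > to: must throw IllegalArgumentException
        try {
            Arrays.copyOfRange(src, 2, 1);
            throw new AssertionError("expected IllegalArgumentException");
        } catch (IllegalArgumentException expected) { }

        // to > original.length is fine: the result is zero-padded
        System.out.println(Arrays.toString(Arrays.copyOfRange(src, 1, 5))); // [2, 3, 0, 0]
    }
}
```

The bug under discussion made compiled code skip the first of these checks when escape analysis removed the `ArrayCopy` node before macro expansion could add the guard.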
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18472#discussion_r1537712946 PR Review Comment: https://git.openjdk.org/jdk/pull/18472#discussion_r1537716349 From kxu at openjdk.org Mon Mar 25 16:56:52 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Mon, 25 Mar 2024 16:56:52 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v5] In-Reply-To: References: Message-ID: > This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) > > Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. > > New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 10 additional commits since the last revision: - update test @run annotation - improve formatting, correct annotation and rename test class - Merge branch 'master' into boolnode-refactor - update the package name for tests - modification per code review suggestions - fix test by adding the missing inversion also excluding negative values for unsigned comparison - add license header - also test for correctness - exclude x86 from tests - refactor (x & m) u<= m transformation and add test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18198/files - new: https://git.openjdk.org/jdk/pull/18198/files/e2eb8bf9..47d0172b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=03-04 Stats: 529499 lines in 6190 files changed: 68787 ins; 113225 del; 347487 mod Patch: https://git.openjdk.org/jdk/pull/18198.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18198/head:pull/18198 PR: https://git.openjdk.org/jdk/pull/18198 From kxu at openjdk.org Mon Mar 25 16:56:52 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Mon, 25 Mar 2024 16:56:52 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v4] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 02:05:39 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. 
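The identity behind the transformation is that AND can only clear bits, so `(x & m)` can never exceed `m` as an unsigned value. A brute-force sanity check of that claim from plain Java, independent of the IR test framework:

```java
import java.util.Random;

public class MaskedUnsignedLeCheck {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int x = rnd.nextInt();
            int m = rnd.nextInt();
            // (x & m) u<= m and (m & x) u<= m must always hold
            if (Integer.compareUnsigned(x & m, m) > 0
                    || Integer.compareUnsigned(m & x, m) > 0) {
                throw new AssertionError("identity violated for x=" + x + ", m=" + m);
            }
        }
        System.out.println("(x & m) u<= m held for all sampled pairs");
    }
}
```

Because the comparison is always `true`, folding it belongs in `BoolNode::Value` (type narrowing) rather than in `Ideal` (graph rewriting), which is exactly what the PR does.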
>
> Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision:
>
> update the package name for tests

Sorry for not following up closely. The PR has been updated per the review suggestions. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18198#issuecomment-2018465455 From kxu at openjdk.org Mon Mar 25 16:56:52 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Mon, 25 Mar 2024 16:56:52 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v4] In-Reply-To: References: Message-ID: On Wed, 20 Mar 2024 12:39:59 GMT, Emanuel Peter wrote:
>> Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision:
>>
>> update the package name for tests
>
> test/hotspot/jtreg/compiler/c2/irTests/TestBoolNodeGvn.java line 58:
>
>> 56: & !(Integer.compareUnsigned((m & x), m) > 0)
>> 57: & Integer.compareUnsigned((x & m), m + 1) < 0
>> 58: & Integer.compareUnsigned((m & x), m + 1) < 0;
>
> For easier reading, I would have put the `&` at the end of the line.
> Btw: is this supposed to be a bitwise or a binary and?

Done. I'm intentionally using bitwise `&` here so that the compiler doesn't short-circuit on conditions with `&&`. We wish to test all terms for correctness. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1537912385 From kvn at openjdk.org Mon Mar 25 16:58:29 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 25 Mar 2024 16:58:29 GMT Subject: RFR: 8323682: C2: guard check is not generated in Arrays.copyOfRange intrinsic when allocation is eliminated by EA In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 13:28:32 GMT, Daniel Lundén wrote: > The library intrinsic `_copyOfRange` does not add a guard for start indices that are larger than the length of the source arrays.
Macro expansion of `ArrayCopy` nodes later adds such a guard, but in certain situations escape analysis may result in removing the `ArrayCopy` node before it is expanded. The result is incorrect behavior of the compiled program (as the missing guard may have relevant side effects, such as throwing an exception). > > ### Changeset > > - Add the missing guard (start index <= source array length). > - Remove an unnecessary guard (end index >= 0) that holds as a result of the other guards. The updated set of guards then more closely follows the `copyOfRange` [Java API documentation](https://docs.oracle.com/en/java/javase/22/docs/api/java.base/java/util/Arrays.html#copyOfRange(U[],int,int,java.lang.Class)). > - Add a regression test. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/8388044152) > - tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. src/hotspot/share/opto/library_call.cpp line 4323: > 4321: // Bail out if either start or end is negative. > 4322: generate_negative_guard(start, bailout, &start); > 4323: generate_negative_guard(end, bailout, &end); I think we need this check to avoid underflow in integer expression (end - start): (min_int - 1) == max_int ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18472#discussion_r1537911056 From kxu at openjdk.org Mon Mar 25 20:10:53 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Mon, 25 Mar 2024 20:10:53 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v6] In-Reply-To: References: Message-ID: > This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) > > Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. 
However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. > > New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: also renames the class name in @run ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18198/files - new: https://git.openjdk.org/jdk/pull/18198/files/47d0172b..02c01edf Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=04-05 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18198.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18198/head:pull/18198 PR: https://git.openjdk.org/jdk/pull/18198 From dlong at openjdk.org Mon Mar 25 22:25:30 2024 From: dlong at openjdk.org (Dean Long) Date: Mon, 25 Mar 2024 22:25:30 GMT Subject: RFR: 8324121: SIGFPE in PhaseIdealLoop::extract_long_range_checks [v2] In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 12:27:51 GMT, Tobias Hartmann wrote: >> Thanks for reviewing this. >> We can't get to this code if `stride_con` is `MIN_INT` because some other condition (that doesn't explicitly check that `stride_con` is not `MIN_INT`) causes a bail out from the transformation. I added an explicit bail out in that case in a new commit anyway to make the code more robust. > >> In my opinion ABS() should assert that it has legal input (not MIN_INT) and output (non-negative value) in debug builds. > > I agree and filed [JDK-8328934](https://bugs.openjdk.org/browse/JDK-8328934) for that. Unfortunately, there is still an overflow here when `scale` is min_jlong. 
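The overflow being discussed is the usual two's-complement asymmetry: the most negative value has no positive counterpart, so a naive absolute value wraps back to itself. C2's `ABS` is C++ code, but the same effect is easy to see from Java (`Math.absExact` is a JDK 15+ API):

```java
public class AbsOverflowDemo {
    public static void main(String[] args) {
        // abs(MIN_VALUE) silently overflows back to MIN_VALUE
        System.out.println(Math.abs(Integer.MIN_VALUE) == Integer.MIN_VALUE); // true
        System.out.println(Math.abs(Long.MIN_VALUE) == Long.MIN_VALUE);       // true

        // absExact makes the overflow explicit instead of silent
        try {
            Math.absExact(Long.MIN_VALUE);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```

This is why an `ABS()` implementation that asserts on `MIN_INT`/`min_jlong` input, as suggested above, catches such cases in debug builds instead of silently producing a negative "absolute" value.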
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18397#discussion_r1538303770 From jiefu at openjdk.org Tue Mar 26 02:52:32 2024 From: jiefu at openjdk.org (Jie Fu) Date: Tue, 26 Mar 2024 02:52:32 GMT Subject: RFR: 8329012: IGV: Update required JDK version in README.md Message-ID: While building IGV, I found the JDK version described in the README.md is incorrect. According to the pom.xml: https://github.com/openjdk/jdk/blob/master/src/utils/IdealGraphVisualizer/pom.xml#L82 It should be updated to "between 17 and 21". Thanks. Best regards, Jie ------------- Commit messages: - 8329012: IGV: Update required JDK version in README.md Changes: https://git.openjdk.org/jdk/pull/18481/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18481&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8329012 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18481.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18481/head:pull/18481 PR: https://git.openjdk.org/jdk/pull/18481 From ksakata at openjdk.org Tue Mar 26 04:58:21 2024 From: ksakata at openjdk.org (Koichi Sakata) Date: Tue, 26 Mar 2024 04:58:21 GMT Subject: RFR: 8323242: Remove vestigial DONT_USE_REGISTER_DEFINES In-Reply-To: References: Message-ID: <7wr0hcCKT_Wbgt02FrfedGp1yKTvv1tTZprSvitoGcA=.1b9d1806-af6d-4a43-aca2-a9bed6934404@github.com> On Tue, 5 Mar 2024 08:07:19 GMT, Koichi Sakata wrote: > This pull request removes an unnecessary directive. > > There is no definition of DONT_USE_REGISTER_DEFINES in HotSpot or the build system, so this `#ifndef` conditional directive is always true. We can remove it. > > I built OpenJDK with Zero VM as a test. It was successful.
> > > $ ./configure --with-jvm-variants=zero --enable-debug > $ make images > $ ./build/macosx-aarch64-zero-fastdebug/jdk/bin/java -version > openjdk version "23-internal" 2024-09-17 > OpenJDK Runtime Environment (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk) > OpenJDK 64-Bit Zero VM (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk, interpreted mode) > > > It may be possible to remove the `#define noreg` as well because the CONSTANT_REGISTER_DECLARATION macro creates a variable named noreg, but I can't be sure. When I tried removing the noreg definition and building the OpenJDK, the build was successful. PING: Could someone please review this pull request? I'd like to only focus on removing the `#ifndef` conditional directive in this PR, since considering whether to remove the `#define noreg` definition is outside of the scope of the JBS issue. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18115#issuecomment-2019389884 From rcastanedalo at openjdk.org Tue Mar 26 05:14:25 2024 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 26 Mar 2024 05:14:25 GMT Subject: RFR: 8329012: IGV: Update required JDK version in README.md In-Reply-To: References: Message-ID: On Tue, 26 Mar 2024 02:48:35 GMT, Jie Fu wrote: > While building IGV, I found the jdk version described in the README.md is incorrect. > > According to the pom.xml: https://github.com/openjdk/jdk/blob/master/src/utils/IdealGraphVisualizer/pom.xml#L82 > It should be updated to "between 17 and 21". > > Thanks. > Best regards, > Jie Thanks for fixing this! You can consider this changeset trivial. ------------- Marked as reviewed by rcastanedalo (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/18481#pullrequestreview-1959328600 From jiefu at openjdk.org Tue Mar 26 06:04:27 2024 From: jiefu at openjdk.org (Jie Fu) Date: Tue, 26 Mar 2024 06:04:27 GMT Subject: RFR: 8329012: IGV: Update required JDK version in README.md In-Reply-To: References: Message-ID: On Tue, 26 Mar 2024 05:11:23 GMT, Roberto Castañeda Lozano wrote: >> While building IGV, I found the jdk version described in the README.md is incorrect. >> >> According to the pom.xml: https://github.com/openjdk/jdk/blob/master/src/utils/IdealGraphVisualizer/pom.xml#L82 >> It should be updated to "between 17 and 21". >> >> Thanks. >> Best regards, >> Jie > > Thanks for fixing this! You can consider this changeset trivial. Thanks @robcasloz for the review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18481#issuecomment-2019467113 From jiefu at openjdk.org Tue Mar 26 06:04:27 2024 From: jiefu at openjdk.org (Jie Fu) Date: Tue, 26 Mar 2024 06:04:27 GMT Subject: Integrated: 8329012: IGV: Update required JDK version in README.md In-Reply-To: References: Message-ID: On Tue, 26 Mar 2024 02:48:35 GMT, Jie Fu wrote: > While building IGV, I found the jdk version described in the README.md is incorrect. > > According to the pom.xml: https://github.com/openjdk/jdk/blob/master/src/utils/IdealGraphVisualizer/pom.xml#L82 > It should be updated to "between 17 and 21". > > Thanks. > Best regards, > Jie This pull request has now been integrated.
Changeset: 44549b60 Author: Jie Fu URL: https://git.openjdk.org/jdk/commit/44549b605a7aad1e3143a4058ef6504a7c04167a Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8329012: IGV: Update required JDK version in README.md Reviewed-by: rcastanedalo ------------- PR: https://git.openjdk.org/jdk/pull/18481 From epeter at openjdk.org Tue Mar 26 06:36:22 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 26 Mar 2024 06:36:22 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v4] In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 16:51:29 GMT, Kangcheng Xu wrote: >> test/hotspot/jtreg/compiler/c2/irTests/TestBoolNodeGvn.java line 58: >> >>> 56: & !(Integer.compareUnsigned((m & x), m) > 0) >>> 57: & Integer.compareUnsigned((x & m), m + 1) < 0 >>> 58: & Integer.compareUnsigned((m & x), m + 1) < 0; >> >> For easier reading, I would have put the `&` at the end of the line. >> Btw: is this supposed to be a bitwise or a binary and? > > Done. I'm intentionally using bitwise `&` here so that the compiler doesn't short-circuit on conditions with `&&`. We wish to test all for correctness. The indentation is still not right, now it looks like the first `!` might apply to all lines. By "short-circuit" you mean exit after checking the first term, rather than evaluating all? Ok, I get that, makes sense.
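To make the `&` vs. `&&` point from this exchange concrete: on `boolean` operands, `&` always evaluates both sides, while `&&` stops as soon as the result is known — which is why the test deliberately uses `&` to exercise every term. A minimal demonstration:

```java
public class ShortCircuitDemo {
    static int evaluations = 0;

    static boolean touch(boolean v) {
        evaluations++;
        return v;
    }

    public static void main(String[] args) {
        boolean r1 = touch(false) && touch(true); // right side is skipped
        System.out.println("after &&: " + evaluations); // after &&: 1

        evaluations = 0;
        boolean r2 = touch(false) & touch(true);  // both sides are evaluated
        System.out.println("after &: " + evaluations);  // after &: 2
        System.out.println(r1 + " " + r2); // false false
    }
}
```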
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1538657223 From epeter at openjdk.org Tue Mar 26 06:48:22 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 26 Mar 2024 06:48:22 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v6] In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 20:10:53 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. > > Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: > > also renames the class name in @run test/hotspot/jtreg/compiler/c2/irTests/TestBoolNodeGVN.java line 53: > 51: @Test > 52: @Arguments(values = {Argument.DEFAULT, Argument.DEFAULT}) > 53: @IR(failOn = IRNode.CMP_U, phase = CompilePhase.AFTER_PARSING, applyIfPlatform = {"x86", "false"}) Is the 32bit x86 the only platform that does not support it? An alternative would be to enable the IR rule on all platforms that we know support the unsigned compare. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1538665654 From epeter at openjdk.org Tue Mar 26 06:48:22 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 26 Mar 2024 06:48:22 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v6] In-Reply-To: References: Message-ID: <0lCJvmKHW9CSdymRKKTxvxYCnHb3-1FHNnmmamFKVwQ=.e181629f-fac5-4c29-90f8-9e973e96e2ef@github.com> On Tue, 26 Mar 2024 06:44:18 GMT, Emanuel Peter wrote: >> Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: >> >> also renames the class name in @run > > test/hotspot/jtreg/compiler/c2/irTests/TestBoolNodeGVN.java line 53: > >> 51: @Test >> 52: @Arguments(values = {Argument.DEFAULT, Argument.DEFAULT}) >> 53: @IR(failOn = IRNode.CMP_U, phase = CompilePhase.AFTER_PARSING, applyIfPlatform = {"x86", "false"}) > > Is the 32bit x86 the only platform that does not support it? An alternative would be to enable the IR rule on all platforms that we know support the unsigned compare. Otherwise some less well tested platform might not support it, and then they have to fix up this rule later, having less context/understanding and might get things wrong then. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1538666550 From duke at openjdk.org Tue Mar 26 07:21:43 2024 From: duke at openjdk.org (SUN Guoyun) Date: Tue, 26 Mar 2024 07:21:43 GMT Subject: RFR: 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode Message-ID: This patch prohibits the conversion from "(x+1)+y" into "(x+y)+1" when y is a CallNode to reduce unnecessary spillcode and ADDNode. 
Testing: tier1-3 in x86_64 and LoongArch64 ------------- Commit messages: - Merge branch 'openjdk:master' into 8328865 - 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode Changes: https://git.openjdk.org/jdk/pull/18482/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18482&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8328865 Stats: 11 lines in 1 file changed: 4 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/18482.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18482/head:pull/18482 PR: https://git.openjdk.org/jdk/pull/18482 From kxu at openjdk.org Tue Mar 26 08:22:23 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Tue, 26 Mar 2024 08:22:23 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v6] In-Reply-To: <0lCJvmKHW9CSdymRKKTxvxYCnHb3-1FHNnmmamFKVwQ=.e181629f-fac5-4c29-90f8-9e973e96e2ef@github.com> References: <0lCJvmKHW9CSdymRKKTxvxYCnHb3-1FHNnmmamFKVwQ=.e181629f-fac5-4c29-90f8-9e973e96e2ef@github.com> Message-ID: On Tue, 26 Mar 2024 06:45:29 GMT, Emanuel Peter wrote: >> test/hotspot/jtreg/compiler/c2/irTests/TestBoolNodeGVN.java line 53: >> >>> 51: @Test >>> 52: @Arguments(values = {Argument.DEFAULT, Argument.DEFAULT}) >>> 53: @IR(failOn = IRNode.CMP_U, phase = CompilePhase.AFTER_PARSING, applyIfPlatform = {"x86", "false"}) >> >> Is the 32bit x86 the only platform that does not support it? An alternative would be to enable the IR rule on all platforms that we know support the unsigned compare. > > Otherwise some less well tested platform might not support it, and then they have to fix up this rule later, having less context/understanding and might get things wrong then. Good point. I know x64 and aarch64 support it for sure, but it looks like RISCV also has supports for it [0][1]. I'm not able to verify this on RISCV for not having access to such a machine. 
[0] https://github.com/openjdk/jdk/blob/44549b605a7aad1e3143a4058ef6504a7c04167a/src/hotspot/cpu/riscv/riscv.ad#L8800-L8817 [1] https://github.com/openjdk/jdk/commit/f9795d0d09a82cafb3e79ad8667e505c194d745b ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1538762860 From roland at openjdk.org Tue Mar 26 08:24:28 2024 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 26 Mar 2024 08:24:28 GMT Subject: RFR: 8324121: SIGFPE in PhaseIdealLoop::extract_long_range_checks [v2] In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 22:23:08 GMT, Dean Long wrote: >>> In my opinion ABS() should assert that it has legal input (not MIN_INT) and output (non-negative value) in debug builds. >> >> I agree and filed [JDK-8328934](https://bugs.openjdk.org/browse/JDK-8328934) for that. > > Unfortunately, there is still an overflow here when `scale` is min_jlong. Right but isn't it harmless in this particular case? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18397#discussion_r1538765899 From kxu at openjdk.org Tue Mar 26 08:26:46 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Tue, 26 Mar 2024 08:26:46 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v7] In-Reply-To: References: Message-ID: <6gvlrcMfuzkluivBvF_6D-bfEuIuyGSS9gT61M5tGPU=.f0e217bb-6b80-449b-a409-dad4ce78df2b@github.com> > This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) > > Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. > > New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. 
Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: apply test only on x64, aarch64 and riscv64 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18198/files - new: https://git.openjdk.org/jdk/pull/18198/files/02c01edf..ae5bed23 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=05-06 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18198.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18198/head:pull/18198 PR: https://git.openjdk.org/jdk/pull/18198 From kxu at openjdk.org Tue Mar 26 08:31:48 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Tue, 26 Mar 2024 08:31:48 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v8] In-Reply-To: References: Message-ID: > This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) > > Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. > > New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. 
Kangcheng Xu has updated the pull request incrementally with two additional commits since the last revision: - update comments - fix indentation again ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18198/files - new: https://git.openjdk.org/jdk/pull/18198/files/ae5bed23..53cf5b3b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=06-07 Stats: 5 lines in 1 file changed: 1 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/18198.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18198/head:pull/18198 PR: https://git.openjdk.org/jdk/pull/18198 From kxu at openjdk.org Tue Mar 26 08:35:23 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Tue, 26 Mar 2024 08:35:23 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v4] In-Reply-To: References: Message-ID: On Tue, 26 Mar 2024 06:33:16 GMT, Emanuel Peter wrote: >> Done. I'm intentionally using bitwise `&` here so that the compiler doesn't short-circuit on conditions with `&&`. We wish to test all terms for correctness. > > The indentation is still not right; now it looks like the first `!` might apply to all lines. > By "short-circuit" you mean exit after checking the first term, rather than evaluating all? Ok, I get that, makes sense. Fixed indentation (again). On second thought, it doesn't really matter whether `&` or `&&` is used here because all terms will be evaluated regardless.
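The `&` versus `&&` point is easy to demonstrate in isolation (a stand-alone sketch with a made-up class name, not code from the PR): `&&` skips the remaining terms as soon as one is false, while non-short-circuit `&` evaluates every term.

```java
public class ShortCircuitDemo {
    static int evaluated;

    static boolean term(boolean v) {
        evaluated++;           // count how many terms actually run
        return v;
    }

    public static void main(String[] args) {
        evaluated = 0;
        boolean a = term(false) && term(true) && term(true); // stops after the 1st term
        int withAndAnd = evaluated;

        evaluated = 0;
        boolean b = term(false) & term(true) & term(true);   // runs all 3 terms
        int withAnd = evaluated;

        System.out.println("&& evaluated " + withAndAnd + " term(s), & evaluated " + withAnd);
        if (withAndAnd != 1 || withAnd != 3 || a || b) {
            throw new AssertionError();
        }
    }
}
```

In the test under discussion every term is expected to be true, so `&&` would in fact also evaluate all of them, which is the observation made above.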
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1538785342 From dlunden at openjdk.org Tue Mar 26 14:40:51 2024 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 26 Mar 2024 14:40:51 GMT Subject: RFR: 8323682: C2: guard check is not generated in Arrays.copyOfRange intrinsic when allocation is eliminated by EA [v2] In-Reply-To: References: Message-ID: > The library intrinsic `_copyOfRange` does not add a guard for start indices that are larger than the length of the source arrays. Macro expansion of `ArrayCopy` nodes later adds such a guard, but in certain situations escape analysis may result in removing the `ArrayCopy` node before it is expanded. The result is incorrect behavior of the compiled program (as the missing guard may have relevant side effects, such as throwing an exception). > > ### Changeset > > - Add the missing guard (start index <= source array length). > - Add a regression test. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/8388044152) > - tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. 
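The user-visible side effect that the missing guard drops can be checked at the Java level (a stand-alone illustration with a hypothetical class name, not the PR's regression test): `Arrays.copyOfRange` must throw when the start index exceeds the source length, and that exception is exactly what is lost if the guard disappears along with the eliminated allocation.

```java
import java.util.Arrays;

public class CopyOfRangeGuard {
    public static void main(String[] args) {
        int[] src = new int[4];

        // Legal: from <= src.length; the result is zero-padded to the requested length.
        int[] ok = Arrays.copyOfRange(src, 2, 6);
        if (ok.length != 4) throw new AssertionError();

        // Illegal: from > src.length must throw ArrayIndexOutOfBoundsException.
        boolean threw = false;
        try {
            Arrays.copyOfRange(src, 5, 6);
        } catch (ArrayIndexOutOfBoundsException e) {
            threw = true;
        }
        if (!threw) throw new AssertionError("guard side effect was lost");
        System.out.println("guard semantics intact");
    }
}
```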
Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Readd negative end check ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18472/files - new: https://git.openjdk.org/jdk/pull/18472/files/47cfe37d..2fe44dda Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18472&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18472&range=00-01 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18472.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18472/head:pull/18472 PR: https://git.openjdk.org/jdk/pull/18472 From dlunden at openjdk.org Tue Mar 26 14:40:51 2024 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 26 Mar 2024 14:40:51 GMT Subject: RFR: 8323682: C2: guard check is not generated in Arrays.copyOfRange intrinsic when allocation is eliminated by EA [v2] In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 16:50:36 GMT, Vladimir Kozlov wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Readd negative end check > src/hotspot/share/opto/library_call.cpp line 4323: > >> 4321: // Bail out if start is negative. >> 4322: generate_negative_guard(start, bailout, &start); >> 4323: > > I think we need this check to avoid underflow in the integer expression (end - start): (min_int - 1) == max_int OK, I see. Thanks, I re-added the `end` guard.
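The underflow being guarded against is easy to see in isolation (a sketch with a made-up class name; `start`/`end` mirror the intrinsic's parameters, not actual VM code): without the negative-`end` guard, `end - start` can wrap around.

```java
public class UnderflowDemo {
    public static void main(String[] args) {
        // (min_int - 1) wraps to max_int in 32-bit two's complement.
        int minIntMinusOne = Integer.MIN_VALUE - 1;
        if (minIntMinusOne != Integer.MAX_VALUE) throw new AssertionError();

        // So a negative end that slipped past the guards can make the
        // computed copy length (end - start) look like a huge positive value.
        int start = 1;
        int end = Integer.MIN_VALUE;   // hypothetical un-guarded input
        int length = end - start;
        System.out.println("end - start = " + length); // prints 2147483647
        if (length != Integer.MAX_VALUE) throw new AssertionError();
    }
}
```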
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18472#discussion_r1539380307 From dfenacci at openjdk.org Tue Mar 26 16:02:23 2024 From: dfenacci at openjdk.org (Damon Fenacci) Date: Tue, 26 Mar 2024 16:02:23 GMT Subject: RFR: 8327964: Simplify BigInteger.implMultiplyToLen intrinsic [v3] In-Reply-To: References: Message-ID: On Tue, 19 Mar 2024 21:09:31 GMT, Yudi Zheng wrote: >> Moving array construction within BigInteger.implMultiplyToLen intrinsic candidate to its caller simplifies the intrinsic implementation in JIT compiler. > > Yudi Zheng has updated the pull request incrementally with one additional commit since the last revision: > > address comment. `multiply_to_len` seems to be used by `generate_squareToLen` as well for aarch64 and riscv but `zlen` is still passed in a register. https://github.com/openjdk/jdk/blob/870a6127cf54264c691f7322d775b202705c3bfa/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4710 https://github.com/openjdk/jdk/blob/870a6127cf54264c691f7322d775b202705c3bfa/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp#L2881 I think it might work anyway but it might be better to adapt them if only for completeness. ------------- PR Review: https://git.openjdk.org/jdk/pull/18226#pullrequestreview-1960906919 From kvn at openjdk.org Tue Mar 26 16:02:23 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 16:02:23 GMT Subject: RFR: 8323682: C2: guard check is not generated in Arrays.copyOfRange intrinsic when allocation is eliminated by EA [v2] In-Reply-To: References: Message-ID: On Tue, 26 Mar 2024 14:40:51 GMT, Daniel Lund?n wrote: >> The library intrinsic `_copyOfRange` does not add a guard for start indices that are larger than the length of the source arrays. Macro expansion of `ArrayCopy` nodes later adds such a guard, but in certain situations escape analysis may result in removing the `ArrayCopy` node before it is expanded. 
The result is incorrect behavior of the compiled program (as the missing guard may have relevant side effects, such as throwing an exception). >> >> ### Changeset >> >> - Add the missing guard (start index <= source array length). >> - Add a regression test. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/8388044152) >> - tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Readd negative end check Looks good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18472#pullrequestreview-1960906906 From kvn at openjdk.org Tue Mar 26 16:37:26 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 16:37:26 GMT Subject: RFR: 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode In-Reply-To: References: Message-ID: <9H5q9ooTMan-QGaTGKirdTbWR0P4RSf3GUc5NrUxrSE=.4216de24-8d5f-4bdb-835d-27e0f948fc24@github.com> On Tue, 26 Mar 2024 07:17:21 GMT, SUN Guoyun wrote: > This patch prohibits the conversion from "(x+1)+y" into "(x+y)+1" when y is a CallNode, to reduce unnecessary spill code and Add nodes. > > Testing: tier1-3 in x86_64 and LoongArch64 I think the issue is not that y is a call node, but that the (x+1) result is used by the debug info of a call node. For example, the following code could trigger the same issue (I have not verified it):

static int y = 0;
...
int foo(int x, int z) {
    int a = x + 1;
    y = a;
    return a + z;
}

------------- PR Review: https://git.openjdk.org/jdk/pull/18482#pullrequestreview-1961080991 From kvn at openjdk.org Tue Mar 26 16:43:22 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 16:43:22 GMT Subject: RFR: 8328181: C2: assert(MaxVectorSize >= 32) failed: vector length should be >= 32 In-Reply-To: References: Message-ID: On Sun, 24 Mar 2024 09:58:59 GMT, Jatin Bhateja wrote: > This bug fix patch tightens the predication check for the small constant length clear array pattern and relaxes associated feature checks. Modified a few comments for clarity. > > Kindly review and approve. > > Best Regards, > Jatin src/hotspot/cpu/x86/x86.ad line 1755: > 1753: case Op_ClearArray: > 1754: if ((size_in_bits != 512) && !VM_Version::supports_avx512vl()) { > 1755: return false; Please add a comment to clarify the condition. I am reading it as: ClearArray will not be supported for non-avx512, because we can have a vector length of 512 bits for non-avx512. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18464#discussion_r1539707965 From kvn at openjdk.org Tue Mar 26 16:46:29 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 16:46:29 GMT Subject: RFR: JDK-8327986: ASAN reports use-after-free in DirectivesParserTest.empty_object_vm In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 13:57:53 GMT, Thomas Stuefe wrote: > ASAN reports a use-after-free, because we feed the string we got from `setlocale` back to `setlocale`, but the libc owns this string, and the libc decided to free it in the meantime. > > According to POSIX, it should be valid to pass into setlocale output from setlocale. > > However, glibc seems to delete the old string when calling setlocale again: > > https://codebrowser.dev/glibc/glibc/locale/setlocale.c.html#198 > > Best to make a copy, and pass in the copy to setlocale. Looks good. ------------- Marked as reviewed by kvn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/18235#pullrequestreview-1961106669 From dlong at openjdk.org Tue Mar 26 16:47:32 2024 From: dlong at openjdk.org (Dean Long) Date: Tue, 26 Mar 2024 16:47:32 GMT Subject: RFR: 8324121: SIGFPE in PhaseIdealLoop::extract_long_range_checks [v2] In-Reply-To: References: Message-ID: On Tue, 26 Mar 2024 08:22:03 GMT, Roland Westrelin wrote: >> Unfortunately, there is still an overflow here when `scale` is min_jlong. > > Right but isn't it harmless in this particular case? No, if it's undefined behavior, we can't be sure what result the C++ compiler will give. And if we test with -ftrapv it will crash. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18397#discussion_r1539717622 From kvn at openjdk.org Tue Mar 26 16:50:24 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 16:50:24 GMT Subject: RFR: 8323242: Remove vestigial DONT_USE_REGISTER_DEFINES In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 08:07:19 GMT, Koichi Sakata wrote: > This pull request removes an unnecessary directive. > > There is no definition of DONT_USE_REGISTER_DEFINES in HotSpot or the build system, so this `#ifndef`conditional directive is always true. We can remove it. > > I built OpenJDK with Zero VM as a test. It was successful. > > > $ ./configure --with-jvm-variants=zero --enable-debug > $ make images > $ ./build/macosx-aarch64-zero-fastdebug/jdk/bin/java -version > openjdk version "23-internal" 2024-09-17 > OpenJDK Runtime Environment (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk) > OpenJDK 64-Bit Zero VM (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk, interpreted mode) > > > It may be possible to remove the `#define noreg` as well because the CONSTANT_REGISTER_DECLARATION macro creates a variable named noreg, but I can't be sure. When I tried removing the noreg definition and building the OpenJDK, the build was successful. Okay, I approve it. 
------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18115#pullrequestreview-1961117533 From kvn at openjdk.org Tue Mar 26 16:52:25 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 16:52:25 GMT Subject: RFR: 8326421: Add jtreg test for large arrayCopy disjoint case. In-Reply-To: References: Message-ID: On Thu, 22 Feb 2024 13:01:50 GMT, Swati Sharma wrote: > Hi All, > > Added a new jtreg test case for large arrayCopy disjoint case. > This will test byte array copy operation for aligned and non aligned cases with array length greater than 2.5MB. > > Please review and provide your feedback. > > Thanks, > Swati > Intel test/hotspot/jtreg/compiler/arraycopy/TestArrayCopyDisjointLarge.java line 32: > 30: * @summary Test large arrayCopy. > 31: * > 32: * @run main/othervm/timeout=600 -XX:-TieredCompilation -Xbatch compiler.arraycopy.TestArrayCopyDisjointLarge What was the reason to use these 2 flags `-XX:-TieredCompilation -Xbatch`? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17962#discussion_r1539724959 From epeter at openjdk.org Tue Mar 26 17:15:21 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 26 Mar 2024 17:15:21 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal In-Reply-To: References: Message-ID: <-IotoxUWp3oltrrTRDMUNRlRUcFBB__ReS6i62afLTI=.f28b6942-caa4-4759-93bc-4f8f4f353393@github.com> On Mon, 25 Mar 2024 20:41:20 GMT, Vladimir Kozlov wrote: > HotSpot supports RTM (restricted transactional memory) for locking since JDK 8 on Intel's processors ([JDK-8031320](https://bugs.openjdk.org/browse/JDK-8031320)). It was added to other platforms but has since been disabled and removed on all but Intel's processors. There was attempt to deprecate it ([JDK-8292082](https://bugs.openjdk.org/browse/JDK-8292082)) during JDK 20 development but at that time it was decided to keep it. 
Recently we discussed this with Intel and they agreed with RTM deprecation and removal from HotSpot. > > RTM adds complexity and maintenance burden to HotSpot locking code. It was never enabled by default because it only helped in some cases of heavy lock contention. We are not testing RTM feature since JDK 14 when we problem-list related tests: [JDK-8226899](https://bugs.openjdk.org/browse/JDK-8226899). > > New LIGHTWEIGHT locking implementation will not support RTM locking: [JDK-8320321](https://bugs.openjdk.org/browse/JDK-8320321). > > I propose to deprecate the related flags and remove the flags and all related code in a later release. > > Changes are based on @rkennke changes for JDK 20 [#9810](https://github.com/openjdk/jdk/pull/9810) > > Testing: tier1 @vnkozlov looks reasonable. What about `NoRTMLockEliding`? I guess it is a compiler option. How does that have to be handled? ------------- PR Comment: https://git.openjdk.org/jdk/pull/18478#issuecomment-2021018283 From vlivanov at openjdk.org Tue Mar 26 17:55:21 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 26 Mar 2024 17:55:21 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 20:41:20 GMT, Vladimir Kozlov wrote: > HotSpot supports RTM (restricted transactional memory) for locking since JDK 8 on Intel's processors ([JDK-8031320](https://bugs.openjdk.org/browse/JDK-8031320)). It was added to other platforms but has since been disabled and removed on all but Intel's processors. There was attempt to deprecate it ([JDK-8292082](https://bugs.openjdk.org/browse/JDK-8292082)) during JDK 20 development but at that time it was decided to keep it. Recently we discussed this with Intel and they agreed with RTM deprecation and removal from HotSpot. > > RTM adds complexity and maintenance burden to HotSpot locking code. It was never enabled by default because it only helped in some cases of heavy lock contention. 
We are not testing RTM feature since JDK 14 when we problem-list related tests: [JDK-8226899](https://bugs.openjdk.org/browse/JDK-8226899). > > New LIGHTWEIGHT locking implementation will not support RTM locking: [JDK-8320321](https://bugs.openjdk.org/browse/JDK-8320321). > > I propose to deprecate the related flags and remove the flags and all related code in a later release. > > Changes are based on @rkennke changes for JDK 20 [#9810](https://github.com/openjdk/jdk/pull/9810) > > Testing: tier1 src/hotspot/share/runtime/arguments.cpp line 505: > 503: { "RegisterFinalizersAtInit", JDK_Version::jdk(22), JDK_Version::jdk(23), JDK_Version::jdk(24) }, > 504: #if defined(X86) > 505: { "UseRTMLocking", JDK_Version::jdk(23), JDK_Version::jdk(24), JDK_Version::jdk(25) }, Why do you include experimental options? Only `UseRTMLocking`, `UseRTMDeopt`, and `RTMRetryCount` are product. src/hotspot//cpu/x86/globals_x86.hpp: product(bool, UseRTMLocking, false, \ src/hotspot//cpu/x86/globals_x86.hpp: product(bool, UseRTMForStackLocks, false, EXPERIMENTAL, \ src/hotspot//cpu/x86/globals_x86.hpp: product(bool, UseRTMDeopt, false, \ src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMRetryCount, 5, \ src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMSpinLoopCount, 100, EXPERIMENTAL, \ src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMAbortThreshold, 1000, EXPERIMENTAL, \ src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMLockingThreshold, 10000, EXPERIMENTAL, \ src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMAbortRatio, 50, EXPERIMENTAL, \ src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMTotalCountIncrRate, 64, EXPERIMENTAL, \ src/hotspot//cpu/x86/globals_x86.hpp: product(intx, RTMLockingCalculationDelay, 0, EXPERIMENTAL, \ src/hotspot//cpu/x86/globals_x86.hpp: product(bool, UseRTMXendForLockBusy, true, EXPERIMENTAL, \ src/hotspot//share/opto/c2_globals.hpp: product(bool, PrintPreciseRTMLockingStatistics, false, DIAGNOSTIC, \ ------------- PR Review Comment: 
https://git.openjdk.org/jdk/pull/18478#discussion_r1539829996 From kvn at openjdk.org Tue Mar 26 18:27:23 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 18:27:23 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal In-Reply-To: <-IotoxUWp3oltrrTRDMUNRlRUcFBB__ReS6i62afLTI=.f28b6942-caa4-4759-93bc-4f8f4f353393@github.com> References: <-IotoxUWp3oltrrTRDMUNRlRUcFBB__ReS6i62afLTI=.f28b6942-caa4-4759-93bc-4f8f4f353393@github.com> Message-ID: On Tue, 26 Mar 2024 17:12:41 GMT, Emanuel Peter wrote: > @vnkozlov looks reasonable. What about `NoRTMLockEliding`? I guess it is a compiler option. How does that have to be handled? We currently don't have mechanism to handle deprecation for `CompileCommand` options. But in both places, [methodData.cpp#L1338](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/oops/methodData.cpp#L1338) and [compile.cpp#L1083](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/compile.cpp#L1083), it is guarded by `UseRTMLocking` flag check so we will get deprecation notification for `UseRTMLocking`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18478#issuecomment-2021173692 From kvn at openjdk.org Tue Mar 26 18:27:24 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 18:27:24 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal In-Reply-To: References: Message-ID: On Tue, 26 Mar 2024 17:53:08 GMT, Vladimir Ivanov wrote: >> HotSpot supports RTM (restricted transactional memory) for locking since JDK 8 on Intel's processors ([JDK-8031320](https://bugs.openjdk.org/browse/JDK-8031320)). It was added to other platforms but has since been disabled and removed on all but Intel's processors. There was attempt to deprecate it ([JDK-8292082](https://bugs.openjdk.org/browse/JDK-8292082)) during JDK 20 development but at that time it was decided to keep it. Recently we discussed this with Intel and they agreed with RTM deprecation and removal from HotSpot. 
>> >> RTM adds complexity and maintenance burden to HotSpot locking code. It was never enabled by default because it only helped in some cases of heavy lock contention. We are not testing RTM feature since JDK 14 when we problem-list related tests: [JDK-8226899](https://bugs.openjdk.org/browse/JDK-8226899). >> >> New LIGHTWEIGHT locking implementation will not support RTM locking: [JDK-8320321](https://bugs.openjdk.org/browse/JDK-8320321). >> >> I propose to deprecate the related flags and remove the flags and all related code in a later release. >> >> Changes are based on @rkennke changes for JDK 20 [#9810](https://github.com/openjdk/jdk/pull/9810) >> >> Testing: tier1 > > src/hotspot/share/runtime/arguments.cpp line 505: > >> 503: { "RegisterFinalizersAtInit", JDK_Version::jdk(22), JDK_Version::jdk(23), JDK_Version::jdk(24) }, >> 504: #if defined(X86) >> 505: { "UseRTMLocking", JDK_Version::jdk(23), JDK_Version::jdk(24), JDK_Version::jdk(25) }, > > Why do you include experimental options? > > Only `UseRTMLocking`, `UseRTMDeopt`, and `RTMRetryCount` are product. 
> > > src/hotspot//cpu/x86/globals_x86.hpp: product(bool, UseRTMLocking, false, \ > src/hotspot//cpu/x86/globals_x86.hpp: product(bool, UseRTMForStackLocks, false, EXPERIMENTAL, \ > src/hotspot//cpu/x86/globals_x86.hpp: product(bool, UseRTMDeopt, false, \ > src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMRetryCount, 5, \ > src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMSpinLoopCount, 100, EXPERIMENTAL, \ > src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMAbortThreshold, 1000, EXPERIMENTAL, \ > src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMLockingThreshold, 10000, EXPERIMENTAL, \ > src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMAbortRatio, 50, EXPERIMENTAL, \ > src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMTotalCountIncrRate, 64, EXPERIMENTAL, \ > src/hotspot//cpu/x86/globals_x86.hpp: product(intx, RTMLockingCalculationDelay, 0, EXPERIMENTAL, \ > src/hotspot//cpu/x86/globals_x86.hpp: product(bool, UseRTMXendForLockBusy, true, EXPERIMENTAL, \ > src/hotspot//share/opto/c2_globals.hpp: product(bool, PrintPreciseRTMLockingStatistics, false, DIAGNOSTIC, \ I listed all RTM flags to get VM's deprecation message for all of them. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18478#discussion_r1539872281 From kvn at openjdk.org Tue Mar 26 18:27:24 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 18:27:24 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal In-Reply-To: References: Message-ID: On Tue, 26 Mar 2024 18:23:22 GMT, Vladimir Kozlov wrote: >> src/hotspot/share/runtime/arguments.cpp line 505: >> >>> 503: { "RegisterFinalizersAtInit", JDK_Version::jdk(22), JDK_Version::jdk(23), JDK_Version::jdk(24) }, >>> 504: #if defined(X86) >>> 505: { "UseRTMLocking", JDK_Version::jdk(23), JDK_Version::jdk(24), JDK_Version::jdk(25) }, >> >> Why do you include experimental options? >> >> Only `UseRTMLocking`, `UseRTMDeopt`, and `RTMRetryCount` are product. 
>> >> >> src/hotspot//cpu/x86/globals_x86.hpp: product(bool, UseRTMLocking, false, \
>> src/hotspot//cpu/x86/globals_x86.hpp: product(bool, UseRTMForStackLocks, false, EXPERIMENTAL, \
>> src/hotspot//cpu/x86/globals_x86.hpp: product(bool, UseRTMDeopt, false, \
>> src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMRetryCount, 5, \
>> src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMSpinLoopCount, 100, EXPERIMENTAL, \
>> src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMAbortThreshold, 1000, EXPERIMENTAL, \
>> src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMLockingThreshold, 10000, EXPERIMENTAL, \
>> src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMAbortRatio, 50, EXPERIMENTAL, \
>> src/hotspot//cpu/x86/globals_x86.hpp: product(int, RTMTotalCountIncrRate, 64, EXPERIMENTAL, \
>> src/hotspot//cpu/x86/globals_x86.hpp: product(intx, RTMLockingCalculationDelay, 0, EXPERIMENTAL, \
>> src/hotspot//cpu/x86/globals_x86.hpp: product(bool, UseRTMXendForLockBusy, true, EXPERIMENTAL, \
>> src/hotspot//share/opto/c2_globals.hpp: product(bool, PrintPreciseRTMLockingStatistics, false, DIAGNOSTIC, \
> I listed all RTM flags to get VM's deprecation message for all of them. But I missed `PrintPreciseRTMLockingStatistics`. I will add it too. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18478#discussion_r1539874399 From kvn at openjdk.org Tue Mar 26 18:27:24 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 18:27:24 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal In-Reply-To: References: Message-ID: On Tue, 26 Mar 2024 18:24:51 GMT, Vladimir Kozlov wrote: >> I listed all RTM flags to get VM's deprecation message for all of them. > > But I missed `PrintPreciseRTMLockingStatistics`. I will add it too. CSR may need only to list product flags but I decided to list them all in CSR too.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18478#discussion_r1539879951 From kvn at openjdk.org Tue Mar 26 18:47:49 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 18:47:49 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v2] In-Reply-To: References: Message-ID: > HotSpot supports RTM (restricted transactional memory) for locking since JDK 8 on Intel's processors ([JDK-8031320](https://bugs.openjdk.org/browse/JDK-8031320)). It was added to other platforms but has since been disabled and removed on all but Intel's processors. There was attempt to deprecate it ([JDK-8292082](https://bugs.openjdk.org/browse/JDK-8292082)) during JDK 20 development but at that time it was decided to keep it. Recently we discussed this with Intel and they agreed with RTM deprecation and removal from HotSpot. > > RTM adds complexity and maintenance burden to HotSpot locking code. It was never enabled by default because it only helped in some cases of heavy lock contention. We are not testing RTM feature since JDK 14 when we problem-list related tests: [JDK-8226899](https://bugs.openjdk.org/browse/JDK-8226899). > > New LIGHTWEIGHT locking implementation will not support RTM locking: [JDK-8320321](https://bugs.openjdk.org/browse/JDK-8320321). > > I propose to deprecate the related flags and remove the flags and all related code in a later release. 
> > Changes are based on @rkennke changes for JDK 20 [#9810](https://github.com/openjdk/jdk/pull/9810) > > Testing: tier1 Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Added PrintPreciseRTMLockingStatistics flag for deprecation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18478/files - new: https://git.openjdk.org/jdk/pull/18478/files/ea5010e5..a8e3afa6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18478&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18478&range=00-01 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18478.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18478/head:pull/18478 PR: https://git.openjdk.org/jdk/pull/18478 From simonis at openjdk.org Tue Mar 26 19:19:43 2024 From: simonis at openjdk.org (Volker Simonis) Date: Tue, 26 Mar 2024 19:19:43 GMT Subject: RFR: 8251462: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 Message-ID: Since [JDK-8251462: Simplify compilation policy](https://bugs.openjdk.org/browse/JDK-8251462), introduced in JDK 17, no native wrappers are generated any more if running with `-XX:-TieredCompilation` (i.e. native methods are not compiled any more). 
The attached JMH benchmark demonstrates that native method calls became twice as expensive with JDK 17:

public static native void emptyStaticNativeMethod();

@Benchmark
public static void baseline() {
}

@Benchmark
public static void staticMethodCallingStatic() {
    emptyStaticMethod();
}

@Benchmark
public static void staticMethodCallingStaticNative() {
    emptyStaticNativeMethod();
}

@Benchmark
@Fork(jvmArgsAppend = "-XX:-TieredCompilation")
public static void staticMethodCallingStaticNativeNoTiered() {
    emptyStaticNativeMethod();
}

@Benchmark
@Fork(jvmArgsAppend = "-XX:+PreferInterpreterNativeStubs")
public static void staticMethodCallingStaticNativeIntStub() {
    emptyStaticNativeMethod();
}

JDK 11
======

Benchmark                                           Mode  Cnt   Score   Error  Units
NativeCall.baseline                                 avgt    5   0.390 ± 0.016  ns/op
NativeCall.staticMethodCallingStatic                avgt    5   1.693 ± 0.053  ns/op
NativeCall.staticMethodCallingStaticNative          avgt    5  10.287 ± 0.754  ns/op
NativeCall.staticMethodCallingStaticNativeNoTiered  avgt    5   9.966 ± 0.248  ns/op
NativeCall.staticMethodCallingStaticNativeIntStub   avgt    5  20.384 ± 0.444  ns/op

JDK 17 & 21
===========

Benchmark                                           Mode  Cnt   Score   Error  Units
NativeCall.baseline                                 avgt    5   0.390 ± 0.017  ns/op
NativeCall.staticMethodCallingStatic                avgt    5   1.852 ± 0.272  ns/op
NativeCall.staticMethodCallingStaticNative          avgt    5  10.648 ± 0.661  ns/op
NativeCall.staticMethodCallingStaticNativeNoTiered  avgt    5  20.657 ± 1.084  ns/op
NativeCall.staticMethodCallingStaticNativeIntStub   avgt    5  22.429 ± 0.991  ns/op

The issue can be seen if we run with `-XX:+PrintCompilation -XX:+PrintInlining`.
With JDK 11 we get the following output for `-XX:+TieredCompilation`:

172  111  b  3  io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes)
  @ 0  io.simonis.NativeCall::emptyStaticNativeMethod (0 bytes)  native method
172  112  n  0  io.simonis.NativeCall::emptyStaticNativeMethod (native)  (static)
173  113  b  4  io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes)
  @ 0  io.simonis.NativeCall::emptyStaticNativeMethod (0 bytes)  native method
173  111     3  io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes)  made not entrant

As you can see, the native wrapper for `NativeCall::emptyStaticNativeMethod()` gets compiled with compile id 112. If we run with `-XX:-TieredCompilation`:

117  5  b  io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes)
  @ 0  io.simonis.NativeCall::emptyStaticNativeMethod (0 bytes)  native method
117  6  n  io.simonis.NativeCall::emptyStaticNativeMethod (native)  (static)

There's still a native wrapper created with compile id 6. With JDK 17 and later, the `-XX:+PrintCompilation` output looks similar for the default `-XX:+TieredCompilation` case:

56  26  b  3  io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes)
  @ 0  io.simonis.NativeCall::emptyStaticNativeMethod (0 bytes)  native method
56  27  n  0  io.simonis.NativeCall::emptyStaticNativeMethod (native)  (static)
56  28  b  4  io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes)
  @ 0  io.simonis.NativeCall::emptyStaticNativeMethod (0 bytes)  native method
56  26     3  io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes)  made not entrant

But with `-XX:-TieredCompilation`, we don't generate the native wrapper any more:

58  5  b  io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes)
  @ 0  io.simonis.NativeCall::emptyStaticNativeMethod (0 bytes)  native method

Which basically means that we're always invoking the native method through the interpreter stub. There are certainly different ways to fix this issue.
The one proposed in this PR seemed simple and non-intrusive to me but I'm open for better alternatives. I'd especially appreciate if @veresov could take a look and advice. Once we agree on the problem and a solution for it, I can also add a test and potentially the JMH benchmark from the JBS issue to this PR. ------------- Commit messages: - 8251462: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 Changes: https://git.openjdk.org/jdk/pull/18496/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18496&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8251462 Stats: 4 lines in 1 file changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18496.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18496/head:pull/18496 PR: https://git.openjdk.org/jdk/pull/18496 From vlivanov at openjdk.org Tue Mar 26 19:53:22 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 26 Mar 2024 19:53:22 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v2] In-Reply-To: References: Message-ID: <_PFfe_eityVaZ4vGbRX4l2zJbTpqyXWZYTBO2ZqAA7M=.27fe4d67-0043-4452-b298-c859d44a12c6@github.com> On Tue, 26 Mar 2024 18:28:23 GMT, Vladimir Kozlov wrote: >> But I missed `PrintPreciseRTMLockingStatistics`. I will added it too. > > CSR may need only to list product flags but I decided to list them all in CSR too. > I listed all RTM flags to get VM's deprecation message for all of them. Any particular point in doing so? IMO it looks excessive. Moreover, all flags imply `UseRTMLocking`. Also, I don't see any other non-product flags in `special_jvm_flags` table you modify. 
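A minimal stand-alone way to watch whether a native wrapper gets compiled, without building a JNI stub of one's own, is to hammer an existing `native` method from the JDK (a sketch; the class name is made up, and note that HotSpot may also intrinsify `Object::hashCode`, so treat this purely as an illustration rather than a substitute for the JMH benchmark above):

```java
public class NativeWrapperProbe {
    static int sink;

    public static void main(String[] args) {
        Object o = new Object();
        int first = o.hashCode();          // Object.hashCode is declared native
        for (int i = 0; i < 2_000_000; i++) {
            sink = o.hashCode();           // hot call site; should trigger compilation
        }
        // The identity hash code is stable for the object's lifetime.
        if (sink != first) throw new AssertionError();
        System.out.println("done");
    }
}
```

Running this with `-XX:-TieredCompilation -XX:+PrintCompilation` and looking for an `n`-flagged line for `java.lang.Object::hashCode` is one way to compare builds before and after a fix; without the wrapper, the call keeps going through the interpreter stub.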
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18478#discussion_r1540006845 From dlong at openjdk.org Tue Mar 26 20:48:25 2024 From: dlong at openjdk.org (Dean Long) Date: Tue, 26 Mar 2024 20:48:25 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v7] In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 09:49:12 GMT, Galder Zamarreño wrote: >> src/hotspot/share/c1/c1_GraphBuilder.cpp line 2146: >> >>> 2144: ciType* receiver_type; >>> 2145: if (target->get_Method()->intrinsic_id() == vmIntrinsics::_clone && >>> 2146: ((receiver_type = state()->stack_at(state()->stack_size() - inline_target->arg_size())->exact_type()) == nullptr || // clone target is phi >> >> I don't think target-specific logic belongs here. And I don't understand the point about Phi nodes. Isn't the holder_known flag enough? For primitive arrays, isn't it true that inline_target->get_Method()->intrinsic_id() == vmIntrinsics::_clone? > >> I don't think target-specific logic belongs here. And I don't understand the point about Phi nodes. Isn't the holder_known flag enough? > > In my testing `holder_known` was not enough to detect objects that are not Phi. For example: > > > static int[] test(int[] ints) > { > return ints.clone(); > } > > > `holder_known` is false when it tries to C1 compile `ints.clone()`, am I missing something here? > >> For primitive arrays, isn't it true that inline_target->get_Method()->intrinsic_id() == vmIntrinsics::_clone? > > Possibly, but in this part of the logic I'm trying to find situations in which I don't want to apply the `clone` intrinsic. And those situations are non-array objects, and for arrays, those whose elements are not primitives. I don't see how I can craft such a condition with only `inline_target->get_Method()->intrinsic_id() == vmIntrinsics::_clone`? IOW, that condition might be true for primitive arrays, but is it false for non-array objects and non-primitive arrays? 
You're right about holder_known, but why do you need to check for _clone specifically at line 2137? If there is logic missing that prevents an inlining attempt then I think it should be fixed first, rather than in a followup. And I see that you need to do a receiver type check to allow only primitive arrays. Can you do that in append_alloc_array_copy, and bailout if not successful? The logic in build_graph_for_intrinsic would need to change slightly to support this. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17667#discussion_r1540085587 From kvn at openjdk.org Tue Mar 26 22:26:23 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 22:26:23 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v2] In-Reply-To: <_PFfe_eityVaZ4vGbRX4l2zJbTpqyXWZYTBO2ZqAA7M=.27fe4d67-0043-4452-b298-c859d44a12c6@github.com> References: <_PFfe_eityVaZ4vGbRX4l2zJbTpqyXWZYTBO2ZqAA7M=.27fe4d67-0043-4452-b298-c859d44a12c6@github.com> Message-ID: On Tue, 26 Mar 2024 19:50:58 GMT, Vladimir Ivanov wrote: >> CSR may need only to list product flags but I decided to list them all in CSR too. > >> I listed all RTM flags to get VM's deprecation message for all of them. > > Any particular point in doing so? IMO it looks excessive. Moreover, all flags imply `UseRTMLocking`. > > Also, I don't see any other non-product flags in `special_jvm_flags` table you modify. 
Some RTM flags are guarded by `UseRTMLocking` and some are not: [java.cpp#L267](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/java.cpp#L267) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18478#discussion_r1540183079 From kvn at openjdk.org Tue Mar 26 22:40:21 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 22:40:21 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v2] In-Reply-To: References: <_PFfe_eityVaZ4vGbRX4l2zJbTpqyXWZYTBO2ZqAA7M=.27fe4d67-0043-4452-b298-c859d44a12c6@github.com> Message-ID: On Tue, 26 Mar 2024 22:23:53 GMT, Vladimir Kozlov wrote: >>> I listed all RTM flags to get VM's deprecation message for all of them. >> >> Any particular point in doing so? IMO it looks excessive. Moreover, all flags imply `UseRTMLocking`. >> >> Also, I don't see any other non-product flags in `special_jvm_flags` table you modify. > > Some RTM flags are guarded by `UseRTMLocking` and some are not: [java.cpp#L267](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/java.cpp#L267) I want this PR to be a clear message that we deprecate all RTM-related flags, with the goal of removing all of them and the RTM code from HotSpot. Actually, the `java.1` man page listed `RTMAbortRatio` even though it is experimental. It listed product flags as well. Which reminds me that I need to update the `java.1` man page to move these flags to the `DEPRECATED JAVA OPTIONS` section. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18478#discussion_r1540196410 From simonis at openjdk.org Tue Mar 26 22:43:48 2024 From: simonis at openjdk.org (Volker Simonis) Date: Tue, 26 Mar 2024 22:43:48 GMT Subject: RFR: 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 [v2] In-Reply-To: References: Message-ID: > Since [JDK-8251462: Simplify compilation policy](https://bugs.openjdk.org/browse/JDK-8251462), introduced in JDK 17, no native wrappers are generated any more if running with `-XX:-TieredCompilation` (i.e. native methods are not compiled any more). > > The attached JMH benchmark demonstrates that native method calls became twice as expensive with JDK 17: > > public static native void emptyStaticNativeMethod(); > > @Benchmark > public static void baseline() { > } > > @Benchmark > public static void staticMethodCallingStatic() { > emptyStaticMethod(); > } > > @Benchmark > public static void staticMethodCallingStaticNative() { > emptyStaticNativeMethod(); > } > > @Benchmark > @Fork(jvmArgsAppend = "-XX:-TieredCompilation") > public static void staticMethodCallingStaticNativeNoTiered() { > emptyStaticNativeMethod(); > } > > @Benchmark > @Fork(jvmArgsAppend = "-XX:+PreferInterpreterNativeStubs") > public static void staticMethodCallingStaticNativeIntStub() { > emptyStaticNativeMethod(); > } > > > JDK 11 > ====== > > Benchmark Mode Cnt Score Error Units > NativeCall.baseline avgt 5 0.390 ± 0.016 ns/op > NativeCall.staticMethodCallingStatic avgt 5 1.693 ± 0.053 ns/op > NativeCall.staticMethodCallingStaticNative avgt 5 10.287 ± 0.754 ns/op > NativeCall.staticMethodCallingStaticNativeNoTiered avgt 5 9.966 ± 0.248 ns/op > NativeCall.staticMethodCallingStaticNativeIntStub avgt 5 20.384 ± 0.444 ns/op > > > JDK 17 & 21 > =========== > > Benchmark Mode Cnt Score Error Units > NativeCall.baseline avgt 5 0.390 ± 0.017 ns/op > NativeCall.staticMethodCallingStatic avgt 5 1.852 ± 
0.272 ns/op > NativeCall.staticMethodCallingStaticNative avgt 5 10.648 ± 0.661 ns/op > NativeCall.staticMethodCallingStaticNativeNoTiered avgt 5 20.657 ± 1.084 ns/op > NativeCall.staticMethodCallingStaticNativeIntStub avgt 5 22.429 ± 0.991 ns/op > > > The issue can be seen if we run with `-XX:+PrintCompilation -XX:+PrintInlining`. With JDK 11 we get the following output for `-XX:+TieredCompilation`: > > 172 111 b 3 io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes) > @ 0 io.simonis.NativeCall::emptyStaticNativeMethod (0 bytes) native method > 172 112 n 0 io.simonis.NativeCall::emptyStaticNativeMethod (native... Volker Simonis has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18496/files - new: https://git.openjdk.org/jdk/pull/18496/files/7eb0d11d..157e124e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18496&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18496&range=00-01 Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18496.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18496/head:pull/18496 PR: https://git.openjdk.org/jdk/pull/18496 From kvn at openjdk.org Tue Mar 26 22:44:23 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 22:44:23 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v2] In-Reply-To: References: <_PFfe_eityVaZ4vGbRX4l2zJbTpqyXWZYTBO2ZqAA7M=.27fe4d67-0043-4452-b298-c859d44a12c6@github.com> Message-ID: <6lRKmDIkp9wVzD1qtJIoI81vxJKTSiTVBHzI0ohf7OE=.bb146241-91f8-408a-867e-3d6e96c5c86f@github.com> On Tue, 26 Mar 2024 22:38:12 GMT, Vladimir Kozlov wrote: >> Some RTM flags are 
guarded by `UseRTMLocking` and some are not: [java.cpp#L267](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/java.cpp#L267) > > I want this PR to be a clear message that we deprecate all RTM-related flags, with the goal of removing all of them and the RTM code from HotSpot. > Actually, the `java.1` man page listed `RTMAbortRatio` even though it is experimental. It listed product flags as well. > Which reminds me that I need to update the `java.1` man page to move these flags to the `DEPRECATED JAVA OPTIONS` section. Okay, let's ask the main expert @dholmes-ora what we should do: list only the product flags in this PR and CSR, only those listed in `java.1`, or all of them? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18478#discussion_r1540198602 From kvn at openjdk.org Tue Mar 26 23:06:22 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 23:06:22 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v2] In-Reply-To: <6lRKmDIkp9wVzD1qtJIoI81vxJKTSiTVBHzI0ohf7OE=.bb146241-91f8-408a-867e-3d6e96c5c86f@github.com> References: <_PFfe_eityVaZ4vGbRX4l2zJbTpqyXWZYTBO2ZqAA7M=.27fe4d67-0043-4452-b298-c859d44a12c6@github.com> <6lRKmDIkp9wVzD1qtJIoI81vxJKTSiTVBHzI0ohf7OE=.bb146241-91f8-408a-867e-3d6e96c5c86f@github.com> Message-ID: <0mKdrj4r01JJmVCFDC8SJ62ABuSAx4eS2_BV7e8dR0Q=.6dcdc37a-2038-4232-8a50-f498cafd0d3f@github.com> On Tue, 26 Mar 2024 22:41:35 GMT, Vladimir Kozlov wrote: >> I want this PR to be a clear message that we deprecate all RTM-related flags, with the goal of removing all of them and the RTM code from HotSpot. >> Actually, the `java.1` man page listed `RTMAbortRatio` even though it is experimental. It listed product flags as well. >> Which reminds me that I need to update the `java.1` man page to move these flags to the `DEPRECATED JAVA OPTIONS` section. > > Okay, let's ask the main expert @dholmes-ora what we should do: list only the product flags in this PR and CSR, only those listed in `java.1`, or all of them? @dholmes-ora told me to list only product flags in PR and CSR. And update man page. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18478#discussion_r1540213522 From kvn at openjdk.org Tue Mar 26 23:27:46 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 23:27:46 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v3] In-Reply-To: References: Message-ID: > HotSpot supports RTM (restricted transactional memory) for locking since JDK 8 on Intel's processors ([JDK-8031320](https://bugs.openjdk.org/browse/JDK-8031320)). It was added to other platforms but has since been disabled and removed on all but Intel's processors. There was an attempt to deprecate it ([JDK-8292082](https://bugs.openjdk.org/browse/JDK-8292082)) during JDK 20 development but at that time it was decided to keep it. Recently we discussed this with Intel and they agreed with RTM deprecation and removal from HotSpot. > > RTM adds complexity and maintenance burden to HotSpot locking code. It was never enabled by default because it only helped in some cases of heavy lock contention. We have not been testing the RTM feature since JDK 14, when we problem-listed the related tests: [JDK-8226899](https://bugs.openjdk.org/browse/JDK-8226899). > > The new LIGHTWEIGHT locking implementation will not support RTM locking: [JDK-8320321](https://bugs.openjdk.org/browse/JDK-8320321). > > I propose to deprecate the related flags and remove the flags and all related code in a later release. > > Changes are based on @rkennke's changes for JDK 20 [#9810](https://github.com/openjdk/jdk/pull/9810) > > Testing: tier1 Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: List only product flags for deprecation. Update java.1 man page. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/18478/files - new: https://git.openjdk.org/jdk/pull/18478/files/a8e3afa6..923ee7ca Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18478&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18478&range=01-02 Stats: 123 lines in 2 files changed: 35 ins; 88 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18478.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18478/head:pull/18478 PR: https://git.openjdk.org/jdk/pull/18478 From kvn at openjdk.org Tue Mar 26 23:30:23 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 23:30:23 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v3] In-Reply-To: <0mKdrj4r01JJmVCFDC8SJ62ABuSAx4eS2_BV7e8dR0Q=.6dcdc37a-2038-4232-8a50-f498cafd0d3f@github.com> References: <_PFfe_eityVaZ4vGbRX4l2zJbTpqyXWZYTBO2ZqAA7M=.27fe4d67-0043-4452-b298-c859d44a12c6@github.com> <6lRKmDIkp9wVzD1qtJIoI81vxJKTSiTVBHzI0ohf7OE=.bb146241-91f8-408a-867e-3d6e96c5c86f@github.com> <0mKdrj4r01JJmVCFDC8SJ62ABuSAx4eS2_BV7e8dR0Q=.6dcdc37a-2038-4232-8a50-f498cafd0d3f@github.com> Message-ID: On Tue, 26 Mar 2024 23:04:10 GMT, Vladimir Kozlov wrote: >> Okay, lets ask main expert @dholmes-ora what we should do: list only product flags in this PR and CSR or listed in `java.1` or all? > > @dholmes-ora told me to list only product flags in PR and CSR. And update man page. Done. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18478#discussion_r1540231741 From vlivanov at openjdk.org Tue Mar 26 23:34:22 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 26 Mar 2024 23:34:22 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v3] In-Reply-To: References: Message-ID: <5GY1zw7gLzFyiSuGrNjuWQqDneh56soSz6f31M3tJ9s=.863b1b22-8839-4f5e-88f0-57a02ebd1bfe@github.com> On Tue, 26 Mar 2024 23:27:46 GMT, Vladimir Kozlov wrote: >> HotSpot supports RTM (restricted transactional memory) for locking since JDK 8 on Intel's processors ([JDK-8031320](https://bugs.openjdk.org/browse/JDK-8031320)). It was added to other platforms but has since been disabled and removed on all but Intel's processors. There was attempt to deprecate it ([JDK-8292082](https://bugs.openjdk.org/browse/JDK-8292082)) during JDK 20 development but at that time it was decided to keep it. Recently we discussed this with Intel and they agreed with RTM deprecation and removal from HotSpot. >> >> RTM adds complexity and maintenance burden to HotSpot locking code. It was never enabled by default because it only helped in some cases of heavy lock contention. We are not testing RTM feature since JDK 14 when we problem-list related tests: [JDK-8226899](https://bugs.openjdk.org/browse/JDK-8226899). >> >> New LIGHTWEIGHT locking implementation will not support RTM locking: [JDK-8320321](https://bugs.openjdk.org/browse/JDK-8320321). >> >> I propose to deprecate the related flags and remove the flags and all related code in a later release. >> >> Changes are based on @rkennke changes for JDK 20 [#9810](https://github.com/openjdk/jdk/pull/9810) >> >> Testing: tier1 > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > List only product flags for deprecation. Update java.1 man page. Looks good. 
test/hotspot/jtreg/runtime/CommandLine/VMDeprecatedOptions.java line 69: > 67: Arrays.asList(new String[][] { > 68: {"UseRTMLocking", "false"}, > 69: {"UseRTMDeopt", "false"}, Should `RTMRetryCount` be mentioned here as well? ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18478#pullrequestreview-1961940441 PR Review Comment: https://git.openjdk.org/jdk/pull/18478#discussion_r1540233833 From duke at openjdk.org Tue Mar 26 23:47:38 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 26 Mar 2024 23:47:38 GMT Subject: RFR: 8323116: [REDO] Computational test more than 2x slower when AVX instructions are used Message-ID: <8HUy9c75ZHxW1CyRw6J-xXBPNbtB7wrxoM6rha0ftNU=.d25e5130-84c2-4046-b357-eacfe6caedc0@github.com> The goal of this PR is to improve the performance of convert instructions and address the slowdown when AVX=0 is used. The performance data using the ComputePI.java benchmark (part of this PR) is as follows:
Benchmark (ns/op) | Stock JDK | This PR (AVX=3) | Speedup
-- | -- | -- | --
ComputePI.compute_pi_dbl_flt | 511.34 | 511.226 | 1.0
ComputePI.compute_pi_flt_dbl | 2024.06 | 541.544 | 3.7
ComputePI.compute_pi_int_dbl | 695.482 | 506.546 | 1.4
ComputePI.compute_pi_int_flt | 799.268 | 450.298 | 1.8
ComputePI.compute_pi_long_dbl | 802.992 | 577.984 | 1.4
ComputePI.compute_pi_long_flt | 628.62 | 549.057 | 1.1
Benchmark (ns/op) | Stock JDK | This PR (AVX=0) | Speedup
-- | -- | -- | --
ComputePI.compute_pi_dbl_flt | 473.778 | 472.529 | 1.0
ComputePI.compute_pi_flt_dbl | 536.004 | 538.418 | 1.0
ComputePI.compute_pi_int_dbl | 458.08 | 460.245 | 1.0
ComputePI.compute_pi_int_flt | 477.305 | 476.975 | 1.0
ComputePI.compute_pi_long_dbl | 455.132 | 455.064 | 1.0
ComputePI.compute_pi_long_flt | 474.734 | 476.571 | 1.0
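For readers without the patch handy, a kernel in the spirit of the `compute_pi_flt_dbl` case might look like the sketch below. This is only an assumption about the benchmark's shape, not the actual ComputePI.java from the PR: the class name, method name, and Leibniz-series body are illustrative. The point is that each iteration narrows a double to float and widens it back, so the hot loop is dominated by the convert instructions this PR touches.

```java
public class ComputePiSketch {
    // Leibniz series for pi/4 with a deliberate double -> float -> double
    // round trip per term, so the loop body is mostly conversion work
    // (cvtsd2ss / cvtss2sd on x86).
    static double computePiFltDbl(int iters) {
        double pi = 0.0;
        for (int i = 0; i < iters; i++) {
            float term = (float) (1.0 / (2 * i + 1)); // double -> float
            pi += (i % 2 == 0) ? term : -term;        // float -> double
        }
        return 4.0 * pi;
    }

    public static void main(String[] args) {
        System.out.println(computePiFltDbl(100_000)); // converges toward pi
    }
}
```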
------------- Commit messages: - fix whitespace changes - 8323116: [REDO] Computational test more than 2x slower when AVX instructions are used Changes: https://git.openjdk.org/jdk/pull/18503/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18503&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8323116 Stats: 211 lines in 4 files changed: 205 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/18503.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18503/head:pull/18503 PR: https://git.openjdk.org/jdk/pull/18503 From kvn at openjdk.org Tue Mar 26 23:54:47 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 23:54:47 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v4] In-Reply-To: References: Message-ID: > HotSpot supports RTM (restricted transactional memory) for locking since JDK 8 on Intel's processors ([JDK-8031320](https://bugs.openjdk.org/browse/JDK-8031320)). It was added to other platforms but has since been disabled and removed on all but Intel's processors. There was attempt to deprecate it ([JDK-8292082](https://bugs.openjdk.org/browse/JDK-8292082)) during JDK 20 development but at that time it was decided to keep it. Recently we discussed this with Intel and they agreed with RTM deprecation and removal from HotSpot. > > RTM adds complexity and maintenance burden to HotSpot locking code. It was never enabled by default because it only helped in some cases of heavy lock contention. We are not testing RTM feature since JDK 14 when we problem-list related tests: [JDK-8226899](https://bugs.openjdk.org/browse/JDK-8226899). > > New LIGHTWEIGHT locking implementation will not support RTM locking: [JDK-8320321](https://bugs.openjdk.org/browse/JDK-8320321). > > I propose to deprecate the related flags and remove the flags and all related code in a later release. 
> > Changes are based on @rkennke changes for JDK 20 [#9810](https://github.com/openjdk/jdk/pull/9810) > > Testing: tier1 Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Added RTMRetryCount flag to the test. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18478/files - new: https://git.openjdk.org/jdk/pull/18478/files/923ee7ca..ec7bdbf6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18478&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18478&range=02-03 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18478.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18478/head:pull/18478 PR: https://git.openjdk.org/jdk/pull/18478 From kvn at openjdk.org Tue Mar 26 23:54:47 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 23:54:47 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v3] In-Reply-To: References: Message-ID: <5_sCNzAEdw_a1Mx1MbBpqxaKl_0EBIqnA-W1P4mIT04=.4319b925-e482-4068-aa27-5cd057469f53@github.com> On Tue, 26 Mar 2024 23:27:46 GMT, Vladimir Kozlov wrote: >> HotSpot supports RTM (restricted transactional memory) for locking since JDK 8 on Intel's processors ([JDK-8031320](https://bugs.openjdk.org/browse/JDK-8031320)). It was added to other platforms but has since been disabled and removed on all but Intel's processors. There was attempt to deprecate it ([JDK-8292082](https://bugs.openjdk.org/browse/JDK-8292082)) during JDK 20 development but at that time it was decided to keep it. Recently we discussed this with Intel and they agreed with RTM deprecation and removal from HotSpot. >> >> RTM adds complexity and maintenance burden to HotSpot locking code. It was never enabled by default because it only helped in some cases of heavy lock contention. 
We are not testing RTM feature since JDK 14 when we problem-list related tests: [JDK-8226899](https://bugs.openjdk.org/browse/JDK-8226899). >> >> New LIGHTWEIGHT locking implementation will not support RTM locking: [JDK-8320321](https://bugs.openjdk.org/browse/JDK-8320321). >> >> I propose to deprecate the related flags and remove the flags and all related code in a later release. >> >> Changes are based on @rkennke changes for JDK 20 [#9810](https://github.com/openjdk/jdk/pull/9810) >> >> Testing: tier1 > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > List only product flags for deprecation. Update java.1 man page. @sviswa7, please review these changes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18478#issuecomment-2021668306 From kvn at openjdk.org Tue Mar 26 23:54:48 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Mar 2024 23:54:48 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v3] In-Reply-To: <5GY1zw7gLzFyiSuGrNjuWQqDneh56soSz6f31M3tJ9s=.863b1b22-8839-4f5e-88f0-57a02ebd1bfe@github.com> References: <5GY1zw7gLzFyiSuGrNjuWQqDneh56soSz6f31M3tJ9s=.863b1b22-8839-4f5e-88f0-57a02ebd1bfe@github.com> Message-ID: On Tue, 26 Mar 2024 23:31:50 GMT, Vladimir Ivanov wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> List only product flags for deprecation. Update java.1 man page. > > test/hotspot/jtreg/runtime/CommandLine/VMDeprecatedOptions.java line 69: > >> 67: Arrays.asList(new String[][] { >> 68: {"UseRTMLocking", "false"}, >> 69: {"UseRTMDeopt", "false"}, > > Should `RTMRetryCount` be mentioned here as well? Done. I thought I can only use boolean flags there. 
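To make the exchange above concrete: the test data is just flag/value pairs, and the value is only some legal setting to pass on the command line, so a non-boolean flag fits the same table. The sketch below mirrors the shape of that data and how a pair could be rendered as a `-XX` option; it is a self-contained illustration, not the actual VMDeprecatedOptions.java source.

```java
import java.util.Arrays;
import java.util.List;

public class DeprecatedFlagPairs {
    // Each entry pairs a deprecated flag with a legal sample value; booleans
    // use "false", while an int flag like RTMRetryCount can use any valid
    // number.
    static final List<String[]> DEPRECATED_RTM_FLAGS = Arrays.asList(
            new String[] {"UseRTMLocking", "false"},
            new String[] {"UseRTMDeopt",   "false"},
            new String[] {"RTMRetryCount", "5"});

    // Render one pair as the -XX option a test would pass to a child VM:
    // booleans use the +/- prefix form, other flags use the name=value form.
    static String toOption(String[] pair) {
        boolean isBool = pair[1].equals("true") || pair[1].equals("false");
        return isBool
                ? "-XX:" + (pair[1].equals("true") ? "+" : "-") + pair[0]
                : "-XX:" + pair[0] + "=" + pair[1];
    }

    public static void main(String[] args) {
        DEPRECATED_RTM_FLAGS.forEach(p -> System.out.println(toOption(p)));
    }
}
```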
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18478#discussion_r1540251011 From kxu at openjdk.org Wed Mar 27 00:33:33 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Wed, 27 Mar 2024 00:33:33 GMT Subject: RFR: 8328528: C2 should optimize long-typed parallel iv in an int counted loop Message-ID: <4kAmwISMCRKsdUxHZsI9SdO-8rLy7a3GbFXd2ERlpi0=.8f87e9e4-cb37-43b6-b17e-a6b424e60c83@github.com> Currently, parallel iv optimization only happens in an int counted loop with int-typed parallel IVs. This PR adds support for optimizing long-typed parallel IVs as well. Additionally, this ticket contributes to the resolution of [JDK-8275913](https://bugs.openjdk.org/browse/JDK-8275913). Meanwhile, I'm working on adding support for parallel IV replacement in long counted loops, which will depend on this PR. ------------- Commit messages: - remove WIP support for long counted loops - Merge branch 'master' into long-typed-parallel-iv - update tests - update tests - update tests - clean up code for pr - clean up code for pr - add tests for int counted loops with long iv - use jlong (long long) for ILP32 - Revert "refactor (x & m) u<= m transformation and add test" - ... 
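The loop shape this optimization targets can be sketched as follows; the method name and constants are illustrative, not taken from the patch or its tests.

```java
public class ParallelIvSketch {
    // An int counted loop (iv: i) carrying a long-typed parallel induction
    // variable (acc). With the optimization, C2 can replace the loop-carried
    // update with the closed form: acc = init + stride * (long) tripCount.
    static long sumWithLongIv(int stop) {
        long acc = 5L;                    // long parallel iv, init = 5
        for (int i = 0; i < stop; i++) {  // int counted loop
            acc += 7L;                    // constant stride 7
        }
        return acc;                       // equals 5 + 7L * stop for stop >= 0
    }

    public static void main(String[] args) {
        System.out.println(sumWithLongIv(10)); // prints 75
    }
}
```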
and 8 more: https://git.openjdk.org/jdk/compare/907e30ff...2230c7a6 Changes: https://git.openjdk.org/jdk/pull/18489/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18489&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8328528 Stats: 161 lines in 2 files changed: 149 ins; 0 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/18489.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18489/head:pull/18489 PR: https://git.openjdk.org/jdk/pull/18489 From sviswanathan at openjdk.org Wed Mar 27 00:38:23 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 27 Mar 2024 00:38:23 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v4] In-Reply-To: References: Message-ID: <4UHecgtxmGvaryWrqosdEfQ5N2OyHnWJPylBjYIgr5k=.fad815d9-e21d-4312-8aa5-c425a51ad2ee@github.com> On Tue, 26 Mar 2024 23:54:47 GMT, Vladimir Kozlov wrote: >> HotSpot supports RTM (restricted transactional memory) for locking since JDK 8 on Intel's processors ([JDK-8031320](https://bugs.openjdk.org/browse/JDK-8031320)). It was added to other platforms but has since been disabled and removed on all but Intel's processors. There was attempt to deprecate it ([JDK-8292082](https://bugs.openjdk.org/browse/JDK-8292082)) during JDK 20 development but at that time it was decided to keep it. Recently we discussed this with Intel and they agreed with RTM deprecation and removal from HotSpot. >> >> RTM adds complexity and maintenance burden to HotSpot locking code. It was never enabled by default because it only helped in some cases of heavy lock contention. We are not testing RTM feature since JDK 14 when we problem-list related tests: [JDK-8226899](https://bugs.openjdk.org/browse/JDK-8226899). >> >> New LIGHTWEIGHT locking implementation will not support RTM locking: [JDK-8320321](https://bugs.openjdk.org/browse/JDK-8320321). >> >> I propose to deprecate the related flags and remove the flags and all related code in a later release. 
>> >> Changes are based on @rkennke changes for JDK 20 [#9810](https://github.com/openjdk/jdk/pull/9810) >> >> Testing: tier1 > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Added RTMRetryCount flag to the test. Rest of the changes look good to me. src/java.base/share/man/java.1 line 2787: > 2785: As a result, the processors repeatedly invalidate the cache lines of > 2786: other processors, which forces them to read from main memory instead of > 2787: their cache. We could move this block also to the new location in java.1 after line 3785. ------------- PR Review: https://git.openjdk.org/jdk/pull/18478#pullrequestreview-1962008475 PR Review Comment: https://git.openjdk.org/jdk/pull/18478#discussion_r1540280004 From dholmes at openjdk.org Wed Mar 27 00:41:22 2024 From: dholmes at openjdk.org (David Holmes) Date: Wed, 27 Mar 2024 00:41:22 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v4] In-Reply-To: References: Message-ID: On Tue, 26 Mar 2024 23:54:47 GMT, Vladimir Kozlov wrote: >> HotSpot supports RTM (restricted transactional memory) for locking since JDK 8 on Intel's processors ([JDK-8031320](https://bugs.openjdk.org/browse/JDK-8031320)). It was added to other platforms but has since been disabled and removed on all but Intel's processors. There was attempt to deprecate it ([JDK-8292082](https://bugs.openjdk.org/browse/JDK-8292082)) during JDK 20 development but at that time it was decided to keep it. Recently we discussed this with Intel and they agreed with RTM deprecation and removal from HotSpot. >> >> RTM adds complexity and maintenance burden to HotSpot locking code. It was never enabled by default because it only helped in some cases of heavy lock contention. We are not testing RTM feature since JDK 14 when we problem-list related tests: [JDK-8226899](https://bugs.openjdk.org/browse/JDK-8226899). 
>> >> New LIGHTWEIGHT locking implementation will not support RTM locking: [JDK-8320321](https://bugs.openjdk.org/browse/JDK-8320321). >> >> I propose to deprecate the related flags and remove the flags and all related code in a later release. >> >> Changes are based on @rkennke changes for JDK 20 [#9810](https://github.com/openjdk/jdk/pull/9810) >> >> Testing: tier1 > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Added RTMRetryCount flag to the test. globals.hpp also needs to be updated to mark the flags as `(Deprecated)`. It is curious that one of the documented flags was actually experimental. Otherwise looks good. Thanks ------------- Changes requested by dholmes (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18478#pullrequestreview-1962023169 From kvn at openjdk.org Wed Mar 27 00:50:26 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 27 Mar 2024 00:50:26 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v4] In-Reply-To: <4UHecgtxmGvaryWrqosdEfQ5N2OyHnWJPylBjYIgr5k=.fad815d9-e21d-4312-8aa5-c425a51ad2ee@github.com> References: <4UHecgtxmGvaryWrqosdEfQ5N2OyHnWJPylBjYIgr5k=.fad815d9-e21d-4312-8aa5-c425a51ad2ee@github.com> Message-ID: On Wed, 27 Mar 2024 00:19:33 GMT, Sandhya Viswanathan wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Added RTMRetryCount flag to the test. > > src/java.base/share/man/java.1 line 2787: > >> 2785: As a result, the processors repeatedly invalidate the cache lines of >> 2786: other processors, which forces them to read from main memory instead of >> 2787: their cache. > > We could move this block also to the new location in java.1 after line 3785. Okay, maybe we should keep it for now since the code is still present in the VM. But I think we should remove this big text block when we remove the code. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18478#discussion_r1540293708 From kvn at openjdk.org Wed Mar 27 00:58:37 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 27 Mar 2024 00:58:37 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v5] In-Reply-To: References: Message-ID: <7maJjZcKxExjyBJcHlehcHFAy8bbPGf7qy9e6DkmxIs=.cfdb5db3-a888-49b0-8574-5eadcbc2ffd0@github.com> > HotSpot supports RTM (restricted transactional memory) for locking since JDK 8 on Intel's processors ([JDK-8031320](https://bugs.openjdk.org/browse/JDK-8031320)). It was added to other platforms but has since been disabled and removed on all but Intel's processors. There was attempt to deprecate it ([JDK-8292082](https://bugs.openjdk.org/browse/JDK-8292082)) during JDK 20 development but at that time it was decided to keep it. Recently we discussed this with Intel and they agreed with RTM deprecation and removal from HotSpot. > > RTM adds complexity and maintenance burden to HotSpot locking code. It was never enabled by default because it only helped in some cases of heavy lock contention. We are not testing RTM feature since JDK 14 when we problem-list related tests: [JDK-8226899](https://bugs.openjdk.org/browse/JDK-8226899). > > New LIGHTWEIGHT locking implementation will not support RTM locking: [JDK-8320321](https://bugs.openjdk.org/browse/JDK-8320321). > > I propose to deprecate the related flags and remove the flags and all related code in a later release. > > Changes are based on @rkennke changes for JDK 20 [#9810](https://github.com/openjdk/jdk/pull/9810) > > Testing: tier1 Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Added RTM locking description to man page. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/18478/files - new: https://git.openjdk.org/jdk/pull/18478/files/ec7bdbf6..870547ac Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18478&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18478&range=03-04 Stats: 42 lines in 1 file changed: 42 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18478.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18478/head:pull/18478 PR: https://git.openjdk.org/jdk/pull/18478 From kvn at openjdk.org Wed Mar 27 00:58:38 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 27 Mar 2024 00:58:38 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v5] In-Reply-To: References: <4UHecgtxmGvaryWrqosdEfQ5N2OyHnWJPylBjYIgr5k=.fad815d9-e21d-4312-8aa5-c425a51ad2ee@github.com> Message-ID: <0Gg1zwlvZWiRag4E2Mft2Wxa_Ij-z9D0uexjVnXpT_o=.f634d2ee-0758-44c8-84e6-bdd203dd2db8@github.com> On Wed, 27 Mar 2024 00:48:15 GMT, Vladimir Kozlov wrote: >> src/java.base/share/man/java.1 line 2787: >> >>> 2785: As a result, the processors repeatedly invalidate the cache lines of >>> 2786: other processors, which forces them to read from main memory instead of >>> 2787: their cache. >> >> We could move this block also to the new location in java.1 after line 3785. > > Okay, may be we should keep it for now since the code is still present in VM. > But I think we should remove this big text block when we remove the code. Done ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18478#discussion_r1540308372 From kvn at openjdk.org Wed Mar 27 01:02:30 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 27 Mar 2024 01:02:30 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v4] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 00:39:05 GMT, David Holmes wrote: > globals.hpp also needs to be updated to mark the flags as `(Deprecated)` Thank you, @dholmes-ora, for review. 
Do I understand correctly that I need to add `(Deprecated)` only to these 3 product flags descriptions in `globals.hpp`? ------------- PR Comment: https://git.openjdk.org/jdk/pull/18478#issuecomment-2021734660 From kvn at openjdk.org Wed Mar 27 01:13:44 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 27 Mar 2024 01:13:44 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v6] In-Reply-To: References: Message-ID: > HotSpot supports RTM (restricted transactional memory) for locking since JDK 8 on Intel's processors ([JDK-8031320](https://bugs.openjdk.org/browse/JDK-8031320)). It was added to other platforms but has since been disabled and removed on all but Intel's processors. There was attempt to deprecate it ([JDK-8292082](https://bugs.openjdk.org/browse/JDK-8292082)) during JDK 20 development but at that time it was decided to keep it. Recently we discussed this with Intel and they agreed with RTM deprecation and removal from HotSpot. > > RTM adds complexity and maintenance burden to HotSpot locking code. It was never enabled by default because it only helped in some cases of heavy lock contention. We are not testing RTM feature since JDK 14 when we problem-list related tests: [JDK-8226899](https://bugs.openjdk.org/browse/JDK-8226899). > > New LIGHTWEIGHT locking implementation will not support RTM locking: [JDK-8320321](https://bugs.openjdk.org/browse/JDK-8320321). > > I propose to deprecate the related flags and remove the flags and all related code in a later release. 
> > Changes are based on @rkennke changes for JDK 20 [#9810](https://github.com/openjdk/jdk/pull/9810) > > Testing: tier1 Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Mark flags as (Deprecated) in description ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18478/files - new: https://git.openjdk.org/jdk/pull/18478/files/870547ac..f15fc939 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18478&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18478&range=04-05 Stats: 5 lines in 1 file changed: 2 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18478.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18478/head:pull/18478 PR: https://git.openjdk.org/jdk/pull/18478 From dholmes at openjdk.org Wed Mar 27 01:18:22 2024 From: dholmes at openjdk.org (David Holmes) Date: Wed, 27 Mar 2024 01:18:22 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v6] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 01:13:44 GMT, Vladimir Kozlov wrote: >> HotSpot supports RTM (restricted transactional memory) for locking since JDK 8 on Intel's processors ([JDK-8031320](https://bugs.openjdk.org/browse/JDK-8031320)). It was added to other platforms but has since been disabled and removed on all but Intel's processors. There was attempt to deprecate it ([JDK-8292082](https://bugs.openjdk.org/browse/JDK-8292082)) during JDK 20 development but at that time it was decided to keep it. Recently we discussed this with Intel and they agreed with RTM deprecation and removal from HotSpot. >> >> RTM adds complexity and maintenance burden to HotSpot locking code. It was never enabled by default because it only helped in some cases of heavy lock contention. We are not testing RTM feature since JDK 14 when we problem-list related tests: [JDK-8226899](https://bugs.openjdk.org/browse/JDK-8226899). 
>> >> New LIGHTWEIGHT locking implementation will not support RTM locking: [JDK-8320321](https://bugs.openjdk.org/browse/JDK-8320321). >> >> I propose to deprecate the related flags and remove the flags and all related code in a later release. >> >> Changes are based on @rkennke changes for JDK 20 [#9810](https://github.com/openjdk/jdk/pull/9810) >> >> Testing: tier1 > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Mark flags as (Deprecated) in description Looks good. Thanks ------------- Marked as reviewed by dholmes (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18478#pullrequestreview-1962075883 From sviswanathan at openjdk.org Wed Mar 27 01:18:22 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 27 Mar 2024 01:18:22 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v6] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 01:13:44 GMT, Vladimir Kozlov wrote: >> HotSpot supports RTM (restricted transactional memory) for locking since JDK 8 on Intel's processors ([JDK-8031320](https://bugs.openjdk.org/browse/JDK-8031320)). It was added to other platforms but has since been disabled and removed on all but Intel's processors. There was attempt to deprecate it ([JDK-8292082](https://bugs.openjdk.org/browse/JDK-8292082)) during JDK 20 development but at that time it was decided to keep it. Recently we discussed this with Intel and they agreed with RTM deprecation and removal from HotSpot. >> >> RTM adds complexity and maintenance burden to HotSpot locking code. It was never enabled by default because it only helped in some cases of heavy lock contention. We are not testing RTM feature since JDK 14 when we problem-list related tests: [JDK-8226899](https://bugs.openjdk.org/browse/JDK-8226899). >> >> New LIGHTWEIGHT locking implementation will not support RTM locking: [JDK-8320321](https://bugs.openjdk.org/browse/JDK-8320321). 
>> >> I propose to deprecate the related flags and remove the flags and all related code in a later release. >> >> Changes are based on @rkennke changes for JDK 20 [#9810](https://github.com/openjdk/jdk/pull/9810) >> >> Testing: tier1 > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Mark flags as (Deprecated) in description Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18478#pullrequestreview-1962077129 From kvn at openjdk.org Wed Mar 27 01:47:28 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 27 Mar 2024 01:47:28 GMT Subject: RFR: 8328986: Deprecate UseRTM* flags for removal [v3] In-Reply-To: <5GY1zw7gLzFyiSuGrNjuWQqDneh56soSz6f31M3tJ9s=.863b1b22-8839-4f5e-88f0-57a02ebd1bfe@github.com> References: <5GY1zw7gLzFyiSuGrNjuWQqDneh56soSz6f31M3tJ9s=.863b1b22-8839-4f5e-88f0-57a02ebd1bfe@github.com> Message-ID: On Tue, 26 Mar 2024 23:32:10 GMT, Vladimir Ivanov wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> List only product flags for deprecation. Update java.1 man page. > > Looks good. Thank you @iwanowww, @sviswa7 and @dholmes-ora for reviews. Waiting CSR approval. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18478#issuecomment-2021770366 From duke at openjdk.org Wed Mar 27 06:02:36 2024 From: duke at openjdk.org (Joshua Cao) Date: Wed, 27 Mar 2024 06:02:36 GMT Subject: RFR: 8300148: Consider using a StoreStore barrier instead of Release barrier on ctor exit Message-ID: The [JSR 133 cookbook](https://gee.cs.oswego.edu/dl/jmm/cookbook.html) has long recommended using a `StoreStore` barrier at the end of constructors that write to final fields. `StoreStore` barriers are much cheaper on arm machines as shown in benchmarks in this issue as well as https://bugs.openjdk.org/browse/JDK-8324186. 
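The cookbook recommendation quoted above can be illustrated with a minimal standalone Java sketch. The `Holder` class is hypothetical (not code from the patch); the comment marks where C2 emits the constructor-exit barrier that this change proposes to weaken from `MemBarRelease` to `MemBarStoreStore`:

```java
public class Holder {
    // A final field: the JMM guarantees that a thread which reads a
    // reference to a properly constructed Holder also sees value == 42.
    private final int value;

    public Holder(int value) {
        this.value = value;
        // C2 emits a memory barrier at constructor exit. This patch
        // replaces the MemBarRelease emitted here with the weaker
        // MemBarStoreStore, which only orders the field stores above
        // with the later store that publishes the reference, and is
        // cheaper on ARM's weaker memory model.
    }

    public int value() {
        return value;
    }

    public static void main(String[] args) {
        Holder h = new Holder(42); // this reference store "publishes" the object
        System.out.println(h.value());
    }
}
```

A `StoreStore` barrier suffices for this pattern because safe publication of final fields only requires ordering the field stores before the publishing store, not a full release fence that also orders prior loads.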
This change does not improve the case for constructors of objects with volatile fields because [MemBarRelease is emitted for volatile stores](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/gc/shared/c2/barrierSetC2.cpp#L211). This is demonstrated in test case `classWithVolatile`, where this patch does not impact the IR. I had to modify some code around escape analysis to make sure there are no regressions in eliminating allocations and `StoreStore`s. The [current handling of StoreStores in escape analysis](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/escape.cpp#L2590) makes the assumption that the barrier's input is a `Proj` to an `Allocate` ([example](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/library_call.cpp#L1553)). This is contrary to the barriers at the end of the constructor, where the barrier directly takes in an `Allocate` without an in-between `Proj`. I opted to instead eliminate `StoreStore`s in GVN, exactly as `MemBarRelease` is handled. I had to add [checks for StoreStore in macro.cpp](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/macro.cpp#L636), or else we fail some [cases for reducing allocation merges](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/test/hotspot/jtreg/compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java#L1233-L1256). Passes hotspot tier1 locally on a Linux machine. ### Benchmarks Running Renaissance ParMnemonics on an Amazon Graviton (arm) instance.
Baseline:

Result "org.renaissance.jdk.streams.JmhParMnemonics.run":

  N = 25
  mean = 3309.611 ±(99.9%) 86.699 ms/op

  Histogram, ms/op:
    [3000.000, 3050.000) = 0
    [3050.000, 3100.000) = 4
    [3100.000, 3150.000) = 1
    [3150.000, 3200.000) = 0
    [3200.000, 3250.000) = 0
    [3250.000, 3300.000) = 0
    [3300.000, 3350.000) = 9
    [3350.000, 3400.000) = 6
    [3400.000, 3450.000) = 5

  Percentiles, ms/op:
    p(0.0000) = 3069.910 ms/op
    p(50.0000) = 3348.140 ms/op
    p(90.0000) = 3415.178 ms/op
    p(95.0000) = 3417.057 ms/op
    p(99.0000) = 3417.780 ms/op
    p(99.9000) = 3417.780 ms/op
    p(99.9900) = 3417.780 ms/op
    p(99.9990) = 3417.780 ms/op
    p(99.9999) = 3417.780 ms/op
    p(100.0000) = 3417.780 ms/op

After patch:

Result "org.renaissance.jdk.streams.JmhParMnemonics.run":

  N = 25
  mean = 2765.754 ±(99.9%) 62.062 ms/op

  Histogram, ms/op:
    [2600.000, 2625.000) = 0
    [2625.000, 2650.000) = 4
    [2650.000, 2675.000) = 2
    [2675.000, 2700.000) = 3
    [2700.000, 2725.000) = 0
    [2725.000, 2750.000) = 0
    [2750.000, 2775.000) = 0
    [2775.000, 2800.000) = 5
    [2800.000, 2825.000) = 5
    [2825.000, 2850.000) = 2
    [2850.000, 2875.000) = 3

  Percentiles, ms/op:
    p(0.0000) = 2632.734 ms/op
    p(50.0000) = 2793.454 ms/op
    p(90.0000) = 2871.524 ms/op
    p(95.0000) = 2877.469 ms/op
    p(99.0000) = 2878.872 ms/op
    p(99.9000) = 2878.872 ms/op
    p(99.9900) = 2878.872 ms/op
    p(99.9990) = 2878.872 ms/op
    p(99.9999) = 2878.872 ms/op
    p(100.0000) = 2878.872 ms/op

We see a 16% improvement in throughput. ------------- Commit messages: - Remove unused imports - Handle remaining cases of eliminating StoreStore for escaped objs - Add tests for barriers in constructors - Replace all end of ctor MemBarRelease with MemBarStoreStore - Compute redundancy for StoreStore - 8300148: Consider using a StoreStore barrier instead of Release barrier on ctor exit Changes: https://git.openjdk.org/jdk/pull/18505/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18505&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8300148 Stats: 97 lines in 6 files changed: 92 ins; 0 del; 5 mod Patch:
https://git.openjdk.org/jdk/pull/18505.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18505/head:pull/18505 PR: https://git.openjdk.org/jdk/pull/18505 From epeter at openjdk.org Wed Mar 27 06:21:31 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 27 Mar 2024 06:21:31 GMT Subject: RFR: 8328938: C2 SuperWord: disable vectorization for large stride and scale Message-ID: <9oXO4yuvZbpAxofIUBGVwJ2WyBLPWcP2IHxqZg5nQNQ=.f8f9365c-56c5-4fa9-8075-880f432ac214@github.com> **Problem** In [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190) / https://git.openjdk.org/jdk/pull/14785 I fixed the alignment with `AlignVector`. For that, I had to compute `abs(scale)` and `abs(stride)`, as well as `scale * stride`. The issue is that all of these values can overflow the int range (e.g. `abs(min_int) = min_int`). We hit asserts like: `# assert(is_power_of_2(value)) failed: value must be a power of 2: 0xffffffff80000000` Happens because we take `abs(min_int)`, which is `min_int = 0x80000000`, and assuming this was a positive (unsigned) number is a power of 2 `2^31`. We then expand it to `long`, get `0xffffffff80000000`, which is not a power of 2 anymore. This violates the implicit assumptions, and we hit the assert. `# assert(q >= 1) failed: modulo value must be large enough` We have `scale = 2^30` and `stride = 4 = 2^2`. For the alignment calculation we compute `scale * stride = 2^32`, which overflows the int range and becomes zero. Before [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190) we could get similar issues with the (old) code in `SuperWord::ref_is_alignable`, if `AlignVector` is enabled: int span = preloop_stride * p.scale_in_bytes(); ... if (vw % span == 0) { if `span == 0` because of overflow, then the `idiv` from the modulo gets a division by zero -> `SIGFPE`. But it seems the bug is possibly a regression from JDK20 b2 [JDK-8286197](https://bugs.openjdk.org/browse/JDK-8286197). 
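The two overflows described in the problem statement are plain 32-bit integer arithmetic and can be reproduced outside the VM. This is a standalone sketch, not code from the patch:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        // abs(min_int) overflows: 2^31 does not fit in an int, so the
        // result wraps back to min_int itself and stays negative.
        int minInt = Integer.MIN_VALUE;                // 0x80000000
        System.out.println(Math.abs(minInt));          // -2147483648

        // Sign-extending that "power of 2" bit pattern to 64 bits yields
        // 0xffffffff80000000, which is no longer a power of 2.
        long widened = (long) Math.abs(minInt);
        System.out.println(Long.toHexString(widened)); // ffffffff80000000

        // scale * stride = 2^30 * 2^2 = 2^32 wraps to 0 in int arithmetic.
        int scale = 1 << 30;
        int stride = 4;
        int span = scale * stride;
        System.out.println(span);                      // 0
        // A subsequent "vw % span" with span == 0 is the division by zero
        // behind the reported SIGFPE.
    }
}
```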
Here we enabled certain Unsafe memory access address patterns, and it is such patterns that the reproducer requires. **Solution** I could either patch up all the code that works with `scale` and `stride`, and make sure no overflows ever happen. But that is quite involved and error-prone. I now just disable vectorization for large `scale` and `stride`. This should not have any performance impact, because such large `scale` and `stride` values would lead to highly inefficient memory accesses, since they are spaced very far apart. ------------- Commit messages: - 8328938 Changes: https://git.openjdk.org/jdk/pull/18485/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18485&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8328938 Stats: 266 lines in 2 files changed: 266 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18485.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18485/head:pull/18485 PR: https://git.openjdk.org/jdk/pull/18485 From thartmann at openjdk.org Wed Mar 27 06:33:22 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 27 Mar 2024 06:33:22 GMT Subject: RFR: 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 [v2] In-Reply-To: References: Message-ID: On Tue, 26 Mar 2024 22:43:48 GMT, Volker Simonis wrote: >> Since [JDK-8251462: Simplify compilation policy](https://bugs.openjdk.org/browse/JDK-8251462), introduced in JDK 17, no native wrappers are generated any more if running with `-XX:-TieredCompilation` (i.e. native methods are not compiled any more).
>>
>> The attached JMH benchmark demonstrates that native method calls became twice as expensive with JDK 17:
>>
>> public static native void emptyStaticNativeMethod();
>>
>> @Benchmark
>> public static void baseline() {
>> }
>>
>> @Benchmark
>> public static void staticMethodCallingStatic() {
>> emptyStaticMethod();
>> }
>>
>> @Benchmark
>> public static void staticMethodCallingStaticNative() {
>> emptyStaticNativeMethod();
>> }
>>
>> @Benchmark
>> @Fork(jvmArgsAppend = "-XX:-TieredCompilation")
>> public static void staticMethodCallingStaticNativeNoTiered() {
>> emptyStaticNativeMethod();
>> }
>>
>> @Benchmark
>> @Fork(jvmArgsAppend = "-XX:+PreferInterpreterNativeStubs")
>> public static void staticMethodCallingStaticNativeIntStub() {
>> emptyStaticNativeMethod();
>> }
>>
>>
>> JDK 11
>> ======
>>
>> Benchmark                                            Mode  Cnt   Score   Error  Units
>> NativeCall.baseline                                  avgt    5   0.390 ± 0.016  ns/op
>> NativeCall.staticMethodCallingStatic                 avgt    5   1.693 ± 0.053  ns/op
>> NativeCall.staticMethodCallingStaticNative           avgt    5  10.287 ± 0.754  ns/op
>> NativeCall.staticMethodCallingStaticNativeNoTiered   avgt    5   9.966 ± 0.248  ns/op
>> NativeCall.staticMethodCallingStaticNativeIntStub    avgt    5  20.384 ± 0.444  ns/op
>>
>>
>> JDK 17 & 21
>> ===========
>>
>> Benchmark                                            Mode  Cnt   Score   Error  Units
>> NativeCall.baseline                                  avgt    5   0.390 ± 0.017  ns/op
>> NativeCall.staticMethodCallingStatic                 avgt    5   1.852 ± 0.272  ns/op
>> NativeCall.staticMethodCallingStaticNative           avgt    5  10.648 ± 0.661  ns/op
>> NativeCall.staticMethodCallingStaticNativeNoTiered   avgt    5  20.657 ± 1.084  ns/op
>> NativeCall.staticMethodCallingStaticNativeIntStub    avgt    5  22.429 ± 0.991  ns/op
>>
>>
>> The issue can be seen if we run with `-XX:+PrintCompilation -XX:+PrintInlining`. With JDK 11 we get the following output for `-XX:+TieredCompilation`:
>>
>> 172 111 b 3 io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes)
>> @ 0 io.simonis.NativeCall::emptyStaticNa...
> > Volker Simonis has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 src/hotspot/share/compiler/compilationPolicy.cpp line 1096: > 1094: if (Predicate::apply(method, cur_level, method->invocation_count(), 0)) { > 1095: next_level = CompLevel_full_optimization; > 1096: } Nit: the indentation is wrong. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18496#discussion_r1540544822 From roland at openjdk.org Wed Mar 27 08:18:26 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 27 Mar 2024 08:18:26 GMT Subject: RFR: 8324121: SIGFPE in PhaseIdealLoop::extract_long_range_checks [v2] In-Reply-To: References: Message-ID: On Tue, 26 Mar 2024 16:44:25 GMT, Dean Long wrote: >> Right but isn't it harmless in this particular case? > > No, if it's undefined behavior, we can't be sure what result the C++ compiler will give. And if we test with -ftrapv it will crash. Ok. I filed https://bugs.openjdk.org/browse/JDK-8329163 to track it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18397#discussion_r1540642054 From duke at openjdk.org Wed Mar 27 08:45:55 2024 From: duke at openjdk.org (SUN Guoyun) Date: Wed, 27 Mar 2024 08:45:55 GMT Subject: RFR: 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode [v2] In-Reply-To: References: Message-ID: > This patch prohibits the conversion from "(x+1)+y" into "(x+y)+1" when y is a CallNode to reduce unnecessary spillcode and ADDNode. > > Testing: tier1-3 in x86_64 and LoongArch64 > > JMH in x86_64: >
> before:
> Benchmark           Mode  Cnt      Score   Error  Units
> CallNode.test      thrpt    2  26397.733          ops/s
> 
> after:
> Benchmark           Mode  Cnt      Score   Error  Units
> CallNode.test      thrpt    2  27839.337          ops/s
> 
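The reassociation being restricted in the quoted description can be sketched in plain Java (method names are hypothetical; `bar()` stands in for the call whose result feeds the add):

```java
public class ReassociationDemo {
    // Stands in for the call (CallNode) whose result is added to x + 1.
    static int bar() {
        return 7;
    }

    // Shape the patch preserves: (x + 1) + y with y coming from a call.
    // Keeping x + 1 intact avoids extra spill code around the call just
    // to rebuild the expression as (x + y) + 1 afterwards.
    static int keptShape(int x) {
        int y = bar();
        return (x + 1) + y;
    }

    // Shape C2's reassociation would otherwise produce, which lets the
    // constant sink down and fold with other constants further along.
    static int reassociatedShape(int x) {
        int y = bar();
        return (x + y) + 1;
    }

    public static void main(String[] args) {
        // Both shapes compute the same value; only code generation differs.
        System.out.println(keptShape(5));          // 13
        System.out.println(reassociatedShape(5));  // 13
    }
}
```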
SUN Guoyun has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode - 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18482/files - new: https://git.openjdk.org/jdk/pull/18482/files/14dc5dee..269cd945 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18482&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18482&range=00-01 Stats: 1440 lines in 48 files changed: 457 ins; 735 del; 248 mod Patch: https://git.openjdk.org/jdk/pull/18482.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18482/head:pull/18482 PR: https://git.openjdk.org/jdk/pull/18482 From chagedorn at openjdk.org Wed Mar 27 09:14:36 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 2024 09:14:36 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v4] In-Reply-To: References: Message-ID: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> On Mon, 25 Mar 2024 12:58:55 GMT, Emanuel Peter wrote: >> I'm refactoring the packset, separating the details of packset-manipulation from the SuperWord algorithm. >> >> Most importantly: I split it into two classes: `PairSet` and `PackSet`. >> `combine_pairs_to_longer_packs` converts the first into the second. >> >> I was able to simplify the combining, and remove the pack-sorting. >> I now walk "pair-chains" directly with `PairSetIterator`. One such pair-chain is equivalent to a pack. >> >> I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm.
>> >> I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`: >> Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. we would pack `ac bd` instead of `ab cd`). Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed, if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)). >> >> But since I already spent some time on some of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I. >> >> More details are described in the annotations in the code. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > use left/right instead of s1/s2 in some obvious simple places Thanks for the updates. The PR generally looks good! I now fully reviewed it and left some more comments. src/hotspot/share/opto/superword.cpp line 1088: > 1086: } > 1087: > 1088: bool SuperWord::extend_pairset_with_more_pairs_by_following_def(Node* s1, Node* s2) { Small detail: Here you seem to possibly add multiple pairs while the `following_use` version can only add one pair. Maybe you want to state that in a method comment or adapt the name for the `following_use` version to make that distinction clear. src/hotspot/share/opto/superword.cpp line 1188: > 1186: } > 1187: > 1188: // For a pair (def1. 
def2), find all use packs (use1, use2), and ensure that their inputs have an order Suggestion: // For a pair (def1, def2), find all use packs (use1, use2), and ensure that their inputs have an order src/hotspot/share/opto/superword.cpp line 1306: > 1304: int save_in = 2 - 1; // 2 operations per instruction in packed form > 1305: > 1306: auto adjacent_profit = [&] () { return 2; }; Thinking about this again, you might also just transform this into a constant and use it instead of a local lambda. src/hotspot/share/opto/superword.cpp line 1326: > 1324: > 1325: // uses of result > 1326: uint ct = 0; Should we rename `ct` to something more descriptive like `number_of_matched_pairs` or something similar? src/hotspot/share/opto/superword.cpp line 1394: > 1392: > 1393: SplitStatus PackSet::split_pack(const char* split_name, > 1394: Node_List* pack, Just wondering if you also plan to have a separate `Pack` class at some point instead of using `Node_List`? If it's not worth we might still want to use a typedef to better show what the intent is. But that's something for another day. src/hotspot/share/opto/superword.cpp line 1414: > 1412: Node* n = pack->at(i); > 1413: set_pack(n, nullptr); > 1414: } Should we have a `remove_pack()` method that performs this operation? Then you could also call that just below at L1446 as well. src/hotspot/share/opto/superword.cpp line 1539: > 1537: > 1538: // Split packs at boundaries where left and right have different use or def packs. > 1539: void SuperWord::split_packs_at_use_def_boundaries() { As above with `PairSet`, I'm also wondering here if these `split` methods could be part of `PackSet`? But I have not checked all the method calls and it could very well be that you would need to pass a reference to `SuperWord` to `PackSet`. From a high-level view, "split packs" feels like an operation you perform on a pack set. Same, for example, as `SuperWord::verify_packs()`. 
Either way, I think the patch is already quite big and I think we should do that - if wanted - separately src/hotspot/share/opto/superword.cpp line 1764: > 1762: > 1763: // Can code be generated for the pack, restricted to size nodes? > 1764: bool SuperWord::implemented(const Node_List* pack, uint size) const { While fixing `const` you could also add `const` for `size` (same for parameters of the methods below where you fixed `const`). But could also be done separately. src/hotspot/share/opto/superword.cpp line 2084: > 2082: // Create nodes (from packs and scalar-nodes), and add edges, based on the dependency graph. > 2083: void build() { > 2084: const PackSet& packset = _slp->packset(); You could also store the pack set as field since you access it several times. src/hotspot/share/opto/superword.cpp line 2404: > 2402: for (int i = 0; i < body().length(); i++) { > 2403: Node* n = body().at(i); > 2404: Node_List* p = _packset.pack(n); Since you use this pattern a lot, you could also think about having a `SuperWord::pack()` method that delegates to `_packset.pack()`. src/hotspot/share/opto/superword.cpp line 3754: > 3752: } > 3753: > 3754: void PackSet::print_pack(Node_List* pack) const { Could also be made static since you don't access any fields. src/hotspot/share/opto/superword.hpp line 69: > 67: // Doubly-linked pairs. If not linked: -1 > 68: GrowableArray _left_to_right; // bb_idx -> bb_idx > 69: GrowableArray _right_to_left; // bb_idx -> bb_idx I think it's a good solution but still found myself revisiting this several times while looking at the methods below how it works. Would it maybe help to give a visual example? For example: left_to_right: index: 0 1 2 3 value: | -1 | 3 | -1 | -1 | ... => Node with bb_idx 1 is left in a pair with bb_idx 3. right_to_left: index: 0 1 2 3 value: | -1 | -1 | -1 | 1 | ... => Node with bb_idx 3 is right in a pair with bb_idx 1. 
``` src/hotspot/share/opto/superword.hpp line 88: > 86: bool has_right(int i) const { return _right_to_left.at(i) != -1; } > 87: bool has_left(const Node* n) const { return _vloop.in_bb(n) && has_left( _body.bb_idx(n)); } > 88: bool has_right(const Node* n) const { return _vloop.in_bb(n) && has_right(_body.bb_idx(n)); } What about naming these `is_left/right()` as in `is_left_in_a_left/right_most_pair()`? I think it's more intuitive. src/hotspot/share/opto/superword.hpp line 96: > 94: int get_right_for(int i) const { return _left_to_right.at(i); } > 95: Node* get_right_for(const Node* n) const { return _body.body().at(get_right_for(_body.bb_idx(n))); } > 96: Node* get_right_or_null_for(const Node* n) const { return has_left(n) ? get_right_for(n) : nullptr; } Just a visual comment: These methods are very densely packed here and somewhat hard to read. Could we somehow group them better together? For example, `body()` is unrelated and could be separated by a new line. src/hotspot/share/opto/superword.hpp line 134: > 132: PairSetIterator(const PairSet& pairset) : > 133: _pairset(pairset), _body(pairset.body()), > 134: _chain_start_bb_idx(-1), _current_bb_idx(-1), Not sure if our style guide says anything about multi-line inits but I think we often put them on separate lines. src/hotspot/share/opto/superword.hpp line 140: > 138: } > 139: > 140: bool done() const { return _chain_start_bb_idx >= _end_bb_idx; } I suggest to follow the style of `left()` and add line breaks. src/hotspot/share/opto/superword.hpp line 263: > 261: const VLoopBody& _body; > 262: > 263: // The "packset" proper: an array of "packs" What do you mean by " The "packset" proper"? src/hotspot/share/opto/superword.hpp line 488: > 486: bool do_vector_loop() { return _do_vector_loop; } > 487: > 488: const PackSet& packset() const { return _packset; } Somehow a strange alignment. You might want to fix that. 
test/hotspot/jtreg/compiler/loopopts/superword/TestMulAddS2I.java line 38: > 36: import jdk.test.lib.Platform; > 37: > 38: public class TestMulAddS2I { Might be worth to add 8325252 as `@bug` number as well since you added quite a few tests. ------------- PR Review: https://git.openjdk.org/jdk/pull/18276#pullrequestreview-1960798902 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1539487611 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1539490177 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1539495230 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1539500944 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1539506632 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1539515844 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1539538019 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1540640224 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1540644383 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1540645771 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1540651158 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1540664022 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1540670242 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1540685021 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1540689508 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1540690674 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1540704583 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1540708517 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1540711897 From chagedorn at openjdk.org Wed Mar 27 09:14:36 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 
2024 09:14:36 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v4] In-Reply-To: References: Message-ID: On Thu, 14 Mar 2024 14:44:55 GMT, Emanuel Peter wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> use left/right instead of s1/s2 in some obvious simple places > > src/hotspot/share/opto/superword.cpp line 1369: > >> 1367: left = right; >> 1368: } >> 1369: _packset.add_pack(pack); > > Note: I replaced a quadratic loop, which basically checked all-with-all pairs, if they can be combined. > An additional benefit: we don't need the packs sorted (this also removes the need to have all nodes annotated with alignment, but that will be more useful in a future RFE). That's great! > src/hotspot/share/opto/superword.hpp line 101: > >> 99: int length() const { return _lefts_in_insertion_order.length(); } >> 100: Node* left_at(int i) const { return _body.body().at(_lefts_in_insertion_order.at(i)); } >> 101: Node* right_at(int i) const { return _body.body().at(get_right_for(_lefts_in_insertion_order.at(i))); } > > Note: I hope to get rid of `_lefts_in_insertion_order` eventually, and then these accessors will disappear too. But for now I need them to iterate in insertion order, doing something similar to a DFS in the pair extension. Makes sense, but maybe you can rename them `left/right_at_in_insertion_order()` to avoid misuse? Then you could name the `has_left/right(int i)` -> `left/right_at()`.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1539502304 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1540668934 From chagedorn at openjdk.org Wed Mar 27 09:14:36 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 2024 09:14:36 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v4] In-Reply-To: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> References: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> Message-ID: <-l6FtsyLY4BElis1Rq6eh01CaHqMo2Pdiftj7CBqvGs=.bdbed138-cd2c-43d5-92a2-83bad538f69c@github.com> On Tue, 26 Mar 2024 15:39:43 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> use left/right instead of s1/s2 in some obvious simple places > > src/hotspot/share/opto/superword.cpp line 1414: > >> 1412: Node* n = pack->at(i); >> 1413: set_pack(n, nullptr); >> 1414: } > > Should we have a `remove_pack()` method that performs this operation? Then you could also call that just below at L1446 as well. An additional thought: Should we have a special "unmap_pack_for_node" method that then calls `set_pack(n, nullptr)`? It might improve the readability. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1539524372 From duke at openjdk.org Wed Mar 27 09:22:22 2024 From: duke at openjdk.org (SUN Guoyun) Date: Wed, 27 Mar 2024 09:22:22 GMT Subject: RFR: 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode [v2] In-Reply-To: <9H5q9ooTMan-QGaTGKirdTbWR0P4RSf3GUc5NrUxrSE=.4216de24-8d5f-4bdb-835d-27e0f948fc24@github.com> References: <9H5q9ooTMan-QGaTGKirdTbWR0P4RSf3GUc5NrUxrSE=.4216de24-8d5f-4bdb-835d-27e0f948fc24@github.com> Message-ID: On Tue, 26 Mar 2024 16:34:34 GMT, Vladimir Kozlov wrote: > I think it is not a call node but (x+1) result is used by debug info of call node. For example, next code could be the same issue (I did not verified it): > > ``` > static int y = 0; > ... > int foo(int x, int z) { > int a = x + 1; > y = a; > return a + z; > } > ``` Your example is correct and has the same issue. I will make further modifications. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18482#issuecomment-2022285362 From epeter at openjdk.org Wed Mar 27 09:26:21 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 27 Mar 2024 09:26:21 GMT Subject: RFR: 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode [v2] In-Reply-To: References: <9H5q9ooTMan-QGaTGKirdTbWR0P4RSf3GUc5NrUxrSE=.4216de24-8d5f-4bdb-835d-27e0f948fc24@github.com> Message-ID: On Wed, 27 Mar 2024 09:19:32 GMT, SUN Guoyun wrote: >> I think it is not a call node but (x+1) result is used by debug info of call node. >> For example, next code could be the same issue (I did not verified it): >> >> static int y = 0; >> ... >> int foo(int x, int z) { >> int a = x + 1; >> y = a; >> return a + z; >> } > >> I think it is not a call node but (x+1) result is used by debug info of call node. For example, next code could be the same issue (I did not verified it): >> >> ``` >> static int y = 0; >> ... 
>> int foo(int x, int z) { >> int a = x + 1; >> y = a; >> return a + z; >> } >> ``` > > Your example is correct and has the same issue. I will make further modifications. @sunny868 do you have a benchmark where you can show the difference? Does spilling matter that much if we are already doing a call anyway? I'm just worried that we prevent the constants from sinking down, and commoning (folding together) with other constants further down. In some cases this is also expected by pattern matching in some optimizations, though there would surely be ways to improve those if need be. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18482#issuecomment-2022290380 From epeter at openjdk.org Wed Mar 27 09:26:22 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 27 Mar 2024 09:26:22 GMT Subject: RFR: 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode [v2] In-Reply-To: References: Message-ID: <_2hqXqJsOzK96l-YTZoKW-G0-NW1vN6CGCCPBRCq9hc=.e2578010-b1f7-4214-bcfd-11f67556ab78@github.com> On Wed, 27 Mar 2024 08:45:55 GMT, SUN Guoyun wrote: >> This patch prohibits the conversion from "(x+1)+y" into "(x+y)+1" when y is a CallNode to reduce unnecessary spillcode and ADDNode. >> >> Testing: tier1-3 in x86_64 and LoongArch64 >> >> JMH in x86_64: >>
>> before:
>> Benchmark           Mode  Cnt      Score   Error  Units
>> CallNode.test      thrpt    2  26397.733          ops/s
>> 
>> after:
>> Benchmark           Mode  Cnt      Score   Error  Units
>> CallNode.test      thrpt    2  27839.337          ops/s
>> 
> > SUN Guoyun has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: > > - 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode > - 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode Ah, I just saw your attached benchmark. Will have a look at it later. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18482#issuecomment-2022293394 From epeter at openjdk.org Wed Mar 27 09:35:21 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 27 Mar 2024 09:35:21 GMT Subject: RFR: 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode [v2] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 08:45:55 GMT, SUN Guoyun wrote: >> This patch prohibits the conversion from "(x+1)+y" into "(x+y)+1" when y is a CallNode to reduce unnecessary spillcode and ADDNode. >> >> Testing: tier1-3 in x86_64 and LoongArch64 >> >> JMH in x86_64: >>
>> before:
>> Benchmark           Mode  Cnt      Score   Error  Units
>> CallNode.test      thrpt    2  26397.733          ops/s
>> 
>> after:
>> Benchmark           Mode  Cnt      Score   Error  Units
>> CallNode.test      thrpt    2  27839.337          ops/s
>> 

>> SUN Guoyun has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: >> >> - 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode >> - 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode A possible counter-example: x = something y = someCall for (int i = 0; i < a.length; i++) { a[i] = ((x + 1) + y) + ((x + 2) + y) + ((x + 3) + y) + ((x + 4) + y); } The call is outside the loop, so folding would not be costly at all. And I fear that the 4 terms would not common up, and so be slower after your change. And I think there are probably other examples. But I have not benchmarked anything, so I could be quite wrong. What exactly is it that gives you the speedup in your benchmark? Spilling? Fewer add instructions? Would be nice to understand that better, and to see in which examples we would have regressions with your patch. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18482#issuecomment-2022308731 From simonis at openjdk.org Wed Mar 27 09:59:51 2024 From: simonis at openjdk.org (Volker Simonis) Date: Wed, 27 Mar 2024 09:59:51 GMT Subject: RFR: 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 [v3] In-Reply-To: References: Message-ID: > Since [JDK-8251462: Simplify compilation policy](https://bugs.openjdk.org/browse/JDK-8251462), introduced in JDK 17, no native wrappers are generated any more if running with `-XX:-TieredCompilation` (i.e. native methods are not compiled any more). 
> > The attached JMH benchmark demonstrates that native method calls became twice as expensive with JDK 17: > > public static native void emptyStaticNativeMethod(); > > @Benchmark > public static void baseline() { > } > > @Benchmark > public static void staticMethodCallingStatic() { > emptyStaticMethod(); > } > > @Benchmark > public static void staticMethodCallingStaticNative() { > emptyStaticNativeMethod(); > } > > @Benchmark > @Fork(jvmArgsAppend = "-XX:-TieredCompilation") > public static void staticMethodCallingStaticNativeNoTiered() { > emptyStaticNativeMethod(); > } > > @Benchmark > @Fork(jvmArgsAppend = "-XX:+PreferInterpreterNativeStubs") > public static void staticMethodCallingStaticNativeIntStub() { > emptyStaticNativeMethod(); > } > > > JDK 11 > ====== > > Benchmark Mode Cnt Score Error Units > NativeCall.baseline avgt 5 0.390 ± 0.016 ns/op > NativeCall.staticMethodCallingStatic avgt 5 1.693 ± 0.053 ns/op > NativeCall.staticMethodCallingStaticNative avgt 5 10.287 ± 0.754 ns/op > NativeCall.staticMethodCallingStaticNativeNoTiered avgt 5 9.966 ± 0.248 ns/op > NativeCall.staticMethodCallingStaticNativeIntStub avgt 5 20.384 ± 0.444 ns/op > > > JDK 17 & 21 > =========== > > Benchmark Mode Cnt Score Error Units > NativeCall.baseline avgt 5 0.390 ± 0.017 ns/op > NativeCall.staticMethodCallingStatic avgt 5 1.852 ± 0.272 ns/op > NativeCall.staticMethodCallingStaticNative avgt 5 10.648 ± 0.661 ns/op > NativeCall.staticMethodCallingStaticNativeNoTiered avgt 5 20.657 ± 1.084 ns/op > NativeCall.staticMethodCallingStaticNativeIntStub avgt 5 22.429 ± 0.991 ns/op > > > The issue can be seen if we run with `-XX:+PrintCompilation -XX:+PrintInlining`. With JDK 11 we get the following output for `-XX:+TieredCompilation`: > > 172 111 b 3 io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes) > @ 0 io.simonis.NativeCall::emptyStaticNativeMethod (0 bytes) native method > 172 112 n 0 io.simonis.NativeCall::emptyStaticNativeMethod (native... 
Volker Simonis has updated the pull request incrementally with one additional commit since the last revision: Fix indentation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18496/files - new: https://git.openjdk.org/jdk/pull/18496/files/157e124e..5b017d59 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18496&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18496&range=01-02 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18496.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18496/head:pull/18496 PR: https://git.openjdk.org/jdk/pull/18496 From roland at openjdk.org Wed Mar 27 10:04:44 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 27 Mar 2024 10:04:44 GMT Subject: RFR: 8329163: C2: possible overflow in PhaseIdealLoop::extract_long_range_checks() Message-ID: This change avoids the overflow of `ABS(scale)` when `scale` is `min_jlong`. ------------- Commit messages: - fix Changes: https://git.openjdk.org/jdk/pull/18508/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18508&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8329163 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18508.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18508/head:pull/18508 PR: https://git.openjdk.org/jdk/pull/18508 From chagedorn at openjdk.org Wed Mar 27 10:10:23 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 2024 10:10:23 GMT Subject: RFR: 8329163: C2: possible overflow in PhaseIdealLoop::extract_long_range_checks() In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 09:59:18 GMT, Roland Westrelin wrote: > This change avoids the overflow of `ABS(scale)` when `scale` is `min_jlong`. Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/18508#pullrequestreview-1962845286 From rcastanedalo at openjdk.org Wed Mar 27 10:10:43 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 27 Mar 2024 10:10:43 GMT Subject: RFR: 8320718: C2: comparison folding disregards pinned stores Message-ID: C2 [tries to fold](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/ifnode.cpp#L1326-L1369) pairs of range check-like comparisons into single unsigned comparisons. In [the case](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/ifnode.cpp#L1356-L1365) where there is a third, unrelated comparison in between the two folding candidates, this optimization does not take into account whether there is any node (e.g. a store) pinned to the successful projection of the dominating candidate. This results in C2 folding the comparisons, and the pinned nodes being allowed to be hoisted above the original comparisons, which can lead to miscompilations. See further details in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8320718). This changeset ensures that two comparisons are not folded when there are nodes pinned to the successful projection of the dominating comparison, regardless of whether the two comparisons are [consecutive](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/ifnode.cpp#L1331-L1347) or [separated](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/ifnode.cpp#L1348-L1366) by a third, unrelated test. The changeset adds negative IR tests checking that comparisons are not folded in the scenario described above, and that, consequently, stores are not hoisted above the comparisons. 
It also includes the simplified reproducer contributed by @TobiHartmann and a variant of it that reproduces the issue using the default GCM heuristics. #### Testing - tier1-5, stress test, fuzzing (windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64; release and debug mode). - Tested that, with the changeset, the original JavaFuzzer-generated test does not fail on 1000 `StressGCM` runs (the failure frequency without the changeset is higher than 50%). - Tested manually that the changeset does not affect the optimizations applied to the `compiler/rangechecks/TestExplicitRangeChecks.java` tests ([this starter RFE](https://bugs.openjdk.org/browse/JDK-8329101) proposes automating this task by adding IR checks to the tests). - Tested that the changeset does not introduce performance regressions on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008). ------------- Commit messages: - Remove TODO note - Update copyright years - Add original test case and a variant that always fails - Fix class name in IR test - Disable if-folding if the dominating if has pinned nodes - Add a couple of negative IR tests Changes: https://git.openjdk.org/jdk/pull/18506/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18506&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8320718 Stats: 185 lines in 4 files changed: 181 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/18506.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18506/head:pull/18506 PR: https://git.openjdk.org/jdk/pull/18506 From rcastanedalo at openjdk.org Wed Mar 27 10:32:32 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 27 Mar 2024 10:32:32 GMT Subject: Withdrawn: draft In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 09:03:33 GMT, Roberto Castañeda Lozano wrote: > TBD This pull request has been closed without being integrated. 
------------- PR: https://git.openjdk.org/jdk/pull/18506 From chagedorn at openjdk.org Wed Mar 27 11:34:56 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 2024 11:34:56 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If [v2] In-Reply-To: References: Message-ID: > This is a follow-up to the previous refactoring done in https://github.com/openjdk/jdk/pull/18080. The patch starts to replace the usages of `create_bool_from_template_assertion_predicate()` by providing a refactored and fixed cloning algorithm. > > #### How `create_bool_from_template_assertion_predicate()` Works > Currently, the algorithm in `create_bool_from_template_assertion_predicate()` uses an iterative DFS walk to find all nodes of a Template Assertion Predicate Expression in order to clone them. We do the following: > 1. Follow all inputs if they could be a node that's part of a Template Assertion Predicate (compares opcodes): > https://github.com/openjdk/jdk/blob/326c91e1a28ec70822ef927ee9ab17f79aa6d35c/src/hotspot/share/opto/loopTransform.cpp#L1513 > > 2. Once we find an `OpaqueLoopInit` or `OpaqueLoopStride` node, we start backtracking in the DFS. While doing so, we start to clone all nodes on the path from the `OpaqueLoop*Nodes` node to the start node and already update the graph. This logic is quite complex and difficult to understand since we do everything simultaneously. This was one of the reasons why I originally tried to refactor this method in https://github.com/openjdk/jdk/pull/16877 because I needed to extend it for the full fix of Assertion Predicates in JDK-8288981. > > #### Missing Visited Set > The current implementation of `create_bool_from_template_assertion_predicate()` does not use a visited set. 
This means that whenever we find a diamond shape, we could visit a node twice and re-discover all paths above this diamond again: > > > ... > | > E > | > D > / \ > B C > \ / > A > > DFS walk: A -> B -> D -> E -> ... -> C -> D -> E -> ... > > With each diamond, the number of revisits of each node above doubles. > > #### Endless DFS in Edge-Cases > In most cases, we would normally just stop quite quickly once we follow a data node that is not part of a Template Assertion Predicate Expression because the node opcode is different. However, in the test cases, we create a long chain of data nodes with many diamonds that could all be part of a Template Assertion Predicate Expression (i.e. `is_part_of_template_assertion_predicate_bool()` would return true to follow the inputs in a DFS walk). As a result, the DFS revisits a lot of nodes, especially higher up in the graph, exponentially many times and compilation is stuck for a long time (running the test cases results in a test timeout because background compilation is disabled). > > #### New DFS Implem... Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains four additional commits since the last revision: - Change from DFS to 2xBFS - Review Emanuel first part - Merge branch 'master' into JDK-8327110 - 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix pure cloning cases used for Loop Unswitching and Split If ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18293/files - new: https://git.openjdk.org/jdk/pull/18293/files/be2a91d4..dbd0caba Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18293&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18293&range=00-01 Stats: 353492 lines in 2829 files changed: 17205 ins; 12519 del; 323768 mod Patch: https://git.openjdk.org/jdk/pull/18293.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18293/head:pull/18293 PR: https://git.openjdk.org/jdk/pull/18293 From chagedorn at openjdk.org Wed Mar 27 11:34:57 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 2024 11:34:57 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If [v2] In-Reply-To: References: <130ezssvWCgkOjqeun4yPh5X8ypdumhU1uQLfkW9DV8=.4dbd2fd8-1f0d-4487-bd25-536b28084f32@github.com> Message-ID: On Wed, 20 Mar 2024 16:32:14 GMT, Emanuel Peter wrote: >> Maybe put it in the scope of `TemplateAssertionPredicateExpression` class? Not sure about this idea yet, just an idea. > > Should this not be in `predicates.hpp`, together with its implementations? > Maybe put it in the scope of `TemplateAssertionPredicateExpression` class? Not sure about this idea yet, just an idea. We also need to know about this class in `DataNodeGraph`. Therefore, we cannot make it an inner class. But I've moved it to `predicates.hpp` which is indeed better suited. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1537652316 From chagedorn at openjdk.org Wed Mar 27 11:34:57 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 2024 11:34:57 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If [v2] In-Reply-To: References: Message-ID: On Thu, 21 Mar 2024 13:32:23 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/predicates.cpp line 301: >> >>> 299: } >>> 300: } >>> 301: }; >> >> I think this is now correct. But it is 100 lines to perform and explain this DFS with backtracking. >> >> On the other hand doing 2 BFS would just be 20+ lines. >> >> >> Unique_Node_List collected; >> >> Unique_Node_List input_traversal; >> input_traversal.push(start_node); >> for (int i = 0; i < input_traversal.length(); i++) { >> Node* n = input_traversal.at(i); >> for (int j = 1; j < n->req(); j++) { >> Node* input = n->in(j); >> if (_is_target_node(input)) { >> collected.push(input); // mark as target, where we start backtracking. >> } else if(_filter(input)) { >> input_traversal.push(input); // continue BFS. >> } >> } >> } >> assert(!collected.is_empty(), "must find some targets"); >> for (int i = 0; i < collected.length(); i++) { >> Node* n = collected.at(i); >> for (output : n->fastout()) { // pseudocode >> if (input_traversal.contains(output)) { >> collected.push(output); // backtrack through nodes of input traversal >> } >> } >> } >> assert(collected.contains(start_node), "must find start node again"); > > But this is a matter of taste. The data structures probably have roughly the same size. And also runtime is probably basically the same. I think you're right. The typical Template Assertion Predicate Expression only has few nodes. So, I totally agree with you that we should go with a much simpler double BFS version instead of a single DFS walk. 
And if we ever find the need to have a single DFS walk with backtracking where a single walk is expensive, we could still go back to this PR and revive this code. Updated code. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1540933983 From chagedorn at openjdk.org Wed Mar 27 11:34:57 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 2024 11:34:57 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If [v2] In-Reply-To: References: <130ezssvWCgkOjqeun4yPh5X8ypdumhU1uQLfkW9DV8=.4dbd2fd8-1f0d-4487-bd25-536b28084f32@github.com> Message-ID: On Mon, 25 Mar 2024 14:04:21 GMT, Christian Hagedorn wrote: >> Should this not be in `predicates.hpp`, together with its implementations? > >> Maybe put it in the scope of `TemplateAssertionPredicateExpression` class? Not sure about this idea yet, just an idea. > > We also need to know about this class in `DataNodeGraph`. Therefore, we cannot make it an inner class. But I've moved it to `predicates.hpp` which is indeed better suited. > The name could reflect that it is only for template assertion predicates. From the point of usage, yes. But I'm not sure if we should really squeeze "TemplateAssertionPredicates" into the already quite long name. I'm more inclined to leave it as it is. And the good thing is that `OpaqueLoopNodes` already gives a hint that it's connected to template assertion predicates. 
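[Editor's note] The two-pass BFS quoted earlier in this thread (a first pass over inputs to find the target nodes, then a second pass to backtrack towards the start node) can be modeled on a plain adjacency map. This is a stand-alone illustration, not the C2 `Unique_Node_List` code:

```java
import java.util.*;

// Stand-alone model of the two-pass idea: pass 1 walks inputs from the start
// node; pass 2 keeps exactly those reached nodes that sit on an input path
// from the start node down to a target node.
public class TwoPassBfs {
    public static Set<Integer> collect(Map<Integer, List<Integer>> inputs,
                                       int start, Set<Integer> targets) {
        // Pass 1: BFS over inputs, recording every reachable non-target node.
        Set<Integer> reached = new LinkedHashSet<>();
        reached.add(start);
        Deque<Integer> work = new ArrayDeque<>(List.of(start));
        while (!work.isEmpty()) {
            for (int in : inputs.getOrDefault(work.poll(), List.of())) {
                if (!targets.contains(in) && reached.add(in)) {
                    work.add(in); // continue BFS over inputs
                }
            }
        }
        // Pass 2: backtrack. A node is kept once one of its inputs is a
        // target or an already-kept node; iterate until a fixpoint.
        Set<Integer> collected = new LinkedHashSet<>();
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int n : reached) {
                if (collected.contains(n)) continue;
                for (int in : inputs.getOrDefault(n, List.of())) {
                    if (targets.contains(in) || collected.contains(in)) {
                        collected.add(n);
                        changed = true;
                        break;
                    }
                }
            }
        }
        return collected;
    }
}
```

Only the nodes on input paths between the start node and a target end up in the result; unrelated inputs discovered in pass 1 are discarded during backtracking.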
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1537657731 From chagedorn at openjdk.org Wed Mar 27 11:34:56 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 2024 11:34:56 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If [v2] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 11:31:46 GMT, Christian Hagedorn wrote: >> This is a follow-up to the previous refactoring done in https://github.com/openjdk/jdk/pull/18080. The patch starts to replace the usages of `create_bool_from_template_assertion_predicate()` by providing a refactored and fixed cloning algorithm. >> >> #### How `create_bool_from_template_assertion_predicate()` Works >> Currently, the algorithm in `create_bool_from_template_assertion_predicate()` uses an iterative DFS walk to find all nodes of a Template Assertion Predicate Expression in order to clone them. We do the following: >> 1. Follow all inputs if they could be a node that's part of a Template Assertion Predicate (compares opcodes): >> https://github.com/openjdk/jdk/blob/326c91e1a28ec70822ef927ee9ab17f79aa6d35c/src/hotspot/share/opto/loopTransform.cpp#L1513 >> >> 2. Once we find an `OpaqueLoopInit` or `OpaqueLoopStride` node, we start backtracking in the DFS. While doing so, we start to clone all nodes on the path from the `OpaqueLoop*Nodes` node to the start node and already update the graph. This logic is quite complex and difficult to understand since we do everything simultaneously. This was one of the reasons, I've originally tried to refactor this method in https://github.com/openjdk/jdk/pull/16877 because I needed to extend it for the full fix of Assertion Predicates in JDK-8288981. >> >> #### Missing Visited Set >> The current implementation of `create_bool_from_template_assertion_predicate()` does not use a visited set. 
This means that whenever we find a diamond shape, we could visit a node twice and re-discover all paths above this diamond again: >> >> >> ... >> | >> E >> | >> D >> / \ >> B C >> \ / >> A >> >> DFS walk: A -> B -> D -> E -> ... -> C -> D -> E -> ... >> >> With each diamond, the number of revisits of each node above doubles. >> >> #### Endless DFS in Edge-Cases >> In most cases, we would normally just stop quite quickly once we follow a data node that is not part of a Template Assertion Predicate Expression because the node opcode is different. However, in the test cases, we create a long chain of data nodes with many diamonds that could all be part of a Template Assertion Predicate Expression (i.e. `is_part_of_template_assertion_predicate_bool()` would return true to follow the inputs in a DFS walk). As a result, the DFS revisits a lot of nodes, especially higher up in the graph, exponentially many times and compilation is stuck for a long time (running the test cases result in a test timeout because... > > Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Change from DFS to 2xBFS > - Review Emanuel first part > - Merge branch 'master' into JDK-8327110 > - 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix pure cloning cases used for Loop Unswitching and Split If Pushed an update addressing your comments @eme64. 
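[Editor's note] The exponential blow-up from a DFS without a visited set, described in the quoted text above, is easy to measure with a small counting model: each stacked diamond doubles the number of times the nodes above it are visited.

```java
// Counts how often a DFS without a visited set reaches the top of a chain of
// k stacked diamonds: both diamond inputs are followed independently, so the
// visit count of everything above doubles with every diamond.
public class DiamondRevisits {
    static long visitsOfTop;

    static void dfs(int level, int topLevel) {
        if (level == topLevel) {
            visitsOfTop++; // reached the node above all diamonds
            return;
        }
        dfs(level + 1, topLevel); // left input of the diamond
        dfs(level + 1, topLevel); // right input: revisited without a visited set
    }

    public static long countTopVisits(int diamonds) {
        visitsOfTop = 0;
        dfs(0, diamonds);
        return visitsOfTop;
    }
}
```

With 10 stacked diamonds the top node is already visited 1024 times, so a long chain of such diamonds quickly dominates compilation time.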
------------- PR Review: https://git.openjdk.org/jdk/pull/18293#pullrequestreview-1957817997 From chagedorn at openjdk.org Wed Mar 27 11:34:57 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 2024 11:34:57 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If [v2] In-Reply-To: References: Message-ID: On Wed, 20 Mar 2024 16:16:02 GMT, Emanuel Peter wrote: >> Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: >> >> - Change from DFS to 2xBFS >> - Review Emanuel first part >> - Merge branch 'master' into JDK-8327110 >> - 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix pure cloning cases used for Loop Unswitching and Split If > > src/hotspot/share/opto/loopnode.hpp line 1662: > >> 1660: ParsePredicateSuccessProj* fast_loop_parse_predicate_proj, >> 1661: ParsePredicateSuccessProj* slow_loop_parse_predicate_proj); >> 1662: IfProjNode* clone_assertion_predicate_for_unswitched_loops(Node* template_assertion_predicate, IfProjNode* predicate, > > Suggestion: > > IfProjNode* clone_assertion_predicate_for_unswitched_loops(IfNode* template_assertion_predicate, IfProjNode* predicate, > > Could we improve the type? I think all uses have `IfNode*`. Sure, the code will eventually be removed but I guess it does not hurt to still improve the type. > src/hotspot/share/opto/loopnode.hpp line 1937: > >> 1935: } >> 1936: >> 1937: // Create a copy of the provided data node collection by doing the following: > > By "collection" you mean the `_data_nodes`, right? maybe some renaming could be helpful? Updated the comment to make it more clear. 
The caller of this method should not need to know about the internal `_data_nodes` field. > src/hotspot/share/opto/loopopts.cpp line 4542: > >> 4540: >> 4541: void DataNodeGraph::transform_opaque_node(TransformStrategyForOpaqueLoopNodes& transform_strategy, Node* node) { >> 4542: const uint next_idx = _phase->C->unique(); > > Does this have any use? Good catch, that's an unused leftover - removed. > src/hotspot/share/opto/predicates.cpp line 189: > >> 187: >> 188: NodeCheck _node_filter; // Node filter function to decide if we should process a node or not while searching for targets. >> 189: NodeCheck _is_target_node; // Function to decide if a node is a target node (i.e. where we should start backtracking). > > There should be some remark that all target nodes must pass the filter. Good point, updated the comment > src/hotspot/share/opto/predicates.cpp line 306: > >> 304: Opaque4Node* TemplateAssertionPredicateExpression::clone(TransformStrategyForOpaqueLoopNodes& transform_strategy, >> 305: Node* new_ctrl, PhaseIdealLoop* phase) { >> 306: ResourceMark rm; > > The ResourceMark makes me a bit nervous, in combination of a non-constant `transform_strategy`. > Could the `transform_strategy` be a constant reference, maybe by making its functions also const? Done. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1537635240 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1537659968 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1537662060 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1537664444 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1537674109 From shade at openjdk.org Wed Mar 27 12:28:23 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 27 Mar 2024 12:28:23 GMT Subject: RFR: 8300148: Consider using a StoreStore barrier instead of Release barrier on ctor exit In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 05:58:34 GMT, Joshua Cao wrote: > The [JSR 133 cookbook](https://gee.cs.oswego.edu/dl/jmm/cookbook.html) has long recommended using a `StoreStore` barrier at the end of constructors that write to final fields. `StoreStore` barriers are much cheaper on arm machines as shown in benchmarks in this issue as well as https://bugs.openjdk.org/browse/JDK-8324186. > > This change does not improve the case for constructors for objects with volatile fields because [MemBarRelease is emitted for volatile stores](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/gc/shared/c2/barrierSetC2.cpp#L211). This is demonstrated in test case `classWithVolatile`, where this patch does not impact the IR. > > I had to modify some code around escape analysis to make sure there are no regressions in eliminating allocations and `StoreStore`'s. The [current handling of StoreStore's in escape analysis](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/escape.cpp#L2590) makes the assumption that the barrier's input is a `Proj` to an `Allocate` ([example](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/library_call.cpp#L1553)). 
This is contrary to the barriers at the end of the constructor, where the barrier directly takes in an `Allocate` without an in-between `Proj`. I opted to instead eliminate `StoreStore`s in GVN, exactly as `MemBarRelease` is handled. > I had to add [checks for StoreStore in macro.cpp](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/macro.cpp#L636), or else we fail some [cases for reducing allocation merges](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/test/hotspot/jtreg/compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java#L1233-L1256). > > Passes hotspot tier1 locally on a Linux machine. > > ### Benchmarks > > Running Renaissance ParMnemonics on an Amazon Graviton (arm) instance. > > Baseline: > > Result "org.renaissance.jdk.streams.JmhParMnemonics.run": > N = 25 > mean = 3309.611 ±(99.9%) 86.699 ms/op > > Histogram, ms/op: > [3000.000, 3050.000) = 0 > [3050.000, 3100.000) = 4 > [3100.000, 3150.000) = 1 > [3150.000, 3200.000) = 0 > [3200.000, 3250.000) = 0 > [3250.000, 3300.000) = 0 > [3300.000, 3350.000) = 9 > [3350.000, 3400.000) = 6 > [3400.000, 3450.000) = 5 > > Percentiles, ms/op: > p(0.0000) = 3069.910 ms/op > p(50.0000) = 3348.140 ms/op > ... I think you want to merge from master to get the clean GHA runs. I am running some more tests here too. src/hotspot/share/opto/parse1.cpp line 1002: > 1000: // 3. On processors which are not CPU_MULTI_COPY_ATOMIC (e.g. PPC64), > 1001: // support_IRIW_for_not_multiple_copy_atomic_cpu selects that > 1002: // MemBarStoreStore is used before volatile load instead of after volatile This change is superfluous: `MemBarVolatile` is still emitted before volatile loads in those cases. 
test/hotspot/jtreg/compiler/c2/irTests/ConstructorBarriers.java line 35: > 33: * @run main compiler.c2.irTests.ConstructorBarriers > 34: */ > 35: public class ConstructorBarriers { I think this test would benefit from additional scalar replacement cases. This would test EA behavior with this barrier. Basically, return the field value, not the instance? test/hotspot/jtreg/compiler/c2/irTests/ConstructorBarriers.java line 36: > 34: */ > 35: public class ConstructorBarriers { > 36: private class ClassBasic { I think these should be `static` to avoid capturing the enclosing class reference accidentally. test/hotspot/jtreg/compiler/c2/irTests/ConstructorBarriers.java line 73: > 71: @IR(counts = {IRNode.MEMBAR_STORESTORE, "1"}) > 72: @IR(counts = {IRNode.MEMBAR_RELEASE, "1"}) > 73: @IR(counts = {IRNode.MEMBAR_VOLATILE, "1"}) As per comment for `IRIW*` architectures, volatile barrier would be expected only on specific platforms. I think we want to `@require` this test as x86_64- and AArch64-specific. ------------- PR Review: https://git.openjdk.org/jdk/pull/18505#pullrequestreview-1962878529 PR Review Comment: https://git.openjdk.org/jdk/pull/18505#discussion_r1540818154 PR Review Comment: https://git.openjdk.org/jdk/pull/18505#discussion_r1540842433 PR Review Comment: https://git.openjdk.org/jdk/pull/18505#discussion_r1541000643 PR Review Comment: https://git.openjdk.org/jdk/pull/18505#discussion_r1540832688 From chagedorn at openjdk.org Wed Mar 27 13:55:37 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 2024 13:55:37 GMT Subject: RFR: 8328702: C2: Crash during parsing because sub type check is not folded Message-ID: The test case shows a problem where data is folded during parsing while control is not. This leaves the graph in a broken state and we fail with an assertion. We have the following (pseudo) code for some class `X`: o = flag ? 
new Object[] : new byte[]; if (o instanceof X) { X x = (X)o; // checkcast } For the `checkcast`, C2 knows that the type of `o` is some kind of array, i.e. type `[bottom`. But this cannot be a sub type of `X`. Therefore, the `CheckCastPP` node created for the `checkcast` result is replaced by top by the type system. However, the `SubTypeCheckNode` for the `checkcast` is not folded and the graph is broken. The problem of not folding the `SubTypeCheckNode` can be traced back to `SubTypeCheckNode::sub` calling `static_subtype_check()` when transforming the node after its creation. `static_subtype_check()` should detect that the sub type check is always wrong here: https://github.com/openjdk/jdk/blob/d0a265039a36292d87b249af0e8977982e5acc7b/src/hotspot/share/opto/compile.cpp#L4454-L4460 But it does not because these two checks return the following: 1. Check: is `o` a sub type of `X`? -> returns no, so far so good. 2. Check: _could_ `o` be a sub type of `X`? -> returns no which is wrong! `[bottom` is only a sub type of `Object` and can never be a subtype of `X`. In `maybe_java_subtype_of_helper_for_arr()`, we wrongly conclude that any array with a base element type `bottom` _could_ be a sub type of anything: https://github.com/openjdk/jdk/blob/d0a265039a36292d87b249af0e8977982e5acc7b/src/hotspot/share/opto/type.cpp#L6462-L6465 But this is only true if the super class is also an array class - but not if `other` (super klass) is an instance klass as in this case. The fix for this is to first check the immediately following check which handles the case of comparing an array klass to an instance klass: An array klass can only ever be a sub class of an instance klass if it's the `Object` class. But in our case, we have `X` and this would return false: https://github.com/openjdk/jdk/blob/d0a265039a36292d87b249af0e8977982e5acc7b/src/hotspot/share/opto/type.cpp#L6466-L6468 The very same problem can also be triggered with `X` being an interface instead.
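A reduced, runnable Java sketch of the failing shape (made compilable by giving the arrays a length; the authoritative regression tests are in the PR):

```java
// `o` is always some array ([bottom to C2), so the instanceof/checkcast
// against X can never succeed: both the data path (CheckCastPP) and the
// control path (SubTypeCheckNode) should fold consistently.
class X {
}

class SubTypeCheckExample {
    static boolean castsToX(boolean flag) {
        Object o = flag ? new Object[1] : new byte[1];
        if (o instanceof X) {
            X x = (X) o; // checkcast: statically impossible here
            return x != null;
        }
        return false;
    }
}
```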
There are tests for both these cases. #### Additionally Required Fix When running with `-XX:+ExpandSubTypeCheckAtParseTime`, we eagerly expand the sub type check during parsing and therefore do not emit a `SubTypeCheckNode`. When additionally running with `-XX:+StressReflectiveCode`, the static sub type check performed in `static_subtype_check()` is skipped when emitting the expanded sub type check (`skip` is default-initialized with the flag value of `StressReflectiveCode`): https://github.com/openjdk/jdk/blob/d0a265039a36292d87b249af0e8977982e5acc7b/src/hotspot/share/opto/compile.cpp#L4450-L4452 And again, the graph is left in a broken state because the sub type check cannot be folded. I therefore propose to always perform the static sub type check when a sub type check is expanded during parsing (i.e. if `ExpandSubTypeCheckAtParseTime` is set). I've added a run with these flags as well. Thanks, Christian ------------- Commit messages: - 8328702: C2: Crash during parsing because sub type check is not folded Changes: https://git.openjdk.org/jdk/pull/18512/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18512&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8328702 Stats: 143 lines in 3 files changed: 139 ins; 2 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18512.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18512/head:pull/18512 PR: https://git.openjdk.org/jdk/pull/18512 From epeter at openjdk.org Wed Mar 27 14:01:33 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 27 Mar 2024 14:01:33 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v4] In-Reply-To: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> References: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> Message-ID: On Tue, 26 Mar 2024 15:36:35 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the
last revision: >> >> use left/right instead of s1/s2 in some obvious simple places > > src/hotspot/share/opto/superword.cpp line 1394: > >> 1392: >> 1393: SplitStatus PackSet::split_pack(const char* split_name, >> 1394: Node_List* pack, > > Just wondering if you also plan to have a separate `Pack` class at some point instead of using `Node_List`? If it's not worth we might still want to use a typedef to better show what the intent is. But that's something for another day. I will keep that in mind. It could really improve readability. Let's do it in a separate RFE! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1541172959 From dnsimon at openjdk.org Wed Mar 27 14:04:46 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 27 Mar 2024 14:04:46 GMT Subject: RFR: 8329191: JVMCI compiler warning is truncated Message-ID: $ java -Djdk.graal.CompilerConfiguration=XXcommunity HelloWorld [0.035s][warning][jit,compilation] JVMCI compiler disabled after 11 of 11 upcalls had errors (Last error: "uncaught exception in call_HotSpotJVMCIRuntime_compileMethod [jdk.graal.compiler.debug.GraalError: Compiler configuration 'XXcommunity' not found. Available configurations are: enterp The above message is truncated. It should be: [0.032s][warning][jit,compilation] JVMCI compiler disabled after 11 of 11 upcalls had errors (Last error: "uncaught exception in call_HotSpotJVMCIRuntime_compileMethod [jdk.graal.compiler.debug.GraalError: Compiler configuration 'XXcommunity' not found. Available configurations are: enterprise, community, economy]"). Use -Xlog:jit+compilation for more detail. This PR fixes this by using `stringStream` instead of `err_msg` when creating these messages. 
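The behavioral difference between the two utilities can be illustrated with a Java analogue (hypothetical helper methods, not JVMCI code): a fixed-capacity buffer truncates like `err_msg`, while a growable buffer behaves like `stringStream`:

```java
class MessageBuffers {
    // Analogue of HotSpot's err_msg: formats into a fixed-size buffer and
    // silently truncates whatever does not fit (capacity is a parameter here;
    // the real buffer size in HotSpot is an implementation detail).
    static String fixedBuffer(String message, int capacity) {
        return message.length() <= capacity ? message : message.substring(0, capacity);
    }

    // Analogue of HotSpot's stringStream: grows as needed, never truncates.
    static String growingBuffer(String message) {
        return new StringBuilder().append(message).toString();
    }
}
```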
------------- Commit messages: - mitigate against truncation of JVMCI error messages Changes: https://git.openjdk.org/jdk/pull/18513/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18513&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8329191 Stats: 10 lines in 2 files changed: 6 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/18513.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18513/head:pull/18513 PR: https://git.openjdk.org/jdk/pull/18513 From epeter at openjdk.org Wed Mar 27 14:39:35 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 27 Mar 2024 14:39:35 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v4] In-Reply-To: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> References: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> Message-ID: On Tue, 26 Mar 2024 15:50:29 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> use left/right instead of s1/s2 in some obvious simple places > > src/hotspot/share/opto/superword.cpp line 1539: > >> 1537: >> 1538: // Split packs at boundaries where left and right have different use or def packs. >> 1539: void SuperWord::split_packs_at_use_def_boundaries() { > > As above with `PairSet`, I'm also wondering here if these `split` methods could be part of `PackSet`? But I have not checked all the method calls and it could very well be that you would need to pass a reference to `SuperWord` to `PackSet`. From a high-level view, "split packs" feels like an operation you perform on a pack set. Same, for example, as `SuperWord::verify_packs()`. > > Either way, I think the patch is already quite big and I think we should do that - if wanted - separately Yes, let's consider that for a future RFE. 
But for now I can say this: We moved the implementation of `split / filter` to the packset, and that makes sense: we can hide the packset internals, and don't have them spill out to the SuperWord code. We can just call `_packset.split_packs` with a `split_strategy`. But defining the split strategy itself depends on other SuperWord components, and so I think conceptually they belong with SuperWord. They query `reductions`, `dependency graph`, `implemented`, `AlignmentSolution`, `profitable`, etc. Sure, we can just pass a SuperWord reference, but that does not really seem right to me either. For me, the SuperWord class is there to manage the interface between all the components, and try to avoid passing components to other components, wherever possible. So I would rather have a list of methods in SuperWord, and each such method defines how the components interact (e.g. packset and alignment, packset and AlignmentSolution, pairset and packset, ...). But we can discuss this further, and maybe come up with an even better solution. My hope is just that we separate the components as much as possible, so that we know that only a handful of them interact at a given point. That makes the whole beast more manageable.
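The division of labour described above (the pack set owns the splitting mechanics, SuperWord supplies the policy) can be sketched with a Java analogue; all names are illustrative, the real code is C++ in superword.cpp:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class PackSetSketch {
    // A "pack" is modeled as an ordered list of node ids.
    final List<List<Integer>> packs = new ArrayList<>();

    // The pack set owns the mechanics; the caller-provided strategy decides,
    // per pack, the index at which to split (or -1 for "do not split").
    void splitPacks(Function<List<Integer>, Integer> splitStrategy) {
        List<List<Integer>> result = new ArrayList<>();
        for (List<Integer> pack : packs) {
            int at = splitStrategy.apply(pack);
            if (at <= 0 || at >= pack.size()) {
                result.add(pack); // strategy chose not to split
            } else {
                result.add(new ArrayList<>(pack.subList(0, at)));
                result.add(new ArrayList<>(pack.subList(at, pack.size())));
            }
        }
        packs.clear();
        packs.addAll(result);
    }
}
```

SuperWord-side policies (reductions, dependency graph, profitability) would then be expressed as such strategy callbacks, without the pack set ever seeing those components.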
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1541244299 From epeter at openjdk.org Wed Mar 27 14:50:35 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 27 Mar 2024 14:50:35 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v4] In-Reply-To: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> References: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> Message-ID: On Wed, 27 Mar 2024 08:18:06 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> use left/right instead of s1/s2 in some obvious simple places > > src/hotspot/share/opto/superword.cpp line 2084: > >> 2082: // Create nodes (from packs and scalar-nodes), and add edges, based on the dependency graph. >> 2083: void build() { >> 2084: const PackSet& packset = _slp->packset(); > > You could also store the pack set as field since you access it several times. You are right, I could actually remove the SuperWord reference from the `PacksetGraph` completely, and just pass the components, as `const` references. That would be really nice. If it is ok for you, I will do this in a future RFE. Actually, I plan to completely overhaul the `PacksetGraph`. It will be transformed to the `VTransformGraph`, and it will do: - Cycle checking (like today) - Evaluate the cost-model - Execute: each node knows how to replace its packed scalar nodes with vector nodes (basically refactoring `SuperWord::output` away) - etc. We may even be able to take the `VTransformGraph` and try to widen all nodes, or make transformations on this graph, a simplified version of IGVN. I have lots of ideas that would be unlocked with this new graph-based approach. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1541269988 From epeter at openjdk.org Wed Mar 27 15:03:51 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 27 Mar 2024 15:03:51 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v5] In-Reply-To: References: Message-ID: <5Aj8EVqPcEEhAdMPHrhikXWew8l6SjyF_LBm8n4vCMc=.60343078-8520-48e0-8601-3af302144233@github.com> > I'm refactoring the packset, separating the details of packset-manipulation from the SuperWord algorithm. > > Most importantly: I split it into two classes: `PairSet` and `PackSet`. > `combine_pairs_to_longer_packs` converts the first into the second. > > I was able to simplify the combining, and remove the pack-sorting. > I now walk "pair-chains" directly with `PairSetIterator`. One such pair-chain is equivalent to a pack. > > I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm. > > I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`: > Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. we would pack `ac bd` instead of `ab cd`). Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed, if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)). > > But since I already spent some time on some of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I. > > More details are described in the annotations in the code.
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: suggestions from Christian, batch 1 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18276/files - new: https://git.openjdk.org/jdk/pull/18276/files/d4136bba..72b72288 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18276&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18276&range=03-04 Stats: 76 lines in 2 files changed: 24 ins; 7 del; 45 mod Patch: https://git.openjdk.org/jdk/pull/18276.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18276/head:pull/18276 PR: https://git.openjdk.org/jdk/pull/18276 From epeter at openjdk.org Wed Mar 27 15:03:51 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 27 Mar 2024 15:03:51 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v4] In-Reply-To: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> References: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> Message-ID: On Wed, 27 Mar 2024 08:19:21 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> use left/right instead of s1/s2 in some obvious simple places > > src/hotspot/share/opto/superword.cpp line 2404: > >> 2402: for (int i = 0; i < body().length(); i++) { >> 2403: Node* n = body().at(i); >> 2404: Node_List* p = _packset.pack(n); > > Since you use this pattern a lot, you could also think about having a `SuperWord::pack()` method that delegates to `_packset.pack()`. Hmm. I tried it, but then we often also have a variable `pack`. So I now changed the `pack(n)` to `get_pack(n)`. I think that is better anyway, it suggests that we "get" something, rather than "pack" something. 
> src/hotspot/share/opto/superword.cpp line 3754: > >> 3752: } >> 3753: >> 3754: void PackSet::print_pack(Node_List* pack) const { > > Could also be made static since you don't access any fields. This is where a `Pack` class would eventually come in quite nicely. But I will change it to static for now. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1541288141 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1541290651 From roland at openjdk.org Wed Mar 27 15:04:23 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 27 Mar 2024 15:04:23 GMT Subject: RFR: 8328702: C2: Crash during parsing because sub type check is not folded In-Reply-To: References: Message-ID: <4ns0K8FAgoDXONbLGjVKsLj1SR5RoFXm9ZdKbrPcGc0=.abbd2b58-97a0-489c-9beb-046318685a65@github.com> On Wed, 27 Mar 2024 13:48:09 GMT, Christian Hagedorn wrote: > The test case shows a problem where data is folded during parsing while control is not. This leaves the graph in a broken state and we fail with an assertion. > > We have the following (pseudo) code for some class `X`: > > o = flag ? new Object[] : new byte[]; > if (o instanceof X) { > X x = (X)o; // checkcast > } > > For the `checkcast`, C2 knows that the type of `o` is some kind of array, i.e. type `[bottom`. But this cannot be a sub type of `X`. Therefore, the `CheckCastPP` node created for the `checkcast` result is replaced by top by the type system. However, the `SubTypeCheckNode` for the `checkcast` is not folded and the graph is broken. > > The problem of not folding the `SubTypeCheckNode` can be traced back to `SubTypeCheckNode::sub` calling `static_subtype_check()` when transforming the node after it's creation. 
`static_subtype_check()` should detect that the sub type check is always wrong here: > https://github.com/openjdk/jdk/blob/d0a265039a36292d87b249af0e8977982e5acc7b/src/hotspot/share/opto/compile.cpp#L4454-L4460 > > But it does not because these two checks return the following: > 1. Check: is `o` a sub type of `X`? -> returns no, so far so good. > 2. Check: _could_ `o` be a sub type of `X`? -> returns no which is wrong! `[bottom` is only a sub type of `Object` and can never be a subtype of `X` > > In `maybe_java_subtype_of_helper_for_arr()`, we wrongly conclude that any array with a base element type `bottom` _could_ be a sub type of anything: > https://github.com/openjdk/jdk/blob/d0a265039a36292d87b249af0e8977982e5acc7b/src/hotspot/share/opto/type.cpp#L6462-L6465 > But this is only true if the super class is also an array class - but not if `other` (super klass) is an instance klass as in this case. > > The fix for this is to first check the immediately following check which handles the case of comparing an array klass to an instance klass: An array klass can only ever be a sub class of an instance klass if it's the `Object` class. But in our case, we have `X` and this would return false: > > https://github.com/openjdk/jdk/blob/d0a265039a36292d87b249af0e8977982e5acc7b/src/hotspot/share/opto/type.cpp#L6466-L6468 > > The very same problem can also be triggered with `X` being an interface instead. There are tests for both these cases. > > #### Additionally Required Fix > When running with `-XX:+ExpandSubTypeCheckAtParseTime`, we eagerly expand the sub type check during parsing and therefore do not emit a `SubTypeCheckNode`. When additionally running with `-XX:+StressReflectiveCode`, th... Looks good to me. > When running with -XX:+ExpandSubTypeCheckAtParseTime Do we want to retire `ExpandSubTypeCheckAtParseTime`? Is there any reason to keep it? 
src/hotspot/share/opto/type.cpp line 6465: > 6463: } > 6464: if (this_one->is_instance_type(other)) { > 6465: return other->klass()->equals(ciEnv::current()->Object_klass()) && other->_interfaces->intersection_with(this_one->_interfaces)->eq(other->_interfaces); `TypeInterfaces` has a `contains` method that does `intersection_with` + `eq`. Could we use it here? i.e. `this_one->_interfaces->contains(other->_interfaces)` ------------- Marked as reviewed by roland (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18512#pullrequestreview-1963542605 PR Comment: https://git.openjdk.org/jdk/pull/18512#issuecomment-2022999693 PR Review Comment: https://git.openjdk.org/jdk/pull/18512#discussion_r1541245665 From chagedorn at openjdk.org Wed Mar 27 15:21:33 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 2024 15:21:33 GMT Subject: RFR: 8328480: C2: SubTypeCheckNode in checkcast should use the klass constant of a unique concrete sub class Message-ID: While working on a [Valhalla bug](https://bugs.openjdk.org/browse/JDK-8321734), I've noticed that a `SubTypeCheckNode` for a `checkcast` does not take a unique concrete sub class `X` of an abstract class `A` as klass constant in the sub type check. Instead, it uses the abstract klass constant: abstract class A {} class X extends A {} A x = (A)object; // Emits SubTypeCheckNode(object, A), but could have used X instead of A. However, the `CheckCastPP` result already uses the improved instance type ptr `X` (i.e. 
`toop` which was improved from `A` by calling `try_improve()` to get the unique concrete sub class): https://github.com/openjdk/jdk/blob/614db2ea9e10346475eef34629eab54878aa482d/src/hotspot/share/opto/graphKit.cpp#L3257-L3261 https://github.com/openjdk/jdk/blob/614db2ea9e10346475eef34629eab54878aa482d/src/hotspot/share/opto/graphKit.cpp#L3363 We should also plug in a unique concrete sub class constant in the `SubTypeCheckNode` which could be beneficial to fold away redundant sub type checks (see test cases). This fix is required to completely fix the bug in Valhalla (this is only one of the broken cases). In Valhalla, the graph ends up being broken because a `CheckCastPP` node is folded because of an impossible type but the `SubTypeCheckNode` is not due to not using the improved unique concrete sub class constant for the `checkcast`. I don't think that there is currently a bug in mainline because of this limitation - it just blocks some optimizations. I'm therefore upstreaming this fix to mainline since it can be beneficial to have this fix here as well (see test cases). 
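The shape being optimized, written out as a runnable Java sketch (names follow the abstract example above; the IR tests in the PR are authoritative):

```java
// With X as the unique concrete subclass of abstract A, the checkcast in
// castToA is emitted as SubTypeCheck(object, A) today, although X could be
// used as the (sharper) klass constant.
abstract class A {
}

class X extends A {
}

class CheckcastExample {
    static A castToA(Object object) {
        return (A) object; // checkcast
    }
}
```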
Thanks, Christian ------------- Commit messages: - 8328480: C2: SubTypeCheckNode in checkcast should use the klass constant of a unique concrete sub class Changes: https://git.openjdk.org/jdk/pull/18515/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18515&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8328480 Stats: 85 lines in 2 files changed: 78 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/18515.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18515/head:pull/18515 PR: https://git.openjdk.org/jdk/pull/18515 From epeter at openjdk.org Wed Mar 27 15:25:33 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 27 Mar 2024 15:25:33 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v4] In-Reply-To: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> References: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> Message-ID: On Wed, 27 Mar 2024 08:32:58 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> use left/right instead of s1/s2 in some obvious simple places > > src/hotspot/share/opto/superword.hpp line 69: > >> 67: // Doubly-linked pairs. If not linked: -1 >> 68: GrowableArray _left_to_right; // bb_idx -> bb_idx >> 69: GrowableArray _right_to_left; // bb_idx -> bb_idx > > I think it's a good solution but still found myself revisiting this several times while looking at the methods below how it works. Would it maybe help to give a visual example? For example: > > > left_to_right: > index: 0 1 2 3 > value: | -1 | 3 | -1 | -1 | ... > > => Node with bb_idx 1 is in the left slot of a pair which has the node with bb_idx 3 in the right slot. > => Nodes with bb_idx 0, 2, and 3 are not found in a left slot of any pair. > > right_to_left: > index: 0 1 2 3 > value: | -1 | -1 | -1 | 1 | ... 
> > => Node with bb_idx 3 is in the right slot of a pair which has the node with bb_idx 1 in the left slot. > => Nodes with bb_idx 0, 1, and 2 are not found in a right slot of any pair. > ``` Great idea, I made a slightly more complex example, inspired by yours! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1541332549 From epeter at openjdk.org Wed Mar 27 15:41:35 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 27 Mar 2024 15:41:35 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v4] In-Reply-To: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> References: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> Message-ID: On Wed, 27 Mar 2024 09:01:19 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> use left/right instead of s1/s2 in some obvious simple places > > src/hotspot/share/opto/superword.hpp line 263: > >> 261: const VLoopBody& _body; >> 262: >> 263: // The "packset" proper: an array of "packs" > > What do you mean by " The "packset" proper"? Changed comment to `// Set of all packs:` > src/hotspot/share/opto/superword.hpp line 488: > >> 486: bool do_vector_loop() { return _do_vector_loop; } >> 487: >> 488: const PackSet& packset() const { return _packset; } > > Somehow a strange alignment. You might want to fix that. Had it aligned with `do_vector_loop`. But it seems unnecessary, changed it. 
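The two-array `PairSet` encoding discussed earlier in this thread maps directly onto a small Java model (a simplified stand-in for the C++ implementation; `bbIdx` plays the role of `bb_idx`):

```java
import java.util.Arrays;

class PairSetModel {
    static final int NONE = -1;
    final int[] leftToRight;  // bbIdx of a left node  -> bbIdx of its right partner, or NONE
    final int[] rightToLeft;  // bbIdx of a right node -> bbIdx of its left partner, or NONE

    PairSetModel(int size) {
        leftToRight = new int[size];
        rightToLeft = new int[size];
        Arrays.fill(leftToRight, NONE);
        Arrays.fill(rightToLeft, NONE);
    }

    void addPair(int left, int right) {
        leftToRight[left] = right;
        rightToLeft[right] = left;
    }

    boolean isLeft(int bbIdx)  { return leftToRight[bbIdx] != NONE; }
    boolean isRight(int bbIdx) { return rightToLeft[bbIdx] != NONE; }
}
```

A node that sits in a right slot of one pair and the left slot of another links the two pairs into a "pair-chain", which is how `combine_pairs_to_longer_packs` recovers a pack from the pair set.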
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1541357933 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1541360704 From chagedorn at openjdk.org Wed Mar 27 15:47:21 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 2024 15:47:21 GMT Subject: RFR: 8328702: C2: Crash during parsing because sub type check is not folded In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 13:48:09 GMT, Christian Hagedorn wrote: > The test case shows a problem where data is folded during parsing while control is not. This leaves the graph in a broken state and we fail with an assertion. > > We have the following (pseudo) code for some class `X`: > > o = flag ? new Object[] : new byte[]; > if (o instanceof X) { > X x = (X)o; // checkcast > } > > For the `checkcast`, C2 knows that the type of `o` is some kind of array, i.e. type `[bottom`. But this cannot be a sub type of `X`. Therefore, the `CheckCastPP` node created for the `checkcast` result is replaced by top by the type system. However, the `SubTypeCheckNode` for the `checkcast` is not folded and the graph is broken. > > The problem of not folding the `SubTypeCheckNode` can be traced back to `SubTypeCheckNode::sub` calling `static_subtype_check()` when transforming the node after it's creation. `static_subtype_check()` should detect that the sub type check is always wrong here: > https://github.com/openjdk/jdk/blob/d0a265039a36292d87b249af0e8977982e5acc7b/src/hotspot/share/opto/compile.cpp#L4454-L4460 > > But it does not because these two checks return the following: > 1. Check: is `o` a sub type of `X`? -> returns no, so far so good. > 2. Check: _could_ `o` be a sub type of `X`? -> returns no which is wrong! 
`[bottom` is only a sub type of `Object` and can never be a subtype of `X` > > In `maybe_java_subtype_of_helper_for_arr()`, we wrongly conclude that any array with a base element type `bottom` _could_ be a sub type of anything: > https://github.com/openjdk/jdk/blob/d0a265039a36292d87b249af0e8977982e5acc7b/src/hotspot/share/opto/type.cpp#L6462-L6465 > But this is only true if the super class is also an array class - but not if `other` (super klass) is an instance klass as in this case. > > The fix for this is to first check the immediately following check which handles the case of comparing an array klass to an instance klass: An array klass can only ever be a sub class of an instance klass if it's the `Object` class. But in our case, we have `X` and this would return false: > > https://github.com/openjdk/jdk/blob/d0a265039a36292d87b249af0e8977982e5acc7b/src/hotspot/share/opto/type.cpp#L6466-L6468 > > The very same problem can also be triggered with `X` being an interface instead. There are tests for both these cases. > > #### Additionally Required Fix > When running with `-XX:+ExpandSubTypeCheckAtParseTime`, we eagerly expand the sub type check during parsing and therefore do not emit a `SubTypeCheckNode`. When additionally running with `-XX:+StressReflectiveCode`, th... Thanks Roland for your review! > > When running with -XX:+ExpandSubTypeCheckAtParseTime > > Do we want to retire `ExpandSubTypeCheckAtParseTime`? Is there any reason to keep it? I'm not sure about how much benefit it gives us. A quick JBS search for "ExpandSubTypeCheckAtParseTime" revealed a few issues - but would need to double check how many of them really only triggered with that flag and were real bugs. So, apart from having it as a stress option, I don't see a real benefit for it - but that might be a good enough reason to keep it for now. What do you think? 
------------- PR Review: https://git.openjdk.org/jdk/pull/18512#pullrequestreview-1963696677 From chagedorn at openjdk.org Wed Mar 27 15:47:22 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 2024 15:47:22 GMT Subject: RFR: 8328702: C2: Crash during parsing because sub type check is not folded In-Reply-To: <4ns0K8FAgoDXONbLGjVKsLj1SR5RoFXm9ZdKbrPcGc0=.abbd2b58-97a0-489c-9beb-046318685a65@github.com> References: <4ns0K8FAgoDXONbLGjVKsLj1SR5RoFXm9ZdKbrPcGc0=.abbd2b58-97a0-489c-9beb-046318685a65@github.com> Message-ID: On Wed, 27 Mar 2024 14:37:05 GMT, Roland Westrelin wrote: >> The test case shows a problem where data is folded during parsing while control is not. This leaves the graph in a broken state and we fail with an assertion. >> >> We have the following (pseudo) code for some class `X`: >> >> o = flag ? new Object[] : new byte[]; >> if (o instanceof X) { >> X x = (X)o; // checkcast >> } >> >> For the `checkcast`, C2 knows that the type of `o` is some kind of array, i.e. type `[bottom`. But this cannot be a sub type of `X`. Therefore, the `CheckCastPP` node created for the `checkcast` result is replaced by top by the type system. However, the `SubTypeCheckNode` for the `checkcast` is not folded and the graph is broken. >> >> The problem of not folding the `SubTypeCheckNode` can be traced back to `SubTypeCheckNode::sub` calling `static_subtype_check()` when transforming the node after it's creation. `static_subtype_check()` should detect that the sub type check is always wrong here: >> https://github.com/openjdk/jdk/blob/d0a265039a36292d87b249af0e8977982e5acc7b/src/hotspot/share/opto/compile.cpp#L4454-L4460 >> >> But it does not because these two checks return the following: >> 1. Check: is `o` a sub type of `X`? -> returns no, so far so good. >> 2. Check: _could_ `o` be a sub type of `X`? -> returns no which is wrong! 
`[bottom` is only a sub type of `Object` and can never be a subtype of `X` >> >> In `maybe_java_subtype_of_helper_for_arr()`, we wrongly conclude that any array with a base element type `bottom` _could_ be a sub type of anything: >> https://github.com/openjdk/jdk/blob/d0a265039a36292d87b249af0e8977982e5acc7b/src/hotspot/share/opto/type.cpp#L6462-L6465 >> But this is only true if the super class is also an array class - but not if `other` (super klass) is an instance klass as in this case. >> >> The fix for this is to first check the immediately following check which handles the case of comparing an array klass to an instance klass: An array klass can only ever be a sub class of an instance klass if it's the `Object` class. But in our case, we have `X` and this would return false: >> >> https://github.com/openjdk/jdk/blob/d0a265039a36292d87b249af0e8977982e5acc7b/src/hotspot/share/opto/type.cpp#L6466-L6468 >> >> The very same problem can also be triggered with `X` being an interface instead. There are tests for both these cases. >> >> #### Additionally Required Fix >> When running with `-XX:+ExpandSubTypeCheckAtParseTime`, we eagerly expand the sub type check during parsing and therefore do not emit a `SubTypeCheckNode`. Wh... > > src/hotspot/share/opto/type.cpp line 6465: > >> 6463: } >> 6464: if (this_one->is_instance_type(other)) { >> 6465: return other->klass()->equals(ciEnv::current()->Object_klass()) && other->_interfaces->intersection_with(this_one->_interfaces)->eq(other->_interfaces); > > `TypeInterfaces` has a `contains` method that does `intersection_with` + `eq`. Could we use it here? i.e. `this_one->_interfaces->contains(other->_interfaces)` That would definitely be better but I've seen that there are three other uses of `intersection_with` + `eq`. We should probably update them all together but not sure if I should squeeze this in here. Should I follow up with an RFE? 
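The suggested rewrite rests on a simple set identity: `intersection(a, b) == b` holds exactly when `a` contains all of `b`. Illustrated with plain Java sets (not the C++ `TypeInterfaces` API):

```java
import java.util.HashSet;
import java.util.Set;

class InterfaceSets {
    // intersection_with(...) followed by eq(...), spelled out ...
    static <T> boolean intersectionEquals(Set<T> a, Set<T> b) {
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return intersection.equals(b);
    }

    // ... is the same predicate as a direct containment check.
    static <T> boolean contains(Set<T> a, Set<T> b) {
        return a.containsAll(b);
    }
}
```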
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18512#discussion_r1541342059

From epeter at openjdk.org  Wed Mar 27 15:48:50 2024
From: epeter at openjdk.org (Emanuel Peter)
Date: Wed, 27 Mar 2024 15:48:50 GMT
Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v6]
In-Reply-To: 
References: 
Message-ID: 

> I'm refactoring the packset, separating the details of packset-manipulation from the SuperWord algorithm.
>
> Most importantly: I split it into two classes: `PairSet` and `PackSet`.
> `combine_pairs_to_longer_packs` converts the first into the second.
>
> I was able to simplify the combining, and remove the pack-sorting.
> I now walk "pair-chains" directly with `PairSetIterator`. One such pair-chain is equivalent to a pack.
>
> I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm.
>
> I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`:
> Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. we would pack `ac bd` instead of `ab cd`). Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed, if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)).
>
> But since I already spent some time on some of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I.
>
> More details are described in the annotations in the code.
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: more updates for Christian, batch 2 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18276/files - new: https://git.openjdk.org/jdk/pull/18276/files/72b72288..152c66e2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18276&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18276&range=04-05 Stats: 70 lines in 3 files changed: 43 ins; 0 del; 27 mod Patch: https://git.openjdk.org/jdk/pull/18276.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18276/head:pull/18276 PR: https://git.openjdk.org/jdk/pull/18276 From epeter at openjdk.org Wed Mar 27 15:52:44 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 27 Mar 2024 15:52:44 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v4] In-Reply-To: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> References: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> Message-ID: On Wed, 27 Mar 2024 09:08:35 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> use left/right instead of s1/s2 in some obvious simple places > > Thanks for the updates. The PR generally looks good! I now fully reviewed it and left some more comments. @chhagedorn thanks very much for the review, there were some great suggestions! I fixed almost all, and left some responses at the others. I'll repeat them here: ------------ You are right, I could actually remove the SuperWord reference from the PacksetGraph completely, and just pass the components, as const references. That would be really nice. If it is ok for you, I will do this in a future RFE. Actually, I plan to completely overhaul the PacksetGraph. 
It will be transformed to the VTransformGraph, and it will do:
- Cycle checking (like today)
- Evaluate the cost-model
- Execute: each node knows how to replace its packed scalar nodes with vector nodes (basically refactoring SuperWord::output away)
- etc.

We may even be able to take the VTransformGraph and try to widen all nodes, or make transformations on this graph, a simplified version of IGVN. I have lots of ideas that would be unlocked with this new graph-based approach.

------------

We moved the implementation of split / filter to the packset, and that makes sense: we can hide the packset internals, and don't have them spill out to the SuperWord code. We can just call _packset.split_packs with a split_strategy. But defining the split strategy itself depends on other SuperWord components, and so I think conceptually they belong with SuperWord. They query reductions, dependency graph, implemented, AlignmentSolution, profitable, etc. Sure, we can just pass a SuperWord reference, but that does not really seem right to me either. For me, the SuperWord class is there to manage the interface between all the components, and try to avoid passing components to other components, wherever possible. So I would rather have a list of methods in SuperWord, and each such method defines how the components interact (e.g. packset and alignment, packset and AlignmentSolution, pairset and packset, ...). But we can discuss this further, and maybe come up with an even better solution. My hope is just that we separate the components as much as possible, so that we know that only a handful of them interact at a given point. That makes the whole beast more manageable.

----------

Using `Pack` instead of `Node_List`. Great idea, would improve readability. Let's do it in a future RFE.
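As an illustration of the `_packset.split_packs(split_strategy)` idea discussed above, here is a hedged, simplified Java sketch. The method name `splitPacks` and the shape of the strategy are assumptions for illustration, not the HotSpot API; in C2 the strategy would consult reductions, the dependency graph, alignment, and so on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Simplified model of "split packs with a strategy": the strategy inspects a
// pack and returns a split position, or 0 to keep the pack whole.
class PackSetSketch {
    static <T> List<List<T>> splitPacks(List<List<T>> packs,
                                        ToIntFunction<List<T>> strategy) {
        List<List<T>> result = new ArrayList<>();
        for (List<T> pack : packs) {
            int pos = strategy.applyAsInt(pack);
            if (pos <= 0 || pos >= pack.size()) {
                result.add(pack);                                   // no split
            } else {
                result.add(new ArrayList<>(pack.subList(0, pos)));  // left part
                result.add(new ArrayList<>(pack.subList(pos, pack.size())));
            }
        }
        return result;
    }
}
```

For example, a strategy that splits each pack in the middle turns the pack `[1, 2, 3, 4]` into `[1, 2]` and `[3, 4]`.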
-------------

PR Comment: https://git.openjdk.org/jdk/pull/18276#issuecomment-2023112024

From shade at openjdk.org  Wed Mar 27 15:56:22 2024
From: shade at openjdk.org (Aleksey Shipilev)
Date: Wed, 27 Mar 2024 15:56:22 GMT
Subject: RFR: 8300148: Consider using a StoreStore barrier instead of Release barrier on ctor exit
In-Reply-To: 
References: 
Message-ID: 

On Wed, 27 Mar 2024 05:58:34 GMT, Joshua Cao wrote:

> The [JSR 133 cookbook](https://gee.cs.oswego.edu/dl/jmm/cookbook.html) has long recommended using a `StoreStore` barrier at the end of constructors that write to final fields. `StoreStore` barriers are much cheaper on arm machines as shown in benchmarks in this issue as well as https://bugs.openjdk.org/browse/JDK-8324186.
>
> This change does not improve the case for constructors for objects with volatile fields because [MemBarRelease is emitted for volatile stores](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/gc/shared/c2/barrierSetC2.cpp#L211). This is demonstrated in test case `classWithVolatile`, where this patch does not impact the IR.
>
> I had to modify some code around escape analysis to make sure there are no regressions in eliminating allocations and `StoreStore`'s. The [current handling of StoreStore's in escape analysis](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/escape.cpp#L2590) makes the assumption that the barrier's input is a `Proj` to an `Allocate` ([example](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/library_call.cpp#L1553)). This is contrary to the barriers at the end of the constructor, where the barrier directly takes in an `Allocate` without an in-between `Proj`. I opted to instead eliminate `StoreStore`s in GVN, exactly how `MemBarRelease` is handled.
> I had to add [checks for StoreStore in macro.cpp](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/macro.cpp#L636), or else we fail some [cases for reducing allocation merges](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/test/hotspot/jtreg/compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java#L1233-L1256).
>
> Passes hotspot tier1 locally on a Linux machine.
>
> ### Benchmarks
>
> Running Renaissance ParMnemonics on an Amazon Graviton (arm) instance.
>
> Baseline:
>
>     Result "org.renaissance.jdk.streams.JmhParMnemonics.run":
>       N = 25
>       mean = 3309.611 ±(99.9%) 86.699 ms/op
>
>       Histogram, ms/op:
>         [3000.000, 3050.000) = 0
>         [3050.000, 3100.000) = 4
>         [3100.000, 3150.000) = 1
>         [3150.000, 3200.000) = 0
>         [3200.000, 3250.000) = 0
>         [3250.000, 3300.000) = 0
>         [3300.000, 3350.000) = 9
>         [3350.000, 3400.000) = 6
>         [3400.000, 3450.000) = 5
>
>       Percentiles, ms/op:
>         p(0.0000) = 3069.910 ms/op
>         p(50.0000) = 3348.140 ms/op
> ...

I propose we also add this benchmark that verifies barrier costs and coalescing: [ConstructorBarriers.txt](https://github.com/openjdk/jdk/files/14775850/ConstructorBarriers.txt). Maybe these also should be the IR tests. The benchmarks show that most combinations with `final`-s improve, and scalar replaced objects also still work (and probably eliminate all the barriers). On my Graviton 3 instance:

    Benchmark                                          Mode  Cnt   Score   Error  Units

    # Before
    ConstructorBarriers.escaping_finalFinal            avgt    9   9.097 ± 0.032  ns/op
    ConstructorBarriers.escaping_finalPlain            avgt    9   9.120 ± 0.101  ns/op
    ConstructorBarriers.escaping_finalVolatile         avgt    9  11.590 ± 0.088  ns/op
    ConstructorBarriers.escaping_plainFinal            avgt    9   9.113 ± 0.037  ns/op
    ConstructorBarriers.escaping_plainPlain            avgt    9   7.627 ± 0.155  ns/op
    ConstructorBarriers.escaping_plainVolatile         avgt    9  13.055 ± 0.180  ns/op
    ConstructorBarriers.escaping_volatileFinal         avgt    9  10.650 ± 0.112  ns/op
    ConstructorBarriers.escaping_volatilePlain         avgt    9  13.074 ± 0.156  ns/op
    ConstructorBarriers.escaping_volatileVolatile      avgt    9  13.546 ± 0.100  ns/op
    ConstructorBarriers.non_escaping_finalFinal        avgt    9   2.220 ± 0.006  ns/op
    ConstructorBarriers.non_escaping_finalPlain        avgt    9   2.214 ± 0.014  ns/op
    ConstructorBarriers.non_escaping_finalVolatile     avgt    9   2.232 ± 0.035  ns/op
    ConstructorBarriers.non_escaping_plainFinal        avgt    9   2.222 ± 0.004  ns/op
    ConstructorBarriers.non_escaping_plainPlain        avgt    9   2.234 ± 0.036  ns/op
    ConstructorBarriers.non_escaping_plainVolatile     avgt    9   2.230 ± 0.019  ns/op
    ConstructorBarriers.non_escaping_volatileFinal     avgt    9   2.232 ± 0.018  ns/op
    ConstructorBarriers.non_escaping_volatilePlain     avgt    9   2.220 ± 0.033  ns/op
    ConstructorBarriers.non_escaping_volatileVolatile  avgt    9   2.232 ± 0.019  ns/op

    # After
    ConstructorBarriers.escaping_finalFinal            avgt    9   5.939 ± 0.035  ns/op  ; improves
    ConstructorBarriers.escaping_finalPlain            avgt    9   5.945 ± 0.033  ns/op  ; improves
    ConstructorBarriers.escaping_finalVolatile         avgt    9  10.997 ± 0.050  ns/op  ; improves
    ConstructorBarriers.escaping_plainFinal            avgt    9   5.923 ± 0.061  ns/op  ; improves
    ConstructorBarriers.escaping_plainPlain            avgt    9   7.687 ± 0.101  ns/op
    ConstructorBarriers.escaping_plainVolatile         avgt    9  13.039 ± 0.206  ns/op
    ConstructorBarriers.escaping_volatileFinal         avgt    9  10.568 ± 0.104  ns/op
    ConstructorBarriers.escaping_volatilePlain         avgt    9  13.061 ± 0.158  ns/op
    ConstructorBarriers.escaping_volatileVolatile      avgt    9  13.572 ± 0.174  ns/op
    ConstructorBarriers.non_escaping_finalFinal        avgt    9   2.212 ± 0.019  ns/op
    ConstructorBarriers.non_escaping_finalPlain        avgt    9   2.231 ± 0.041  ns/op
    ConstructorBarriers.non_escaping_finalVolatile     avgt    9   2.239 ± 0.045  ns/op
    ConstructorBarriers.non_escaping_plainFinal        avgt    9   2.224 ± 0.018  ns/op
    ConstructorBarriers.non_escaping_plainPlain        avgt    9   2.214 ± 0.024  ns/op
    ConstructorBarriers.non_escaping_plainVolatile     avgt    9   2.226 ± 0.029  ns/op
    ConstructorBarriers.non_escaping_volatileFinal     avgt    9   2.239 ± 0.029  ns/op
    ConstructorBarriers.non_escaping_volatilePlain     avgt    9   2.230 ± 0.039  ns/op
    ConstructorBarriers.non_escaping_volatileVolatile  avgt    9   2.235 ± 0.030  ns/op

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18505#issuecomment-2023118942

From roland at openjdk.org  Wed Mar 27 16:08:25 2024
From: roland at openjdk.org (Roland Westrelin)
Date: Wed, 27 Mar 2024 16:08:25 GMT
Subject: RFR: 8328702: C2: Crash during parsing because sub type check is not folded
In-Reply-To: 
References: 
Message-ID: 

On Wed, 27 Mar 2024 15:44:46 GMT, Christian Hagedorn wrote:

> I'm not sure about how much benefit it gives us. A quick JBS search for "ExpandSubTypeCheckAtParseTime" revealed a few issues - but would need to double check how many of them really only triggered with that flag and were real bugs. So, apart from having it as a stress option, I don't see a real benefit for it - but that might be a good enough reason to keep it for now.
>
> What do you think?

It also has a maintenance cost (you had to make a code change for it in this PR and I also remember having to take `ExpandSubTypeCheckAtParseTime` into consideration at some point). I would vote for removing it unless it's known to have some value.
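Returning to the constructor-barrier discussion above: a minimal, hedged Java sketch of the final-field publication pattern that the ConstructorBarriers benchmark exercises. The class names `Point` and `PointDemo` are illustrative, not from the patch.

```java
// JSR-133 requires that writes to final fields be visible before the object
// reference escapes; C2 enforces this with a barrier at the end of the
// constructor (MemBarRelease today; the PR proposes the cheaper
// MemBarStoreStore on weakly-ordered CPUs such as arm).
final class Point {
    final int x;
    final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
        // <- conceptual barrier location: the stores above may not be
        //    reordered past the publication of 'this'
    }
}

class PointDemo {
    static Point published;   // plain field used to publish the object

    public static void main(String[] args) {
        published = new Point(1, 2);   // escaping case: the barrier must stay
        System.out.println(published.x + "," + published.y);   // prints 1,2
    }
}
```

A non-escaping `Point` (one that never leaves the method) can be scalar-replaced, in which case the barrier is eliminated entirely, matching the flat `non_escaping_*` numbers above.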
------------- PR Comment: https://git.openjdk.org/jdk/pull/18512#issuecomment-2023150157 From roland at openjdk.org Wed Mar 27 16:08:26 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 27 Mar 2024 16:08:26 GMT Subject: RFR: 8328702: C2: Crash during parsing because sub type check is not folded In-Reply-To: References: <4ns0K8FAgoDXONbLGjVKsLj1SR5RoFXm9ZdKbrPcGc0=.abbd2b58-97a0-489c-9beb-046318685a65@github.com> Message-ID: On Wed, 27 Mar 2024 15:27:13 GMT, Christian Hagedorn wrote: >> src/hotspot/share/opto/type.cpp line 6465: >> >>> 6463: } >>> 6464: if (this_one->is_instance_type(other)) { >>> 6465: return other->klass()->equals(ciEnv::current()->Object_klass()) && other->_interfaces->intersection_with(this_one->_interfaces)->eq(other->_interfaces); >> >> `TypeInterfaces` has a `contains` method that does `intersection_with` + `eq`. Could we use it here? i.e. `this_one->_interfaces->contains(other->_interfaces)` > > That would definitely be better but I've seen that there are three other uses of `intersection_with` + `eq`. We should probably update them all together but not sure if I should squeeze this in here. Should I follow up with an RFE? That sounds good to me. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18512#discussion_r1541408422 From kvn at openjdk.org Wed Mar 27 16:10:24 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 27 Mar 2024 16:10:24 GMT Subject: RFR: 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 [v3] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 09:59:51 GMT, Volker Simonis wrote: >> Since [JDK-8251462: Simplify compilation policy](https://bugs.openjdk.org/browse/JDK-8251462), introduced in JDK 17, no native wrappers are generated any more if running with `-XX:-TieredCompilation` (i.e. native methods are not compiled any more). 
>> The attached JMH benchmark demonstrates that native method calls became twice as expensive with JDK 17:
>>
>>     public static native void emptyStaticNativeMethod();
>>
>>     @Benchmark
>>     public static void baseline() {
>>     }
>>
>>     @Benchmark
>>     public static void staticMethodCallingStatic() {
>>         emptyStaticMethod();
>>     }
>>
>>     @Benchmark
>>     public static void staticMethodCallingStaticNative() {
>>         emptyStaticNativeMethod();
>>     }
>>
>>     @Benchmark
>>     @Fork(jvmArgsAppend = "-XX:-TieredCompilation")
>>     public static void staticMethodCallingStaticNativeNoTiered() {
>>         emptyStaticNativeMethod();
>>     }
>>
>>     @Benchmark
>>     @Fork(jvmArgsAppend = "-XX:+PreferInterpreterNativeStubs")
>>     public static void staticMethodCallingStaticNativeIntStub() {
>>         emptyStaticNativeMethod();
>>     }
>>
>> JDK 11
>> ======
>>
>>     Benchmark                                           Mode  Cnt   Score   Error  Units
>>     NativeCall.baseline                                 avgt    5   0.390 ± 0.016  ns/op
>>     NativeCall.staticMethodCallingStatic                avgt    5   1.693 ± 0.053  ns/op
>>     NativeCall.staticMethodCallingStaticNative          avgt    5  10.287 ± 0.754  ns/op
>>     NativeCall.staticMethodCallingStaticNativeNoTiered  avgt    5   9.966 ± 0.248  ns/op
>>     NativeCall.staticMethodCallingStaticNativeIntStub   avgt    5  20.384 ± 0.444  ns/op
>>
>> JDK 17 & 21
>> ===========
>>
>>     Benchmark                                           Mode  Cnt   Score   Error  Units
>>     NativeCall.baseline                                 avgt    5   0.390 ± 0.017  ns/op
>>     NativeCall.staticMethodCallingStatic                avgt    5   1.852 ± 0.272  ns/op
>>     NativeCall.staticMethodCallingStaticNative          avgt    5  10.648 ± 0.661  ns/op
>>     NativeCall.staticMethodCallingStaticNativeNoTiered  avgt    5  20.657 ± 1.084  ns/op
>>     NativeCall.staticMethodCallingStaticNativeIntStub   avgt    5  22.429 ± 0.991  ns/op
>>
>> The issue can be seen if we run with `-XX:+PrintCompilation -XX:+PrintInlining`. With JDK 11 we get the following output for `-XX:+TieredCompilation`:
>>
>>     172  111   b  3   io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes)
>>       @ 0  io.simonis.NativeCall::emptyStaticNa...
>
> Volker Simonis has updated the pull request incrementally with one additional commit since the last revision:
>
>   Fix indentation

We need a regression test to catch this in the future.

-------------

PR Review: https://git.openjdk.org/jdk/pull/18496#pullrequestreview-1963807051

From iveresov at openjdk.org  Wed Mar 27 16:10:24 2024
From: iveresov at openjdk.org (Igor Veresov)
Date: Wed, 27 Mar 2024 16:10:24 GMT
Subject: RFR: 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 [v3]
In-Reply-To: 
References: 
Message-ID: 

On Wed, 27 Mar 2024 09:59:51 GMT, Volker Simonis wrote:

>> Since [JDK-8251462: Simplify compilation policy](https://bugs.openjdk.org/browse/JDK-8251462), introduced in JDK 17, no native wrappers are generated any more if running with `-XX:-TieredCompilation` (i.e. native methods are not compiled any more).
>>
>> The attached JMH benchmark demonstrates that native method calls became twice as expensive with JDK 17 (benchmark code and JDK 11 vs. JDK 17 & 21 numbers as quoted above).
>
> Volker Simonis has updated the pull request incrementally with one additional commit since the last revision:
>
>   Fix indentation

Have you found out why exactly the wrappers are not created with `-XX:-TieredCompilation`? I don't see any conditional with `is_native()` that would prevent that?
One such pair-chain is equivalent to a pack. >> >> I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm. >> >> I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`: >> Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. we would pack `ac bd` instead of `ab cd`). Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed, if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)). >> >> But since I already spent some time on some of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I. >> >> More details are described in the annotations in the code. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > more updates for Christian, batch 2 Thanks for doing all the updates. It looks good to me now! src/hotspot/share/opto/superword.hpp line 353: > 351: unmap_node_in_pack(old_pack->at(i)); > 352: } > 353: } You even added a `remap` method and added some asserts, nice! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/18276#pullrequestreview-1963810973 PR Review Comment: https://git.openjdk.org/jdk/pull/18276#discussion_r1541412358 From chagedorn at openjdk.org Wed Mar 27 16:10:39 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 2024 16:10:39 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v4] In-Reply-To: References: <6uACbwb1mo1Fnp2TWFySckhPMPHTns2rTSQSf-GZBE8=.0ef27dbd-8507-4e29-afc9-4b61e3c4b81d@github.com> Message-ID: On Wed, 27 Mar 2024 13:58:17 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/superword.cpp line 1394: >> >>> 1392: >>> 1393: SplitStatus PackSet::split_pack(const char* split_name, >>> 1394: Node_List* pack, >> >> Just wondering if you also plan to have a separate `Pack` class at some point instead of using `Node_List`? If it's not worth we might still want to use a typedef to better show what the intent is. But that's something for another day. > > I will keep that in mind. It could really improve readability. Let's do it in a separate RFE! Sounds good! >> src/hotspot/share/opto/superword.cpp line 1539: >> >>> 1537: >>> 1538: // Split packs at boundaries where left and right have different use or def packs. >>> 1539: void SuperWord::split_packs_at_use_def_boundaries() { >> >> As above with `PairSet`, I'm also wondering here if these `split` methods could be part of `PackSet`? But I have not checked all the method calls and it could very well be that you would need to pass a reference to `SuperWord` to `PackSet`. From a high-level view, "split packs" feels like an operation you perform on a pack set. Same, for example, as `SuperWord::verify_packs()`. >> >> Either way, I think the patch is already quite big and I think we should do that - if wanted - separately > > Yes, let's consider that for a future RFE. 
> > But for now I can say this: > We moved the implementation of `split / filter` to the packset, and that makes sense: we can hide the packset internals, and don't have them spill out to the SuperWord code. We can just call `_packset.split_packs` with a `split_strategy`. But defining the split strategy itself depends on other SuperWord components, and so I think conceptually they belong with SuperWord. They query `reductions`, `dependency graph`, `implemented`, `AlignmentSolution`, `profitable`, etc. > Sure, we can just pass a SuperWord reference, but that does not really seem right to me either. For me, the SuperWord class is there to manage the interface between all the components, and try to avoid passing components to other components, wherever possible. So I would rather have a list of methods in SuperWord, and each such method defines how the components interact (e.g. packset and alignment, packset and AlignmentSolution, pairset and packset, ...). > But we can discuss this further, and maybe come up with an even better solution. My hope is just that we separate the components as much as possible, so that we know that only a handful of them interact at a given point. That makes the whole beast more managable. That's a valid point. Let's have another look at `SuperWord`, `PairSet` and `PackSet` once the code stabilizes to check whether it needs improvements or not by moving some code to the `*Set` classes. >> src/hotspot/share/opto/superword.cpp line 2084: >> >>> 2082: // Create nodes (from packs and scalar-nodes), and add edges, based on the dependency graph. >>> 2083: void build() { >>> 2084: const PackSet& packset = _slp->packset(); >> >> You could also store the pack set as field since you access it several times. > > You are right, I could actually remove the SuperWord reference from the `PacksetGraph` completely, and just pass the components, as `const` references. That would be really nice. > > If it is ok for you, I will do this in a future RFE. 
> Actually, I plan to completely overhaul the `PacksetGraph`. It will be transformed to the `VTransformGraph`, and it will do:
> - Cycle checking (like today)
> - Evaluate the cost-model
> - Execute: each node knows how to replace its packed scalar nodes with vector nodes (basically refactoring `SuperWord::output` away)
> - etc.
>
> We may even be able to take the `VTransformGraph` and try to widen all nodes, or make transformations on this graph, a simplified version of IGVN. I have lots of ideas that would be unlocked with this new graph-based approach.

Sounds exciting!

> If it is ok for you, I will do this in a future RFE.

Sure, let's leave this as it is now.

>> src/hotspot/share/opto/superword.cpp line 2404:
>>
>>> 2402: for (int i = 0; i < body().length(); i++) {
>>> 2403:   Node* n = body().at(i);
>>> 2404:   Node_List* p = _packset.pack(n);
>>
>> Since you use this pattern a lot, you could also think about having a `SuperWord::pack()` method that delegates to `_packset.pack()`.
>
> Hmm. I tried it, but then we often also have a variable `pack`. So I now changed the `pack(n)` to `get_pack(n)`. I think that is better anyway, it suggests that we "get" something, rather than "pack" something.

Good point, `pack()` could be understood as query or command. `get_pack()` makes it clear.

>> src/hotspot/share/opto/superword.hpp line 69:
>>
>>> 67: // Doubly-linked pairs. If not linked: -1
>>> 68: GrowableArray _left_to_right; // bb_idx -> bb_idx
>>> 69: GrowableArray _right_to_left; // bb_idx -> bb_idx
>>
>> I think it's a good solution but still found myself revisiting this several times while looking at the methods below how it works. Would it maybe help to give a visual example? For example:
>>
>>     left_to_right:
>>     index:    0    1    2    3
>>     value: | -1 |  3 | -1 | -1 | ...
>>
>>     => Node with bb_idx 1 is in the left slot of a pair which has the node with bb_idx 3 in the right slot.
>>     => Nodes with bb_idx 0, 2, and 3 are not found in a left slot of any pair.
>>
>>     right_to_left:
>>     index:    0    1    2    3
>>     value: | -1 | -1 | -1 |  1 | ...
>>
>>     => Node with bb_idx 3 is in the right slot of a pair which has the node with bb_idx 1 in the left slot.
>>     => Nodes with bb_idx 0, 1, and 2 are not found in a right slot of any pair.
>
> Great idea, I made a slightly more complex example, inspired by yours!

Cool, thanks for adding an even more extensive example. I like it and helps with understanding the idea better :-)
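The doubly-linked pair encoding discussed above can be modeled in a few lines. This is a hedged, standalone Java sketch of the idea, not HotSpot code: `GrowableArray` is replaced by plain int arrays, and the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Model of the PairSet encoding: two arrays map a node's bb_idx to its pair
// partner's bb_idx, with -1 meaning "not in that slot". A pair-chain
// (equivalent to a pack) starts at a node that is a left but never a right.
class PairChainSketch {
    final int[] leftToRight;
    final int[] rightToLeft;

    PairChainSketch(int size) {
        leftToRight = new int[size];
        rightToLeft = new int[size];
        Arrays.fill(leftToRight, -1);
        Arrays.fill(rightToLeft, -1);
    }

    void addPair(int left, int right) {
        leftToRight[left] = right;
        rightToLeft[right] = left;
    }

    boolean isChainHead(int n) {
        return leftToRight[n] != -1 && rightToLeft[n] == -1;
    }

    // Walk the chain from a head, collecting the pack it represents.
    List<Integer> chainFrom(int head) {
        List<Integer> pack = new ArrayList<>();
        for (int n = head; n != -1; n = leftToRight[n]) {
            pack.add(n);
        }
        return pack;
    }
}
```

With the pairs (1, 3) and (3, 5), node 1 is the only chain head and the chain 1 -> 3 -> 5 corresponds to a pack of three nodes.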
Thanks for the reviews @chhagedorn @vnkozlov ------------- PR Comment: https://git.openjdk.org/jdk/pull/18508#issuecomment-2023168673 From roland at openjdk.org Wed Mar 27 16:17:27 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 27 Mar 2024 16:17:27 GMT Subject: Integrated: 8329163: C2: possible overflow in PhaseIdealLoop::extract_long_range_checks() In-Reply-To: References: Message-ID: <2G2Dhz4FPsxzicD6LZ3pcZOic1yNqgMfZpTVLTZtWZ4=.8d0c2333-9b84-4940-bcf7-50f0a9e3c635@github.com> On Wed, 27 Mar 2024 09:59:18 GMT, Roland Westrelin wrote: > This change avoids the overflow of `ABS(scale)` when `scale` is `min_jlong`. This pull request has now been integrated. Changeset: 05854fd7 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/05854fd704cba6ebd73007d9547a064891d49587 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8329163: C2: possible overflow in PhaseIdealLoop::extract_long_range_checks() Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/18508 From roland at openjdk.org Wed Mar 27 16:18:27 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 27 Mar 2024 16:18:27 GMT Subject: RFR: 8328480: C2: SubTypeCheckNode in checkcast should use the klass constant of a unique concrete sub class In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 15:17:36 GMT, Christian Hagedorn wrote: > While working on a [Valhalla bug](https://bugs.openjdk.org/browse/JDK-8321734), I've noticed that a `SubTypeCheckNode` for a `checkcast` does not take a unique concrete sub class `X` of an abstract class `A` as klass constant in the sub type check. Instead, it uses the abstract klass constant: > > > abstract class A {} > class X extends A {} > > A x = (A)object; // Emits SubTypeCheckNode(object, A), but could have used X instead of A. > > However, the `CheckCastPP` result already uses the improved instance type ptr `X` (i.e. 
`toop` which was improved from `A` by calling `try_improve()` to get the unique concrete sub class): > https://github.com/openjdk/jdk/blob/614db2ea9e10346475eef34629eab54878aa482d/src/hotspot/share/opto/graphKit.cpp#L3257-L3261 > https://github.com/openjdk/jdk/blob/614db2ea9e10346475eef34629eab54878aa482d/src/hotspot/share/opto/graphKit.cpp#L3363 > > We should also plug in a unique concrete sub class constant in the `SubTypeCheckNode` which could be beneficial to fold away redundant sub type checks (see test cases). > > This fix is required to completely fix the bug in Valhalla (this is only one of the broken cases). In Valhalla, the graph ends up being broken because a `CheckCastPP` node is folded because of an impossible type but the `SubTypeCheckNode` is not due to not using the improved unique concrete sub class constant for the `checkcast`. I don't think that there is currently a bug in mainline because of this limitation - it just blocks some optimizations. I'm therefore upstreaming this fix to mainline since it can be beneficial to have this fix here as well (see test cases). > > Thanks, > Christian Looks good in principle. src/hotspot/share/opto/graphKit.cpp line 3362: > 3360: // Generate the subtype check > 3361: Node* improved_superklass = superklass; > 3362: if (improved_klass_ptr_type != klass_ptr_type && improved_klass_ptr_type->singleton()) { In what case can there be an `improved_klass_ptr_type` that's not a constant? 
------------- PR Review: https://git.openjdk.org/jdk/pull/18515#pullrequestreview-1963830783 PR Review Comment: https://git.openjdk.org/jdk/pull/18515#discussion_r1541423188 From iveresov at openjdk.org Wed Mar 27 16:21:22 2024 From: iveresov at openjdk.org (Igor Veresov) Date: Wed, 27 Mar 2024 16:21:22 GMT Subject: RFR: 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 [v3] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 09:59:51 GMT, Volker Simonis wrote: >> Since [JDK-8251462: Simplify compilation policy](https://bugs.openjdk.org/browse/JDK-8251462), introduced in JDK 17, no native wrappers are generated any more if running with `-XX:-TieredCompilation` (i.e. native methods are not compiled any more). >> >> The attached JMH benchmark demonstrates that native method calls became twice as expensive with JDK 17: >> >> public static native void emptyStaticNativeMethod(); >> >> @Benchmark >> public static void baseline() { >> } >> >> @Benchmark >> public static void staticMethodCallingStatic() { >> emptyStaticMethod(); >> } >> >> @Benchmark >> public static void staticMethodCallingStaticNative() { >> emptyStaticNativeMethod(); >> } >> >> @Benchmark >> @Fork(jvmArgsAppend = "-XX:-TieredCompilation") >> public static void staticMethodCallingStaticNativeNoTiered() { >> emptyStaticNativeMethod(); >> } >> >> @Benchmark >> @Fork(jvmArgsAppend = "-XX:+PreferInterpreterNativeStubs") >> public static void staticMethodCallingStaticNativeIntStub() { >> emptyStaticNativeMethod(); >> } >> >> >> JDK 11 >> ====== >> >> Benchmark Mode Cnt Score Error Units >> NativeCall.baseline avgt 5 0.390 ± 0.016 ns/op >> NativeCall.staticMethodCallingStatic avgt 5 1.693 ± 0.053 ns/op >> NativeCall.staticMethodCallingStaticNative avgt 5 10.287 ± 0.754 ns/op >> NativeCall.staticMethodCallingStaticNativeNoTiered avgt 5 9.966 ± 0.248 ns/op >> NativeCall.staticMethodCallingStaticNativeIntStub avgt 5 20.384 ±
0.444 ns/op >> >> >> JDK 17 & 21 >> =========== >> >> Benchmark Mode Cnt Score Error Units >> NativeCall.baseline avgt 5 0.390 ± 0.017 ns/op >> NativeCall.staticMethodCallingStatic avgt 5 1.852 ± 0.272 ns/op >> NativeCall.staticMethodCallingStaticNative avgt 5 10.648 ± 0.661 ns/op >> NativeCall.staticMethodCallingStaticNativeNoTiered avgt 5 20.657 ± 1.084 ns/op >> NativeCall.staticMethodCallingStaticNativeIntStub avgt 5 22.429 ± 0.991 ns/op >> >> >> The issue can be seen if we run with `-XX:+PrintCompilation -XX:+PrintInlining`. With JDK 11 we get the following output for `-XX:+TieredCompilation`: >> >> 172 111 b 3 io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes) >> @ 0 io.simonis.NativeCall::emptyStaticNa... > > Volker Simonis has updated the pull request incrementally with one additional commit since the last revision: > > Fix indentation Regarding the solution. You seem to be tapping into the `is_method_profiled()` functionality and applying the threshold rule meant for methods with MDOs, which is a bit hacky. How about a more straightforward way: diff --git a/src/hotspot/share/compiler/compilationPolicy.cpp b/src/hotspot/share/compiler/compilationPolicy.cpp index d61de7cc866..57173ed621c 100644 --- a/src/hotspot/share/compiler/compilationPolicy.cpp +++ b/src/hotspot/share/compiler/compilationPolicy.cpp @@ -1026,7 +1026,7 @@ CompLevel CompilationPolicy::common(const methodHandle& method, CompLevel cur_le if (force_comp_at_level_simple(method)) { next_level = CompLevel_simple; } else { - if (is_trivial(method)) { + if (is_trivial(method) || method->is_native()) { next_level = CompilationModeFlag::disable_intermediate() ? CompLevel_full_optimization : CompLevel_simple; } else { switch(cur_level) { What do you think? Would that work?
------------- PR Comment: https://git.openjdk.org/jdk/pull/18496#issuecomment-2023178005 From duke at openjdk.org Wed Mar 27 16:27:55 2024 From: duke at openjdk.org (Joshua Cao) Date: Wed, 27 Mar 2024 16:27:55 GMT Subject: RFR: 8300148: Consider using a StoreStore barrier instead of Release barrier on ctor exit [v2] In-Reply-To: References: Message-ID: > The [JSR 133 cookbook](https://gee.cs.oswego.edu/dl/jmm/cookbook.html) has long recommended using a `StoreStore` barrier at the end of constructors that write to final fields. `StoreStore` barriers are much cheaper on arm machines as shown in benchmarks in this issue as well as https://bugs.openjdk.org/browse/JDK-8324186. > > This change does not improve the case for constructors for objects with volatile fields because [MemBarRelease is emitted for volatile stores](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/gc/shared/c2/barrierSetC2.cpp#L211). This is demonstrated in test case `classWithVolatile`, where this patch does not impact the IR. > > I had to modify some code around escape analysis to make sure there are no regressions in eliminating allocations and `StoreStore`s. The [current handling of StoreStores in escape analysis](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/escape.cpp#L2590) makes the assumption that the barrier's input is a `Proj` to an `Allocate` ([example](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/library_call.cpp#L1553)). This is contrary to the barriers at the end of the constructor, where the barrier directly takes in an `Allocate` without an in-between `Proj`. I opted to instead eliminate `StoreStore`s in GVN, exactly how `MemBarRelease` is handled.
> > I had to add [checks for StoreStore in macro.cpp](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/macro.cpp#L636), or else we fail some [cases for reducing allocation merges](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/test/hotspot/jtreg/compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java#L1233-L1256). > > Passes hotspot tier1 locally on a Linux machine. > > ### Benchmarks > > Running Renaissance ParMnemonics on an Amazon Graviton (arm) instance. > > Baseline: > > Result "org.renaissance.jdk.streams.JmhParMnemonics.run": > N = 25 > mean = 3309.611 ±(99.9%) 86.699 ms/op > > Histogram, ms/op: > [3000.000, 3050.000) = 0 > [3050.000, 3100.000) = 4 > [3100.000, 3150.000) = 1 > [3150.000, 3200.000) = 0 > [3200.000, 3250.000) = 0 > [3250.000, 3300.000) = 0 > [3300.000, 3350.000) = 9 > [3350.000, 3400.000) = 6 > [3400.000, 3450.000) = 5 > > Percentiles, ms/op: > p(0.0000) = 3069.910 ms/op > p(50.0000) = 3348.140 ms/op > ... Joshua Cao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains seven additional commits since the last revision: - Merge branch 'master' into storestore - Remove unused imports - Handle remaining cases of eliminating StoreStore for escaped objs - Add tests for barriers in constructors - Replace all end of ctor MemBarRelease with MemBarStoreStore - Compute redundancy for StoreStore - 8300148: Consider using a StoreStore barrier instead of Release barrier on ctor exit ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18505/files - new: https://git.openjdk.org/jdk/pull/18505/files/7cbe49bb..950864da Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18505&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18505&range=00-01 Stats: 66751 lines in 2563 files changed: 14523 ins; 9612 del; 42616 mod Patch: https://git.openjdk.org/jdk/pull/18505.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18505/head:pull/18505 PR: https://git.openjdk.org/jdk/pull/18505 From duke at openjdk.org Wed Mar 27 16:31:49 2024 From: duke at openjdk.org (Joshua Cao) Date: Wed, 27 Mar 2024 16:31:49 GMT Subject: RFR: 8300148: Consider using a StoreStore barrier instead of Release barrier on ctor exit [v3] In-Reply-To: References: Message-ID: > The [JSR 133 cookbook](https://gee.cs.oswego.edu/dl/jmm/cookbook.html) has long recommended using a `StoreStore` barrier at the end of constructors that write to final fields. `StoreStore` barriers are much cheaper on arm machines as shown in benchmarks in this issue as well as https://bugs.openjdk.org/browse/JDK-8324186. > > This change does not improve the case for constructors for objects with volatile fields because [MemBarRelease is emitted for volatile stores](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/gc/shared/c2/barrierSetC2.cpp#L211). This is demonstrated in test case `classWithVolatile`, where this patch does not impact the IR. 
> > I had to modify some code around escape analysis to make sure there are no regressions in eliminating allocations and `StoreStore`s. The [current handling of StoreStores in escape analysis](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/escape.cpp#L2590) makes the assumption that the barrier's input is a `Proj` to an `Allocate` ([example](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/library_call.cpp#L1553)). This is contrary to the barriers at the end of the constructor, where the barrier directly takes in an `Allocate` without an in-between `Proj`. I opted to instead eliminate `StoreStore`s in GVN, exactly how `MemBarRelease` is handled. > > I had to add [checks for StoreStore in macro.cpp](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/macro.cpp#L636), or else we fail some [cases for reducing allocation merges](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/test/hotspot/jtreg/compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java#L1233-L1256). > > Passes hotspot tier1 locally on a Linux machine. > > ### Benchmarks > > Running Renaissance ParMnemonics on an Amazon Graviton (arm) instance. > > Baseline: > > Result "org.renaissance.jdk.streams.JmhParMnemonics.run": > N = 25 > mean = 3309.611 ±(99.9%) 86.699 ms/op > > Histogram, ms/op: > [3000.000, 3050.000) = 0 > [3050.000, 3100.000) = 4 > [3100.000, 3150.000) = 1 > [3150.000, 3200.000) = 0 > [3200.000, 3250.000) = 0 > [3250.000, 3300.000) = 0 > [3300.000, 3350.000) = 9 > [3350.000, 3400.000) = 6 > [3400.000, 3450.000) = 5 > > Percentiles, ms/op: > p(0.0000) = 3069.910 ms/op > p(50.0000) = 3348.140 ms/op > ...
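The final-field publication pattern behind the JSR-133 cookbook recommendation quoted in this PR can be sketched as follows; class and field names are illustrative, not from the patch's test cases:

```java
// Sketch of the publication pattern: a constructor writing a final field
// needs a barrier before the object's reference escapes, so a racing reader
// can never observe the default value of 'x'. At constructor exit C2 emits
// that barrier (MemBarRelease before this patch, the cheaper
// MemBarStoreStore with it).
class Box {
    final int x;

    Box(int x) {
        this.x = x;
        // conceptually: [barrier] here orders the final-field store above
        // before the publishing store of the Box reference below
    }
}

class FinalFieldPublication {
    static Box shared; // racing reader threads would load from this field

    public static void main(String[] args) {
        shared = new Box(42); // publish; the barrier keeps the stores ordered
        System.out.println(shared.x);
    }
}
```

StoreStore ordering comes essentially for free on x86's stronger memory model, which is why the win in the benchmarks above shows up on arm.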
Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: EA tests, static test classes, add @requires, fix comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18505/files - new: https://git.openjdk.org/jdk/pull/18505/files/950864da..3f03f31e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18505&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18505&range=01-02 Stats: 30 lines in 2 files changed: 26 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/18505.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18505/head:pull/18505 PR: https://git.openjdk.org/jdk/pull/18505 From chagedorn at openjdk.org Wed Mar 27 16:32:22 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 2024 16:32:22 GMT Subject: RFR: 8328702: C2: Crash during parsing because sub type check is not folded In-Reply-To: References: Message-ID: <22gjjbpqMeDYjF-V6mJnrdh71N6lJQ2O8lw8s6mxyxk=.1b307c29-a1be-48fd-9f3c-a35ad652f8a2@github.com> On Wed, 27 Mar 2024 16:05:12 GMT, Roland Westrelin wrote: > > I'm not sure about how much benefit it gives us. A quick JBS search for "ExpandSubTypeCheckAtParseTime" revealed a few issues - but would need to double check how many of them really only triggered with that flag and were real bugs. So, apart from having it as a stress option, I don't see a real benefit for it - but that might be a good enough reason to keep it for now. > > What do you think? > > It also has a maintenance cost (you had to make a code change for it in this PR and I also remember having to take `ExpandSubTypeCheckAtParseTime` into consideration at some point). I would vote for removing it unless it's known to have some value. You're right, it indeed has a maintenance cost (as proved again here) which I'm also not so sure of whether it's worth. Removing the flag sounds reasonable from that standpoint. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18512#issuecomment-2023217154 From chagedorn at openjdk.org Wed Mar 27 16:32:23 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 2024 16:32:23 GMT Subject: RFR: 8328702: C2: Crash during parsing because sub type check is not folded In-Reply-To: References: <4ns0K8FAgoDXONbLGjVKsLj1SR5RoFXm9ZdKbrPcGc0=.abbd2b58-97a0-489c-9beb-046318685a65@github.com> Message-ID: On Wed, 27 Mar 2024 16:05:38 GMT, Roland Westrelin wrote: >> That would definitely be better but I've seen that there are three other uses of `intersection_with` + `eq`. We should probably update them all together but not sure if I should squeeze this in here. Should I follow up with an RFE? > > That sounds good to me. Filed [JDK-8329201](https://bugs.openjdk.org/browse/JDK-8329201) - will do a follow up PR. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18512#discussion_r1541449605 From chagedorn at openjdk.org Wed Mar 27 16:42:30 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 27 Mar 2024 16:42:30 GMT Subject: RFR: 8328480: C2: SubTypeCheckNode in checkcast should use the klass constant of a unique concrete sub class In-Reply-To: References: Message-ID: <0y-TzdF1qxDiFz2ZyTYon6GVd_F5wN_bPdi4Evt8CPg=.b390ea96-81e5-49a4-8d21-9d983a7fb892@github.com> On Wed, 27 Mar 2024 16:15:08 GMT, Roland Westrelin wrote: >> While working on a [Valhalla bug](https://bugs.openjdk.org/browse/JDK-8321734), I've noticed that a `SubTypeCheckNode` for a `checkcast` does not take a unique concrete sub class `X` of an abstract class `A` as klass constant in the sub type check. Instead, it uses the abstract klass constant: >> >> >> abstract class A {} >> class X extends A {} >> >> A x = (A)object; // Emits SubTypeCheckNode(object, A), but could have used X instead of A. >> >> However, the `CheckCastPP` result already uses the improved instance type ptr `X` (i.e. 
`toop` which was improved from `A` by calling `try_improve()` to get the unique concrete sub class): >> https://github.com/openjdk/jdk/blob/614db2ea9e10346475eef34629eab54878aa482d/src/hotspot/share/opto/graphKit.cpp#L3257-L3261 >> https://github.com/openjdk/jdk/blob/614db2ea9e10346475eef34629eab54878aa482d/src/hotspot/share/opto/graphKit.cpp#L3363 >> >> We should also plug in a unique concrete sub class constant in the `SubTypeCheckNode` which could be beneficial to fold away redundant sub type checks (see test cases). >> >> This fix is required to completely fix the bug in Valhalla (this is only one of the broken cases). In Valhalla, the graph ends up being broken because a `CheckCastPP` node is folded because of an impossible type but the `SubTypeCheckNode` is not due to not using the improved unique concrete sub class constant for the `checkcast`. I don't think that there is currently a bug in mainline because of this limitation - it just blocks some optimizations. I'm therefore upstreaming this fix to mainline since it can be beneficial to have this fix here as well (see test cases). >> >> Thanks, >> Christian > > src/hotspot/share/opto/graphKit.cpp line 3362: > >> 3360: // Generate the subtype check >> 3361: Node* improved_superklass = superklass; >> 3362: if (improved_klass_ptr_type != klass_ptr_type && improved_klass_ptr_type->singleton()) { > > In what case can there be an `improved_klass_ptr_type` that's not a constant? I was surprised by that as well when I've run testing but `gen_checkcast()` is actually also called in `Parse::array_store_check()` which passes a `LoadKlassNode` which is not a constant: https://github.com/openjdk/jdk/blob/05854fd704cba6ebd73007d9547a064891d49587/src/hotspot/share/opto/parseHelper.cpp#L233-L238 It states that it ignores the result and just calls it for the CFG effects - does not sound like a very clean solution. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18515#discussion_r1541468186 From epeter at openjdk.org Wed Mar 27 16:46:24 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 27 Mar 2024 16:46:24 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If [v2] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 11:34:56 GMT, Christian Hagedorn wrote: >> This is a follow-up to the previous refactoring done in https://github.com/openjdk/jdk/pull/18080. The patch starts to replace the usages of `create_bool_from_template_assertion_predicate()` by providing a refactored and fixed cloning algorithm. >> >> #### How `create_bool_from_template_assertion_predicate()` Works >> Currently, the algorithm in `create_bool_from_template_assertion_predicate()` uses an iterative DFS walk to find all nodes of a Template Assertion Predicate Expression in order to clone them. We do the following: >> 1. Follow all inputs if they could be a node that's part of a Template Assertion Predicate (compares opcodes): >> https://github.com/openjdk/jdk/blob/326c91e1a28ec70822ef927ee9ab17f79aa6d35c/src/hotspot/share/opto/loopTransform.cpp#L1513 >> >> 2. Once we find an `OpaqueLoopInit` or `OpaqueLoopStride` node, we start backtracking in the DFS. While doing so, we start to clone all nodes on the path from the `OpaqueLoop*Nodes` node to the start node and already update the graph. This logic is quite complex and difficult to understand since we do everything simultaneously. This was one of the reasons, I've originally tried to refactor this method in https://github.com/openjdk/jdk/pull/16877 because I needed to extend it for the full fix of Assertion Predicates in JDK-8288981. >> >> #### Missing Visited Set >> The current implementation of `create_bool_from_template_assertion_predicate()` does not use a visited set. 
This means that whenever we find a diamond shape, we could visit a node twice and re-discover all paths above this diamond again: >> >> >> ... >> | >> E >> | >> D >> / \ >> B C >> \ / >> A >> >> DFS walk: A -> B -> D -> E -> ... -> C -> D -> E -> ... >> >> With each diamond, the number of revisits of each node above doubles. >> >> #### Endless DFS in Edge-Cases >> In most cases, we would normally just stop quite quickly once we follow a data node that is not part of a Template Assertion Predicate Expression because the node opcode is different. However, in the test cases, we create a long chain of data nodes with many diamonds that could all be part of a Template Assertion Predicate Expression (i.e. `is_part_of_template_assertion_predicate_bool()` would return true to follow the inputs in a DFS walk). As a result, the DFS revisits a lot of nodes, especially higher up in the graph, exponentially many times and compilation is stuck for a long time (running the test cases result in a test timeout because... > > Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Change from DFS to 2xBFS > - Review Emanuel first part > - Merge branch 'master' into JDK-8327110 > - 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix pure cloning cases used for Loop Unswitching and Split If The changes look good, just a few minor suggestions left. src/hotspot/share/opto/predicates.cpp line 193: > 191: // trivially pass the _node_filter. > 192: NodeCheck _is_target_node; > 193: Unique_Node_List _collected_nodes; // The resulting node collection of all nodes on paths from source->target(s). Suggestion: // The resulting node collection of all nodes on paths from source->target(s). Unique_Node_List _collected_nodes; Just cosmetic. 
All other comments are on their own lines too. src/hotspot/share/opto/predicates.cpp line 217: > 215: backtrack_from_target_nodes(); > 216: assert(_collected_nodes.member(start_node), "must find start node again when backtracking"); > 217: } What scenario is there where we collect no nodes? Probably there is some, where we don't have to clone anything.. but it's a bit strange. Would be nice to have a quick comment here about that. For simplicity, you could always call backtrack (it just does nothing anyway). Then you can just make the assert a bit smarter: `assert(_collected_nodes.size() == 0 || _collected_nodes.member(start_node), "must find start node again when backtracking");` test/hotspot/jtreg/compiler/predicates/TestCloningWithManyDiamondsInExpression.java line 28: > 26: * @test > 27: * @bug 8327110 > 28: * @requires vm.compiler2.enabled Is this required? Maybe we can move the ` * @run main compiler.predicates.TestCloningWithManyDiamondsInExpression` run to a new `@test` block that does not have this `@requires`? Then other compilers also get tested. 
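As an aside, the diamond revisit blow-up described in the PR summary above can be sketched with a toy graph; this is an illustrative model, not the C2 node data structures. Each stacked diamond doubles the walk above it when the DFS has no visited set:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model: a map from node id to the list of its input node ids.
class DiamondDfs {
    // Counts node visits of a DFS over the inputs. With 'visited == null'
    // (no visited set) every diamond doubles the walk above it; with a
    // visited set each node is entered at most once.
    static long countVisits(Map<Integer, List<Integer>> inputs, int node, Set<Integer> visited) {
        if (visited != null && !visited.add(node)) {
            return 0; // already seen, prune this path
        }
        long visits = 1;
        for (int in : inputs.getOrDefault(node, List.of())) {
            visits += countVisits(inputs, in, visited);
        }
        return visits;
    }

    // Builds n stacked diamonds: bottom -> (left, right) -> merge -> next diamond.
    static Map<Integer, List<Integer>> diamonds(int n) {
        Map<Integer, List<Integer>> g = new HashMap<>();
        int id = 0;
        int bottom = id++;
        for (int i = 0; i < n; i++) {
            int left = id++, right = id++, merge = id++;
            g.put(bottom, List.of(left, right));
            g.put(left, List.of(merge));
            g.put(right, List.of(merge));
            bottom = merge;
        }
        return g;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> g = diamonds(10); // 31 nodes in total
        System.out.println(countVisits(g, 0, null));            // 4093 visits: ~2^(n+2)
        System.out.println(countVisits(g, 0, new HashSet<>())); // 31 visits: linear
    }
}
```

With ten diamonds the unchecked walk already performs over four thousand visits on a 31-node graph, which is the effect that made the test cases time out.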
------------- PR Review: https://git.openjdk.org/jdk/pull/18293#pullrequestreview-1963786755 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1541397549 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1541407719 PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1541420965 From roland at openjdk.org Wed Mar 27 16:49:22 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 27 Mar 2024 16:49:22 GMT Subject: RFR: 8328480: C2: SubTypeCheckNode in checkcast should use the klass constant of a unique concrete sub class In-Reply-To: <0y-TzdF1qxDiFz2ZyTYon6GVd_F5wN_bPdi4Evt8CPg=.b390ea96-81e5-49a4-8d21-9d983a7fb892@github.com> References: <0y-TzdF1qxDiFz2ZyTYon6GVd_F5wN_bPdi4Evt8CPg=.b390ea96-81e5-49a4-8d21-9d983a7fb892@github.com> Message-ID: On Wed, 27 Mar 2024 16:39:32 GMT, Christian Hagedorn wrote: >> src/hotspot/share/opto/graphKit.cpp line 3362: >> >>> 3360: // Generate the subtype check >>> 3361: Node* improved_superklass = superklass; >>> 3362: if (improved_klass_ptr_type != klass_ptr_type && improved_klass_ptr_type->singleton()) { >> >> In what case can there be an `improved_klass_ptr_type` that's not a constant? > > I was surprised by that as well when I've run testing but `gen_checkcast()` is actually also called in `Parse::array_store_check()` which passes a `LoadKlassNode` which is not a constant: > > https://github.com/openjdk/jdk/blob/05854fd704cba6ebd73007d9547a064891d49587/src/hotspot/share/opto/parseHelper.cpp#L233-L238 > > It states that it ignores the result and just calls it for the CFG effects - does not sound like a very clean solution. I'm confused by this. Looking at `TypeInstKlassPtr::try_improve()` returns a constant only if it finds a unique subklass and that subklass is final. So it should not be uncommon to have an `improved_klass_ptr_type` that's not constant. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18515#discussion_r1541480833 From never at openjdk.org Wed Mar 27 17:12:29 2024 From: never at openjdk.org (Tom Rodriguez) Date: Wed, 27 Mar 2024 17:12:29 GMT Subject: RFR: 8329191: JVMCI compiler warning is truncated In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 14:00:41 GMT, Doug Simon wrote: > $ java -Djdk.graal.CompilerConfiguration=XXcommunity HelloWorld > [0.035s][warning][jit,compilation] JVMCI compiler disabled after 11 of 11 upcalls had errors (Last error: "uncaught exception in call_HotSpotJVMCIRuntime_compileMethod [jdk.graal.compiler.debug.GraalError: Compiler configuration 'XXcommunity' not found. Available configurations are: enterp > > The above message is truncated. It should be: > > [0.032s][warning][jit,compilation] JVMCI compiler disabled after 11 of 11 upcalls had errors (Last error: "uncaught exception in call_HotSpotJVMCIRuntime_compileMethod [jdk.graal.compiler.debug.GraalError: Compiler configuration 'XXcommunity' not found. Available configurations are: enterprise, community, economy]"). Use -Xlog:jit+compilation for more detail. > > > This PR fixes this by using `stringStream` instead of `err_msg` when creating these messages. In jvmciEnv.cpp we use err_msg with _init_error_msg in two places. Should those also use stringStream? All the other uses of err_msg look like they should be short enough to use the preallocated buffer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18513#issuecomment-2023336838 From epeter at openjdk.org Wed Mar 27 17:12:27 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 27 Mar 2024 17:12:27 GMT Subject: RFR: 8326421: Add jtreg test for large arrayCopy disjoint case.
In-Reply-To: References: Message-ID: <_0V4aLv23eyNBgwgzFThGCfXPQw6jTZa2me6ZnF6I_g=.83cd138c-fc3c-4b01-9ccd-10ff7f4bf5d7@github.com> On Thu, 22 Feb 2024 13:01:50 GMT, Swati Sharma wrote: > Hi All, > > Added a new jtreg test case for large arrayCopy disjoint case. > This will test byte array copy operation for aligned and non-aligned cases with array length greater than 2.5MB. > > Please review and provide your feedback. > > Thanks, > Swati > Intel Thanks for writing this test, and validating the results! I left a few suggestions for improvement. test/hotspot/jtreg/compiler/arraycopy/TestArrayCopyDisjointLarge.java line 25: > 23: > 24: package compiler.arraycopy; > 25: import java.util.Random; You don't seem to be using Random at all. But maybe you should ;) test/hotspot/jtreg/compiler/arraycopy/TestArrayCopyDisjointLarge.java line 29: > 27: /** > 28: * @test > 29: * @bug 8310159 Suggestion: * @bug 8326421 Was there a reason for the other bug number? I think usually we use the bug number of the issue where the test is added. I might be wrong. test/hotspot/jtreg/compiler/arraycopy/TestArrayCopyDisjointLarge.java line 32: > 30: * @summary Test large arrayCopy. > 31: * > 32: * @run main/othervm/timeout=600 -XX:-TieredCompilation -Xbatch compiler.arraycopy.TestArrayCopyDisjointLarge Suggestion: * @run main/timeout=600 compiler.arraycopy.TestArrayCopyDisjointLarge test/hotspot/jtreg/compiler/arraycopy/TestArrayCopyDisjointLarge.java line 82: > 80: testByte(lengths[i % lengths.length], 9, 0); > 81: testByte(lengths[i % lengths.length], 9, 9); > 82: } Why not randomize the values a bit, rather than choosing from a few fixed ones? Or at least add some random values. This would give better coverage.
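A randomized variant along the lines suggested in this review could look like the sketch below; `testByte(length, srcOffset, dstOffset)` mirrors the shape of the test's helper but is written from scratch here:

```java
import java.util.Random;

// Sketch of a randomized large-arraycopy check: random lengths above the
// 2.5MB threshold and random small offsets instead of a fixed table.
class ArrayCopyRandomSketch {
    static final Random RANDOM = new Random(42); // fixed seed keeps failures reproducible

    static void testByte(int length, int srcOff, int dstOff) {
        byte[] src = new byte[srcOff + length];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) i;
        }
        byte[] dst = new byte[dstOff + length];
        System.arraycopy(src, srcOff, dst, dstOff, length);
        for (int i = 0; i < length; i++) {
            if (dst[dstOff + i] != src[srcOff + i]) {
                throw new AssertionError("mismatch at index " + i);
            }
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            int length = 2_500_000 + RANDOM.nextInt(100_000); // stay above 2.5MB
            testByte(length, RANDOM.nextInt(16), RANDOM.nextInt(16));
        }
        System.out.println("ok");
    }
}
```

Logging the seed (or taking it from `jdk.test.lib.Utils.getRandomInstance()` in jtreg) keeps any failure reproducible while still covering more alignments than a fixed table.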
------------- PR Review: https://git.openjdk.org/jdk/pull/17962#pullrequestreview-1963948224 PR Review Comment: https://git.openjdk.org/jdk/pull/17962#discussion_r1541513469 PR Review Comment: https://git.openjdk.org/jdk/pull/17962#discussion_r1541503189 PR Review Comment: https://git.openjdk.org/jdk/pull/17962#discussion_r1541503607 PR Review Comment: https://git.openjdk.org/jdk/pull/17962#discussion_r1541508187 From epeter at openjdk.org Wed Mar 27 17:12:29 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 27 Mar 2024 17:12:29 GMT Subject: RFR: 8326421: Add jtreg test for large arrayCopy disjoint case. In-Reply-To: References: Message-ID: On Tue, 26 Mar 2024 16:49:36 GMT, Vladimir Kozlov wrote: >> Hi All, >> >> Added a new jtreg test case for large arrayCopy disjoint case. >> This will test byte array copy operation for aligned and non aligned cases with array length greater than 2.5MB. >> >> Please review and provide your feedback. >> >> Thanks, >> Swati >> Intel > > test/hotspot/jtreg/compiler/arraycopy/TestArrayCopyDisjointLarge.java line 32: > >> 30: * @summary Test large arrayCopy. >> 31: * >> 32: * @run main/othervm/timeout=600 -XX:-TieredCompilation -Xbatch compiler.arraycopy.TestArrayCopyDisjointLarge > > What was the reason to use these 2 flags `-XX:-TieredCompilation -Xbatch`? I would suggest that you remove the two flags, and then you can also remove the `othervm`. It will still be a `main` test, and the flags can be passed in from the outside. We do that in our testing infrastructure, i.e. run it with different sets of flags. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17962#discussion_r1541500695 From kvn at openjdk.org Wed Mar 27 17:36:23 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 27 Mar 2024 17:36:23 GMT Subject: RFR: 8300148: Consider using a StoreStore barrier instead of Release barrier on ctor exit [v3] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 16:31:49 GMT, Joshua Cao wrote: >> The [JSR 133 cookbook](https://gee.cs.oswego.edu/dl/jmm/cookbook.html) has long recommended using a `StoreStore` barrier at the end of constructors that write to final fields. `StoreStore` barriers are much cheaper on arm machines as shown in benchmarks in this issue as well as https://bugs.openjdk.org/browse/JDK-8324186. >> >> This change does not improve the case for constructors for objects with volatile fields because [MemBarRelease is emitted for volatile stores](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/gc/shared/c2/barrierSetC2.cpp#L211). This is demonstrated in test case `classWithVolatile`, where this patch does not impact the IR. >> >> I had to modify some code around escape analysis to make sure there are no regressions in eliminating allocations and `StoreStore`'s. The [current handling of StoreStore's in escape analysis](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/escape.cpp#L2590) makes the assumption that the barriers input is a `Proj` to an `Allocate` ([example](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/library_call.cpp#L1553)). This is contrary to the barriers in the end of the constructor where there the barrier directly takes in an `Allocate` without an in between `Proj`. I opted to instead eliminate `StoreStore`s in GVN, exactly how `MemBarRelease` is handled. 
>> >> I had to add [checks for StoreStore in macro.cpp](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/macro.cpp#L636), or else we fail some [cases for reducing allocation merges](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/test/hotspot/jtreg/compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java#L1233-L1256). >> >> Passes hotspot tier1 locally on a Linux machine. >> >> ### Benchmarks >> >> Running Renaissance ParMnemonics on an Amazon Graviton (arm) instance. >> >> Baseline: >> >> Result "org.renaissance.jdk.streams.JmhParMnemonics.run": >> N = 25 >> mean = 3309.611 ±(99.9%) 86.699 ms/op >> >> Histogram, ms/op: >> [3000.000, 3050.000) = 0 >> [3050.000, 3100.000) = 4 >> [3100.000, 3150.000) = 1 >> [3150.000, 3200.000) = 0 >> [3200.000, 3250.000) = 0 >> [3250.000, 3300.000) = 0 >> [3300.000, 3350.000) = 9 >> [3350.000, 3400.000) = 6 >> [3400.000, 3450.000) = 5 >> >> Percentiles, ms/op: >> p(0... > > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > EA tests, static test classes, add @requires, fix comment What about `MemBarRelease` in `PhaseStringOpts::replace_string_concat()`?
>> >> This change does not improve the case for constructors for objects with volatile fields because [MemBarRelease is emitted for volatile stores](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/gc/shared/c2/barrierSetC2.cpp#L211). This is demonstrated in test case `classWithVolatile`, where this patch does not impact the IR. >> >> I had to modify some code around escape analysis to make sure there are no regressions in eliminating allocations and `StoreStore`'s. The [current handling of StoreStore's in escape analysis](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/escape.cpp#L2590) makes the assumption that the barrier's input is a `Proj` to an `Allocate` ([example](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/library_call.cpp#L1553)). This is contrary to the barriers at the end of the constructor, where the barrier directly takes in an `Allocate` without an intervening `Proj`. I opted to instead eliminate `StoreStore`s in GVN, exactly how `MemBarRelease` is handled. >> >> I had to add [checks for StoreStore in macro.cpp](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/macro.cpp#L636), or else we fail some [cases for reducing allocation merges](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/test/hotspot/jtreg/compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java#L1233-L1256). >> >> Passes hotspot tier1 locally on a Linux machine. >> >> ### Benchmarks >> >> Running Renaissance ParMnemonics on an Amazon Graviton (arm) instance. 
>> >> Baseline: >> >> Result "org.renaissance.jdk.streams.JmhParMnemonics.run": >> N = 25 >> mean = 3309.611 ±(99.9%) 86.699 ms/op >> >> Histogram, ms/op: >> [3000.000, 3050.000) = 0 >> [3050.000, 3100.000) = 4 >> [3100.000, 3150.000) = 1 >> [3150.000, 3200.000) = 0 >> [3200.000, 3250.000) = 0 >> [3250.000, 3300.000) = 0 >> [3300.000, 3350.000) = 9 >> [3350.000, 3400.000) = 6 >> [3400.000, 3450.000) = 5 >> >> Percentiles, ms/op: >> p(0... > > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > EA tests, static test classes, add @requires, fix comment Can we also add statistics about how many different barriers C2 generates and eliminates? It will help to know if we are missing some optimization with these changes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18505#issuecomment-2023411270 From dnsimon at openjdk.org Wed Mar 27 18:38:47 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 27 Mar 2024 18:38:47 GMT Subject: RFR: 8329191: JVMCI compiler warning is truncated [v2] In-Reply-To: References: Message-ID: > $ java -Djdk.graal.CompilerConfiguration=XXcommunity HelloWorld > [0.035s][warning][jit,compilation] JVMCI compiler disabled after 11 of 11 upcalls had errors (Last error: "uncaught exception in call_HotSpotJVMCIRuntime_compileMethod [jdk.graal.compiler.debug.GraalError: Compiler configuration 'XXcommunity' not found. Available configurations are: enterp > > The above message is truncated. It should be: > > [0.032s][warning][jit,compilation] JVMCI compiler disabled after 11 of 11 upcalls had errors (Last error: "uncaught exception in call_HotSpotJVMCIRuntime_compileMethod [jdk.graal.compiler.debug.GraalError: Compiler configuration 'XXcommunity' not found. Available configurations are: enterprise, community, economy]"). Use -Xlog:jit+compilation for more detail. > > > This PR fixes this by using `stringStream` instead of `err_msg` when creating these messages. 
Doug Simon has updated the pull request incrementally with one additional commit since the last revision: converted more usages of err_msg to stringStream ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18513/files - new: https://git.openjdk.org/jdk/pull/18513/files/5f23d0e7..61d6486d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18513&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18513&range=00-01 Stats: 8 lines in 1 file changed: 4 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/18513.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18513/head:pull/18513 PR: https://git.openjdk.org/jdk/pull/18513 From kvn at openjdk.org Wed Mar 27 18:40:36 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 27 Mar 2024 18:40:36 GMT Subject: RFR: 8325252: C2 SuperWord: refactor the packset [v6] In-Reply-To: References: Message-ID: <_L3hmdtdADTbuCwlKXT95wYjzWNTYIy2oFOwj-UM-do=.22b70639-bc96-40e8-9d6a-83050f0e9856@github.com> On Wed, 27 Mar 2024 15:48:50 GMT, Emanuel Peter wrote: >> I'm refactoring the packset, separating the details of packset-manipulation from the SuperWord algorithm. >> >> Most importantly: I split it into two classes: `PairSet` and `PackSet`. >> `combine_pairs_to_longer_packs` converts the first into the second. >> >> I was able to simplify the combining, and remove the pack-sorting. >> I now walk "pair-chains" directly with `PairSetIterator`. One such pair-chain is equivalent to a pack. >> >> I moved all the `filter / split` functionality to the `PackSet`, which allows hiding a lot of packset-manipulation from the SuperWord algorithm. >> >> I ran into some issues when I was extending the pairset in `extend_pairset_with_more_pairs_by_following_use_and_def`: >> Using the PairSetIterator changed the order of extension, and that messed with the packing heuristic, and quite a few examples did not vectorize, because we would pack up the wrong 2 nodes out of a choice of 4 (e.g. 
we would pack `ac bd` instead of `ab cd`). Hence, I now still have to keep the insertion order for the pairs, and this basically means we are extending with a BFS order. Maybe this issue can be removed, if I improve the packing heuristic with some look-ahead expansion approach (but that is for another day [JDK-8309908](https://bugs.openjdk.org/browse/JDK-8309908)). >> >> But since I already spent some time on some of the packing heuristic (reordering and cost estimate), I did a light refactoring, and added extra tests for MulAddS2I. >> >> More details are described in the annotations in the code. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > more updates for Christian, batch 2 Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18276#pullrequestreview-1964396355 From simonis at openjdk.org Wed Mar 27 19:15:21 2024 From: simonis at openjdk.org (Volker Simonis) Date: Wed, 27 Mar 2024 19:15:21 GMT Subject: RFR: 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 [v3] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 16:18:37 GMT, Igor Veresov wrote: > Regarding the solution. You seem to be tapping into the `is_method_profiled()` functionality and applying the threshold rule meant for methods with MDOs, which is a bit hacky. 
How about a more straightforward way: > > ``` > diff --git a/src/hotspot/share/compiler/compilationPolicy.cpp b/src/hotspot/share/compiler/compilationPolicy.cpp > index d61de7cc866..57173ed621c 100644 > --- a/src/hotspot/share/compiler/compilationPolicy.cpp > +++ b/src/hotspot/share/compiler/compilationPolicy.cpp > @@ -1026,7 +1026,7 @@ CompLevel CompilationPolicy::common(const methodHandle& method, CompLevel cur_le > if (force_comp_at_level_simple(method)) { > next_level = CompLevel_simple; > } else { > - if (is_trivial(method)) { > + if (is_trivial(method) || method->is_native()) { > next_level = CompilationModeFlag::disable_intermediate() ? CompLevel_full_optimization : CompLevel_simple; > } else { > switch(cur_level) { > ``` > > What do you think? Would that work? I think it will work, but wouldn't that instantly create a native wrapper for *every* native method? Shouldn't we only create native wrappers for hot native methods? > Have you found out why exactly the wrappers are not created with `-XX:-TieredCompilation` ? I don't see any conditional with `is_native()` that would prevent that? 
The reason why they are not created is because native methods have no MDO, so we always bail out here:

    case CompLevel_full_profile: {
      MethodData* mdo = method->method_data();
      if (mdo != nullptr) {
        if (mdo->would_profile() || CompilationModeFlag::disable_intermediate()) {
          int mdo_i = mdo->invocation_count_delta();
          int mdo_b = mdo->backedge_count_delta();
          if (Predicate::apply(method, cur_level, mdo_i, mdo_b)) {
            next_level = CompLevel_full_optimization;
          }
        } else {
          next_level = CompLevel_full_optimization;
        }
      }

------------- PR Comment: https://git.openjdk.org/jdk/pull/18496#issuecomment-2023771418 PR Comment: https://git.openjdk.org/jdk/pull/18496#issuecomment-2023777728 From iveresov at openjdk.org Wed Mar 27 20:08:42 2024 From: iveresov at openjdk.org (Igor Veresov) Date: Wed, 27 Mar 2024 20:08:42 GMT Subject: RFR: 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 [v3] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 19:10:58 GMT, Volker Simonis wrote: > I think it will work, but wouldn't that instantly create a native wrapper for *every* native method? Shouldn't we only create native wrappers for hot native methods? To get here it will have to be already relatively hot. And since we don't need to profile it I think it would be fair to treat it exactly the same way we treat trivial methods. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18496#issuecomment-2023840629 From never at openjdk.org Wed Mar 27 20:08:44 2024 From: never at openjdk.org (Tom Rodriguez) Date: Wed, 27 Mar 2024 20:08:44 GMT Subject: RFR: 8329191: JVMCI compiler warning is truncated [v2] In-Reply-To: References: Message-ID: <3NZUiuCdICcF71jqezfOLHPIZWIlBwp0UjOZNiTm--8=.22addd64-df8e-4621-9b28-76a75fa6116d@github.com> On Wed, 27 Mar 2024 18:38:47 GMT, Doug Simon wrote: >> $ java -Djdk.graal.CompilerConfiguration=XXcommunity HelloWorld >> [0.035s][warning][jit,compilation] JVMCI compiler disabled after 11 of 11 upcalls had errors (Last error: "uncaught exception in call_HotSpotJVMCIRuntime_compileMethod [jdk.graal.compiler.debug.GraalError: Compiler configuration 'XXcommunity' not found. Available configurations are: enterp >> >> The above message is truncated. It should be: >> >> [0.032s][warning][jit,compilation] JVMCI compiler disabled after 11 of 11 upcalls had errors (Last error: "uncaught exception in call_HotSpotJVMCIRuntime_compileMethod [jdk.graal.compiler.debug.GraalError: Compiler configuration 'XXcommunity' not found. Available configurations are: enterprise, community, economy]"). Use -Xlog:jit+compilation for more detail. >> >> >> This PR fixes this by using `stringStream` instead of `err_msg` when creating these messages. > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > converted more usages of err_msg to stringStream looks good. ------------- Marked as reviewed by never (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/18513#pullrequestreview-1964610928 From kvn at openjdk.org Wed Mar 27 22:01:36 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 27 Mar 2024 22:01:36 GMT Subject: Integrated: 8328986: Deprecate UseRTM* flags for removal In-Reply-To: References: Message-ID: On Mon, 25 Mar 2024 20:41:20 GMT, Vladimir Kozlov wrote: > HotSpot has supported RTM (restricted transactional memory) for locking since JDK 8 on Intel's processors ([JDK-8031320](https://bugs.openjdk.org/browse/JDK-8031320)). It was added to other platforms but has since been disabled and removed on all but Intel's processors. There was an attempt to deprecate it ([JDK-8292082](https://bugs.openjdk.org/browse/JDK-8292082)) during JDK 20 development, but at that time it was decided to keep it. Recently we discussed this with Intel and they agreed with RTM deprecation and removal from HotSpot. > > RTM adds complexity and maintenance burden to HotSpot locking code. It was never enabled by default because it only helped in some cases of heavy lock contention. We have not been testing the RTM feature since JDK 14, when we problem-listed the related tests: [JDK-8226899](https://bugs.openjdk.org/browse/JDK-8226899). > > The new LIGHTWEIGHT locking implementation will not support RTM locking: [JDK-8320321](https://bugs.openjdk.org/browse/JDK-8320321). > > I propose to deprecate the related flags and remove the flags and all related code in a later release. > > Changes are based on @rkennke's changes for JDK 20 [#9810](https://github.com/openjdk/jdk/pull/9810) > > Testing: tier1 This pull request has now been integrated. 
Changeset: 3eb1d05d Author: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/3eb1d05d853e92949bf239ac4b88436a4fe0997d Stats: 173 lines in 4 files changed: 92 ins; 77 del; 4 mod 8328986: Deprecate UseRTM* flags for removal Co-authored-by: Roman Kennke Reviewed-by: vlivanov, sviswanathan, dholmes ------------- PR: https://git.openjdk.org/jdk/pull/18478 From liach at openjdk.org Wed Mar 27 22:31:31 2024 From: liach at openjdk.org (Chen Liang) Date: Wed, 27 Mar 2024 22:31:31 GMT Subject: RFR: 8300148: Consider using a StoreStore barrier instead of Release barrier on ctor exit [v3] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 16:31:49 GMT, Joshua Cao wrote: >> The [JSR 133 cookbook](https://gee.cs.oswego.edu/dl/jmm/cookbook.html) has long recommended using a `StoreStore` barrier at the end of constructors that write to final fields. `StoreStore` barriers are much cheaper on arm machines as shown in benchmarks in this issue as well as https://bugs.openjdk.org/browse/JDK-8324186. >> >> This change does not improve the case for constructors for objects with volatile fields because [MemBarRelease is emitted for volatile stores](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/gc/shared/c2/barrierSetC2.cpp#L211). This is demonstrated in test case `classWithVolatile`, where this patch does not impact the IR. >> >> I had to modify some code around escape analysis to make sure there are no regressions in eliminating allocations and `StoreStore`'s. The [current handling of StoreStore's in escape analysis](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/escape.cpp#L2590) makes the assumption that the barrier's input is a `Proj` to an `Allocate` ([example](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/library_call.cpp#L1553)). 
This is contrary to the barriers at the end of the constructor, where the barrier directly takes in an `Allocate` without an intervening `Proj`. I opted to instead eliminate `StoreStore`s in GVN, exactly how `MemBarRelease` is handled. >> >> I had to add [checks for StoreStore in macro.cpp](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/src/hotspot/share/opto/macro.cpp#L636), or else we fail some [cases for reducing allocation merges](https://github.com/openjdk/jdk/blob/8fc9097b3720314ef7efaf1f3ac31898c8d6ca19/test/hotspot/jtreg/compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java#L1233-L1256). >> >> Passes hotspot tier1 locally on a Linux machine. >> >> ### Benchmarks >> >> Running Renaissance ParMnemonics on an Amazon Graviton (arm) instance. >> >> Baseline: >> >> Result "org.renaissance.jdk.streams.JmhParMnemonics.run": >> N = 25 >> mean = 3309.611 ±(99.9%) 86.699 ms/op >> >> Histogram, ms/op: >> [3000.000, 3050.000) = 0 >> [3050.000, 3100.000) = 4 >> [3100.000, 3150.000) = 1 >> [3150.000, 3200.000) = 0 >> [3200.000, 3250.000) = 0 >> [3250.000, 3300.000) = 0 >> [3300.000, 3350.000) = 9 >> [3350.000, 3400.000) = 6 >> [3400.000, 3450.000) = 5 >> >> Percentiles, ms/op: >> p(0... > > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > EA tests, static test classes, add @requires, fix comment I heard rumors that storeStore is only safe for the scenarios where the constructor doesn't read its already assigned final fields; so if we have something like

    class Sample {
        final int a, b;
        Sample(int v) {
            this.a = v;
            this.b = this.a + 1; // performs a read of an instance field before publication
        }
    }

then we still need a regular release barrier. Am I correct here? 
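(For context: at the Java source level, the trailing barrier the cookbook recommends corresponds to an explicit `VarHandle.storeStoreFence()` placed before the object is published. The sketch below only illustrates that placement — whether a plain StoreStore suffices when the constructor also reads its own final fields is exactly the question above, and the class name here is made up for illustration:)

```java
import java.lang.invoke.VarHandle;

public class FinalFieldPublication {
    static final class Sample {
        final int a;
        final int b;
        Sample(int v) {
            this.a = v;
            this.b = this.a + 1;         // reads an already-assigned final field
            // Cookbook-style StoreStore: keeps the final-field stores from
            // reordering past the store that publishes the object reference.
            VarHandle.storeStoreFence();
        }
    }

    public static void main(String[] args) {
        Sample s = new Sample(1);
        System.out.println(s.a + "," + s.b);
    }
}
```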
------------- PR Comment: https://git.openjdk.org/jdk/pull/18505#issuecomment-2024094598 From duke at openjdk.org Wed Mar 27 22:37:44 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Wed, 27 Mar 2024 22:37:44 GMT Subject: RFR: 8323116: [REDO] Computational test more than 2x slower when AVX instructions are used [v2] In-Reply-To: <8HUy9c75ZHxW1CyRw6J-xXBPNbtB7wrxoM6rha0ftNU=.d25e5130-84c2-4046-b357-eacfe6caedc0@github.com> References: <8HUy9c75ZHxW1CyRw6J-xXBPNbtB7wrxoM6rha0ftNU=.d25e5130-84c2-4046-b357-eacfe6caedc0@github.com> Message-ID: > The goal of this PR is to improve the performance of convert instructions and address the slowdown when AVX>0 is used. > > The performance data using the ComputePI.java benchmark (part of this PR) is as follows: >
> Benchmark (ns/op) | Stock JDK | This PR (AVX=3) | Speedup
> -- | -- | -- | --
> ComputePI.compute_pi_dbl_flt | 511.34 | 511.226 | 1.0
> ComputePI.compute_pi_flt_dbl | 2024.06 | 541.544 | 3.7
> ComputePI.compute_pi_int_dbl | 695.482 | 506.546 | 1.4
> ComputePI.compute_pi_int_flt | 799.268 | 450.298 | 1.8
> ComputePI.compute_pi_long_dbl | 802.992 | 577.984 | 1.4
> ComputePI.compute_pi_long_flt | 628.62 | 549.057 | 1.1
>
> Benchmark (ns/op) | Stock JDK | This PR (AVX=0) | Speedup
> -- | -- | -- | --
> ComputePI.compute_pi_dbl_flt | 473.778 | 472.529 | 1.0
> ComputePI.compute_pi_flt_dbl | 536.004 | 538.418 | 1.0
> ComputePI.compute_pi_int_dbl | 458.08 | 460.245 | 1.0
> ComputePI.compute_pi_int_flt | 477.305 | 476.975 | 1.0
> ComputePI.compute_pi_long_dbl | 455.132 | 455.064 | 1.0
> ComputePI.compute_pi_long_flt | 474.734 | 476.571 | 1.0
> Srinivas Vamsi Parasa has updated the pull request incrementally with three additional commits since the last revision: - Update ComputePI.java - Update ComputePI.java - Update TestConvertImplicitNullCheck.java ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18503/files - new: https://git.openjdk.org/jdk/pull/18503/files/556f3bb4..1b17e844 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18503&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18503&range=00-01 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18503.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18503/head:pull/18503 PR: https://git.openjdk.org/jdk/pull/18503 From duke at openjdk.org Wed Mar 27 23:52:47 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Wed, 27 Mar 2024 23:52:47 GMT Subject: RFR: 8323116: [REDO] Computational test more than 2x slower when AVX instructions are used [v3] In-Reply-To: <8HUy9c75ZHxW1CyRw6J-xXBPNbtB7wrxoM6rha0ftNU=.d25e5130-84c2-4046-b357-eacfe6caedc0@github.com> References: <8HUy9c75ZHxW1CyRw6J-xXBPNbtB7wrxoM6rha0ftNU=.d25e5130-84c2-4046-b357-eacfe6caedc0@github.com> Message-ID: > The goal of this PR is to improve the performance of convert instructions and address the slowdown when AVX>0 is used. > > The performance data using the ComputePI.java benchmark (part of this PR) is as follows: >
> Benchmark (ns/op) | Stock JDK | This PR (AVX=3) | Speedup
> -- | -- | -- | --
> ComputePI.compute_pi_dbl_flt | 511.34 | 511.226 | 1.0
> ComputePI.compute_pi_flt_dbl | 2024.06 | 541.544 | 3.7
> ComputePI.compute_pi_int_dbl | 695.482 | 506.546 | 1.4
> ComputePI.compute_pi_int_flt | 799.268 | 450.298 | 1.8
> ComputePI.compute_pi_long_dbl | 802.992 | 577.984 | 1.4
> ComputePI.compute_pi_long_flt | 628.62 | 549.057 | 1.1
>
> Benchmark (ns/op) | Stock JDK | This PR (AVX=0) | Speedup
> -- | -- | -- | --
> ComputePI.compute_pi_dbl_flt | 473.778 | 472.529 | 1.0
> ComputePI.compute_pi_flt_dbl | 536.004 | 538.418 | 1.0
> ComputePI.compute_pi_int_dbl | 458.08 | 460.245 | 1.0
> ComputePI.compute_pi_int_flt | 477.305 | 476.975 | 1.0
> ComputePI.compute_pi_long_dbl | 455.132 | 455.064 | 1.0
> ComputePI.compute_pi_long_flt | 474.734 | 476.571 | 1.0
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: fix cause for failing PowTests.java ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18503/files - new: https://git.openjdk.org/jdk/pull/18503/files/1b17e844..fad7180e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18503&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18503&range=01-02 Stats: 16 lines in 2 files changed: 12 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/18503.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18503/head:pull/18503 PR: https://git.openjdk.org/jdk/pull/18503 From duke at openjdk.org Thu Mar 28 00:45:33 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 28 Mar 2024 00:45:33 GMT Subject: RFR: 8323116: [REDO] Computational test more than 2x slower when AVX instructions are used [v4] In-Reply-To: <8HUy9c75ZHxW1CyRw6J-xXBPNbtB7wrxoM6rha0ftNU=.d25e5130-84c2-4046-b357-eacfe6caedc0@github.com> References: <8HUy9c75ZHxW1CyRw6J-xXBPNbtB7wrxoM6rha0ftNU=.d25e5130-84c2-4046-b357-eacfe6caedc0@github.com> Message-ID: <-niLtlGA1SlJaGnF7lPakY0eXX1njpTFpuTuWslhMMc=.846d48d4-74ee-43cf-9b95-9665f5fe0816@github.com> > The goal of this PR is to improve the performance of convert instructions and address the slowdown when AVX>0 is used. 
> > The performance data using the ComputePI.java benchmark (part of this PR) is as follows:
>
> Benchmark (ns/op) | Stock JDK | This PR (AVX=3) | Speedup
> -- | -- | -- | --
> ComputePI.compute_pi_dbl_flt | 511.34 | 510.989 | 1.0
> ComputePI.compute_pi_flt_dbl | 2024.06 | 518.695 | 3.9
> ComputePI.compute_pi_int_dbl | 695.482 | 453.054 | 1.5
> ComputePI.compute_pi_int_flt | 799.268 | 449.83 | 1.8
> ComputePI.compute_pi_long_dbl | 802.992 | 454.891 | 1.8
> ComputePI.compute_pi_long_flt | 628.62 | 627.725 | 1.0
>
> Benchmark (ns/op) | Stock JDK | This PR (AVX=0) | Speedup
> -- | -- | -- | --
> ComputePI.compute_pi_dbl_flt | 473.778 | 472.529 | 1.0
> ComputePI.compute_pi_flt_dbl | 536.004 | 538.418 | 1.0
> ComputePI.compute_pi_int_dbl | 458.08 | 460.245 | 1.0
> ComputePI.compute_pi_int_flt | 477.305 | 476.975 | 1.0
> ComputePI.compute_pi_long_dbl | 455.132 | 455.064 | 1.0
> ComputePI.compute_pi_long_flt | 474.734 | 476.571 | 1.0
> > > > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: fix L2F cvtsi2ssq ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18503/files - new: https://git.openjdk.org/jdk/pull/18503/files/fad7180e..970716f4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18503&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18503&range=02-03 Stats: 6 lines in 1 file changed: 3 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18503.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18503/head:pull/18503 PR: https://git.openjdk.org/jdk/pull/18503 From dlong at openjdk.org Thu Mar 28 02:45:36 2024 From: dlong at openjdk.org (Dean Long) Date: Thu, 28 Mar 2024 02:45:36 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v7] In-Reply-To: References: Message-ID: On Tue, 26 Mar 2024 20:45:45 GMT, Dean Long wrote: >>> I don't think target-specific logic belongs here. And I don't understand the point about Phi nodes. Isn't the holder_known flag enough? >> >> In my testing `holder_known` was not enough to detect objects that are not Phi. For example: >> >> >> static int[] test(int[] ints) >> { >> return ints.clone(); >> } >> >> >> `holder_known` is false when it tries to C1 compile `ints.clone()`, am I missing something here? >> >>> For primitive arrays, isn't it true that inline_target->get_Method()->intrinsic_id() == vmIntrinsics::_clone? >> >> Possibly, but in this part of the logic I'm trying to find situations in which I don't want to apply the `clone` intrinsic. And those situations are non-array objects, and for arrays, those whose elements are not primitives. I don't see how I can craft such a condition with only `inline_target->get_Method()->intrinsic_id() == vmIntrinsics::_clone`? IOW, that condition might be true for primitive arrays, but is it false for non-array objects and non-primitive arrays? 
> > You're right about holder_known, but why do you need to check for _clone specifically at line 2137? If there is logic missing that prevents an inlining attempt then I think it should be fixed first, rather than in a followup. > > And I see that you need to do a receiver type check to allow only primitive arrays. Can you do that in append_alloc_array_copy, and bailout if not successful? The logic in build_graph_for_intrinsic would need to change slightly to support this. I was able to remove the clone-specific logic in invoke() in two parts: 1. fix the type_is_exact logic to allow array receiver 2. move primitive array receiver check into append_alloc_array_copy ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17667#discussion_r1542244166 From duke at openjdk.org Thu Mar 28 02:55:57 2024 From: duke at openjdk.org (lusou-zhangquan) Date: Thu, 28 Mar 2024 02:55:57 GMT Subject: RFR: 8329174: update CodeBuffer layout in comment after constants section moved Message-ID: Enhancement [JDK-6961697](https://bugs.openjdk.org/browse/JDK-6961697) moved nmethod constants section before instruction section, but the layout scheme in codeBuffer.cpp was not changed correspondingly. The mismatch between layout scheme in source code and actual layout is misleading, so we'd better fix it. 
------------- Commit messages: - 8329174: update CodeBuffer layout in comment after constants section moved Changes: https://git.openjdk.org/jdk/pull/18529/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18529&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8329174 Stats: 9 lines in 1 file changed: 4 ins; 4 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18529.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18529/head:pull/18529 PR: https://git.openjdk.org/jdk/pull/18529 From stuefe at openjdk.org Thu Mar 28 06:53:31 2024 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 28 Mar 2024 06:53:31 GMT Subject: RFR: JDK-8327986: ASAN reports use-after-free in DirectivesParserTest.empty_object_vm In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 13:57:53 GMT, Thomas Stuefe wrote: > ASAN reports a use-after-free, because we feed the string we got from `setlocale` back to `setlocale`, but the libc owns this string, and the libc decided to free it in the meantime. > > According to POSIX, it should be valid to pass into setlocale output from setlocale. > > However, glibc seems to delete the old string when calling setlocale again: > > https://codebrowser.dev/glibc/glibc/locale/setlocale.c.html#198 > > Best to make a copy, and pass in the copy to setlocale. @djelinski are you okay with this fix? If yes, mind hitting the green button? Thanks! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18235#issuecomment-2024519830 From djelinski at openjdk.org Thu Mar 28 07:05:32 2024 From: djelinski at openjdk.org (Daniel =?UTF-8?B?SmVsacWEc2tp?=) Date: Thu, 28 Mar 2024 07:05:32 GMT Subject: RFR: JDK-8327986: ASAN reports use-after-free in DirectivesParserTest.empty_object_vm In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 13:57:53 GMT, Thomas Stuefe wrote: > ASAN reports a use-after-free, because we feed the string we got from `setlocale` back to `setlocale`, but the libc owns this string, and the libc decided to free it in the meantime. > > According to POSIX, it should be valid to pass into setlocale output from setlocale. > > However, glibc seems to delete the old string when calling setlocale again: > > https://codebrowser.dev/glibc/glibc/locale/setlocale.c.html#198 > > Best to make a copy, and pass in the copy to setlocale. Sure. LGTM. ------------- Marked as reviewed by djelinski (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18235#pullrequestreview-1965317277 From stuefe at openjdk.org Thu Mar 28 07:12:42 2024 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 28 Mar 2024 07:12:42 GMT Subject: RFR: JDK-8327986: ASAN reports use-after-free in DirectivesParserTest.empty_object_vm In-Reply-To: References: Message-ID: On Tue, 26 Mar 2024 16:43:34 GMT, Vladimir Kozlov wrote: >> ASAN reports a use-after-free, because we feed the string we got from `setlocale` back to `setlocale`, but the libc owns this string, and the libc decided to free it in the meantime. >> >> According to POSIX, it should be valid to pass into setlocale output from setlocale. >> >> However, glibc seems to delete the old string when calling setlocale again: >> >> https://codebrowser.dev/glibc/glibc/locale/setlocale.c.html#198 >> >> Best to make a copy, and pass in the copy to setlocale. > > Looks good. 
Thanks @vnkozlov and @djelinski ------------- PR Comment: https://git.openjdk.org/jdk/pull/18235#issuecomment-2024545864 From stuefe at openjdk.org Thu Mar 28 07:12:42 2024 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 28 Mar 2024 07:12:42 GMT Subject: Integrated: JDK-8327986: ASAN reports use-after-free in DirectivesParserTest.empty_object_vm In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 13:57:53 GMT, Thomas Stuefe wrote: > ASAN reports a use-after-free, because we feed the string we got from `setlocale` back to `setlocale`, but the libc owns this string, and the libc decided to free it in the meantime. > > According to POSIX, it should be valid to pass into setlocale output from setlocale. > > However, glibc seems to delete the old string when calling setlocale again: > > https://codebrowser.dev/glibc/glibc/locale/setlocale.c.html#198 > > Best to make a copy, and pass in the copy to setlocale. This pull request has now been integrated. Changeset: 47f33a59 Author: Thomas Stuefe URL: https://git.openjdk.org/jdk/commit/47f33a59eaaffc74881fcc9e29d13ff9b2538c2a Stats: 3 lines in 1 file changed: 1 ins; 0 del; 2 mod 8327986: ASAN reports use-after-free in DirectivesParserTest.empty_object_vm Reviewed-by: kvn, djelinski ------------- PR: https://git.openjdk.org/jdk/pull/18235 From duke at openjdk.org Thu Mar 28 08:47:31 2024 From: duke at openjdk.org (Fabio Niephaus) Date: Thu, 28 Mar 2024 08:47:31 GMT Subject: RFR: 8329191: JVMCI compiler warning is truncated [v2] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 18:38:47 GMT, Doug Simon wrote: >> $ java -Djdk.graal.CompilerConfiguration=XXcommunity HelloWorld >> [0.035s][warning][jit,compilation] JVMCI compiler disabled after 11 of 11 upcalls had errors (Last error: "uncaught exception in call_HotSpotJVMCIRuntime_compileMethod [jdk.graal.compiler.debug.GraalError: Compiler configuration 'XXcommunity' not found. Available configurations are: enterp >> >> The above message is truncated. 
It should be: >> >> [0.032s][warning][jit,compilation] JVMCI compiler disabled after 11 of 11 upcalls had errors (Last error: "uncaught exception in call_HotSpotJVMCIRuntime_compileMethod [jdk.graal.compiler.debug.GraalError: Compiler configuration 'XXcommunity' not found. Available configurations are: enterprise, community, economy]"). Use -Xlog:jit+compilation for more detail. >> >> >> This PR fixes this by using `stringStream` instead of `err_msg` when creating these messages. > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > converted more usages of err_msg to stringStream Marked as reviewed by fniephaus at github.com (no known OpenJDK username). ------------- PR Review: https://git.openjdk.org/jdk/pull/18513#pullrequestreview-1965498763 From roland at openjdk.org Thu Mar 28 09:21:41 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 28 Mar 2024 09:21:41 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> <0RKnLUgc6UBtyxSyezCMWsSbP50hu6fQ6UJPHpGlgSU=.9fafa10f-62ee-4ec8-9093-4e204fcbe504@github.com> Message-ID: On Mon, 25 Mar 2024 14:16:20 GMT, Roland Westrelin wrote: >>> > You can try to use `TestFramework::assertDeoptimizedByC2()` which skips the assertion for some unstable setups like having `PerMethodTrapLimit == 0`: >>> >>> Thanks for the suggestion! I used it to fix the test. @eme64 would you mind re-running tests? >> >> Minor detail: You should use `TestFramework::assertDeoptimizedByC2()` instead of `TestVM::assertDeoptimizedByC2()`. `TestVM` should only be called internally by the framework. > >> Minor detail: You should use `TestFramework::assertDeoptimizedByC2()` instead of `TestVM::assertDeoptimizedByC2()`. `TestVM` should only be called internally by the framework. > > Thanks for checking the change. I fixed it in the new commit. 
> @rwestrel Great, yes just launched it. Feel free to ask in a day or 2 if I don't report back by then! @eme64 any update on testing? ------------- PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-2024739351 From dlunden at openjdk.org Thu Mar 28 09:30:39 2024 From: dlunden at openjdk.org (Daniel Lundén) Date: Thu, 28 Mar 2024 09:30:39 GMT Subject: RFR: 8323682: C2: guard check is not generated in Arrays.copyOfRange intrinsic when allocation is eliminated by EA [v2] In-Reply-To: References: Message-ID: <72JHYYeCURk3GDEn5qo4PAjrwrhtlEuhQoJ5t9klxGA=.9e4d9a1c-ec00-4375-8feb-d66429336772@github.com> On Mon, 25 Mar 2024 14:40:21 GMT, Tobias Hartmann wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Readd negative end check > > That looks good to me. Thanks for the reviews @TobiHartmann and @vnkozlov. Please sponsor! ------------- PR Comment: https://git.openjdk.org/jdk/pull/18472#issuecomment-2024755137 From epeter at openjdk.org Thu Mar 28 09:56:40 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 28 Mar 2024 09:56:40 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> <0RKnLUgc6UBtyxSyezCMWsSbP50hu6fQ6UJPHpGlgSU=.9fafa10f-62ee-4ec8-9093-4e204fcbe504@github.com> Message-ID: <5QbsVmYi0tYGlOvDL4LjJb1SjChIZtaWSMthFM9grMI=.0900e1c3-90b3-4726-a7c6-c2aff49d07ce@github.com> On Thu, 28 Mar 2024 09:18:58 GMT, Roland Westrelin wrote: >>> Minor detail: You should use `TestFramework::assertDeoptimizedByC2()` instead of `TestVM::assertDeoptimizedByC2()`. `TestVM` should only be called internally by the framework. >> >> Thanks for checking the change. I fixed it in the new commit.
@rwestrel thanks for asking. About 10% seems to still be scheduled and have not completed, on `macosx-x64`. But the rest seems fine. I'll re-review next week :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-2024804232 From dnsimon at openjdk.org Thu Mar 28 10:19:42 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 28 Mar 2024 10:19:42 GMT Subject: RFR: 8329191: JVMCI compiler warning is truncated [v2] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 18:38:47 GMT, Doug Simon wrote: >> $ java -Djdk.graal.CompilerConfiguration=XXcommunity HelloWorld >> [0.035s][warning][jit,compilation] JVMCI compiler disabled after 11 of 11 upcalls had errors (Last error: "uncaught exception in call_HotSpotJVMCIRuntime_compileMethod [jdk.graal.compiler.debug.GraalError: Compiler configuration 'XXcommunity' not found. Available configurations are: enterp >> >> The above message is truncated. It should be: >> >> [0.032s][warning][jit,compilation] JVMCI compiler disabled after 11 of 11 upcalls had errors (Last error: "uncaught exception in call_HotSpotJVMCIRuntime_compileMethod [jdk.graal.compiler.debug.GraalError: Compiler configuration 'XXcommunity' not found. Available configurations are: enterprise, community, economy]"). Use -Xlog:jit+compilation for more detail. >> >> >> This PR fixes this by using `stringStream` instead of `err_msg` when creating these messages. > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > converted more usages of err_msg to stringStream Thanks for the reviews. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18513#issuecomment-2024843673 From dnsimon at openjdk.org Thu Mar 28 10:19:42 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 28 Mar 2024 10:19:42 GMT Subject: Integrated: 8329191: JVMCI compiler warning is truncated In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 14:00:41 GMT, Doug Simon wrote: > $ java -Djdk.graal.CompilerConfiguration=XXcommunity HelloWorld > [0.035s][warning][jit,compilation] JVMCI compiler disabled after 11 of 11 upcalls had errors (Last error: "uncaught exception in call_HotSpotJVMCIRuntime_compileMethod [jdk.graal.compiler.debug.GraalError: Compiler configuration 'XXcommunity' not found. Available configurations are: enterp > > The above message is truncated. It should be: > > [0.032s][warning][jit,compilation] JVMCI compiler disabled after 11 of 11 upcalls had errors (Last error: "uncaught exception in call_HotSpotJVMCIRuntime_compileMethod [jdk.graal.compiler.debug.GraalError: Compiler configuration 'XXcommunity' not found. Available configurations are: enterprise, community, economy]"). Use -Xlog:jit+compilation for more detail. > > > This PR fixes this by using `stringStream` instead of `err_msg` when creating these messages. This pull request has now been integrated. Changeset: 7c7b961e Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/7c7b961e732d1bef3c95a69758e283c8fb32fff6 Stats: 18 lines in 3 files changed: 10 ins; 0 del; 8 mod 8329191: JVMCI compiler warning is truncated Reviewed-by: never ------------- PR: https://git.openjdk.org/jdk/pull/18513 From duke at openjdk.org Thu Mar 28 11:45:32 2024 From: duke at openjdk.org (SUN Guoyun) Date: Thu, 28 Mar 2024 11:45:32 GMT Subject: RFR: 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode [v2] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 09:32:29 GMT, Emanuel Peter wrote: > > What exactly is it that gives you the speedup in your benchmark? Spilling? 
> Fewer add instructions? Would be nice to understand that better, and see what are potential examples where we would have regressions with your patch.

That is, fewer spills and fewer add instructions make the benchmark faster.

before opto:
	   subq    rsp, #32	# Create frame
02a     movq    RBP, [RSI + #16 (8-bit)]	# long ! Field: CallNode.val
02e     leaq    R10, [RBP + #3]
032     movq    [rsp + #0], R10	# spill
        nop 	# 1 bytes pad for loops and calls
037     call,static  CallNode::callNoInlineMethod
        # CallNode::test @ bci:10 (line 11) L[0]=_ L[1]=rsp + #0 L[2]=_
        # OopMap {off=60/0x3c}

044     B2: #	out( N41 ) <- in( B1 )  Freq: 0.99998
        # Block is sole successor of call
044     addq    RAX, RBP	# long
047     addq    RAX, #3	# long
04b     addq    rsp, 32	# Destroy frame
after opto:
  	   subq    rsp, #16	# Create frame
02a     movl    RBP, #3	# long (unsigned 32-bit)
02f     addq    RBP, [RSI + #16 (8-bit)]	# long
033     call,static  CallNode::callNoInlineMethod
        # CallNode::test @ bci:10 (line 11) L[0]=_ L[1]=RBP L[2]=_
        # OopMap {off=56/0x38}

040     B2: #	out( N36 ) <- in( B1 )  Freq: 0.99998
        # Block is sole successor of call
040     addq    RAX, RBP	# long
043     addq    rsp, 16	# Destroy frame
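The Java shape behind this assembly can be sketched roughly as follows. This is a hypothetical reconstruction: `CallNode.val` and `callNoInlineMethod` appear in the listings above, but the exact source is assumed. The point is that keeping `(val + 3)` unreassociated lets the constant add complete before the non-inlined call instead of keeping an intermediate value live (and spilled) across it.

```java
// Rough sketch of the code shape discussed above (names are assumptions).
public class CallNode {
    static long val = 39;

    static long callNoInlineMethod() {   // assumed to stay uninlined
        return val;
    }

    static long test() {
        // shape: (val + 3) + call() — without reassociation, "val + 3"
        // is computed up front and only one value is live across the call
        return (val + 3) + callNoInlineMethod();
    }

    public static void main(String[] args) {
        System.out.println(test());   // 39 + 3 + 39 = 81
    }
}
```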

------------- PR Comment: https://git.openjdk.org/jdk/pull/18482#issuecomment-2024984725 From jbhateja at openjdk.org Thu Mar 28 11:45:40 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 28 Mar 2024 11:45:40 GMT Subject: RFR: 8329254: optimize integral reverse operations on x86 GFNI target. Message-ID: <1i51xczi3Q5WG46f6dBmgkBzrKIo4aHi4M5t54ElymA=.4cc9f7ed-533a-480f-9177-cb3f534fa36c@github.com> - Efficient GFNI based instruction sequence to compute integral reverse operation was added along with JEP-426 (VectorAPI 4th Incubation). https://bugs.openjdk.org/browse/JDK-8284960 - However, the CPUID based feature detection for GFNI was incorrectly performed under AVX512 check, fixing it show roughly 2X performance improvement for Integer/Long.reverse APIs on E-core targets (MTL+). BaseLine: Benchmark (size) Mode Cnt Score Error Units Integers.reverse 500 avgt 2 0.120 us/op Longs.reverse 500 avgt 2 0.221 us/op Withopt: Benchmark (size) Mode Cnt Score Error Units Integers.reverse 500 avgt 2 0.050 us/op Longs.reverse 500 avgt 2 0.086 us/op Kindly review. Best Regards, Jatin ------------- Commit messages: - 8329254: optimize integral reverse operations on x86 GFNI target. Changes: https://git.openjdk.org/jdk/pull/18530/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18530&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8329254 Stats: 4 lines in 1 file changed: 2 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18530.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18530/head:pull/18530 PR: https://git.openjdk.org/jdk/pull/18530 From epeter at openjdk.org Thu Mar 28 12:00:37 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 28 Mar 2024 12:00:37 GMT Subject: RFR: 8324517: C2: crash in compiled code because of dependency on removed range check CastIIs In-Reply-To: References: Message-ID: On Tue, 19 Mar 2024 13:21:49 GMT, Roland Westrelin wrote: > Range check `CastII` nodes are removed once loop opts are over. 
The > test case for this change includes 3 cases where elimination of a > range check `CastII` causes a crash in compiled code because either an > out-of-bounds array load or a division by zero happens. > > In `test1`: > > - the range checks for the `array[otherArray.length]` loads constant > fold: `otherArray.length` is a `CastII` of i at the `otherArray` > allocation. `i` is less than 9. The `CastII` at the allocation > narrows the type down further to `[0-9]`. > > - the `array[otherArray.length]` loads are control dependent on the > unrelated: > > > if (flag == 0) { > > > test. There's an identical dominating test which replaces that one. As > a consequence, the `array[otherArray.length]` loads become control > dependent on the dominating test. > > - The `CastII` nodes at the `otherArray` allocations are replaced by > dominating range check `CastII` nodes for: > > > newArray[i] = 42; > > > - After loop opts, the range check `CastII` nodes are removed and the > 2 `array[otherArray.length]` loads common at the first: > > > if (flag == 0) { > > > test before the: > > > float[] otherArray = new float[i]; > > > and > > > newArray[i] = 42; > > > that guarantee `i` is positive. > > - `test1` is called with `i = -1`, the array load proceeds with an out > of bounds index and the crash occurs. > > > `test2` and `test3` are mostly identical except for the check that's > eliminated (a null divisor check) and the instruction that causes a > fault (an integer division). > > The fix I propose is to not eliminate range check `CastII` nodes after > loop opts. When range check `CastII` nodes were introduced, performance > was observed to regress. Removing them after loop opts was found to > preserve both correctness and performance. Today, the performance > regression still exists when `CastII` nodes are left in.
So I propose > we keep them until the end of optimizations (so the 2 array loads > above don't lose a dependency and wrongly common) but remove them at > the end of all optimizations. > > In the case of the array loads, they are dependent on a range check > for another array through a range check `CastII` and we must not lose > that dependency, otherwise the array loads could float above the range > check at gcm time. I propose we deal with that problem the way it's > handled for `CastPP` nodes: add the dependency to the load (or > division) nodes as a precedence edge when the cast is removed. > > @TobiHartmann ran performance testing for that patch (Thanks!) and reported > no regression. Thanks @rwestrel ! Generally makes sense, I have a few suggestions and questions. src/hotspot/share/opto/castnode.cpp line 222: > 220: if (!_range_check_dependency) { > 221: res = widen_type(phase, res, T_INT); > 222: } Can you explain why you changed this, and why it is ok? src/hotspot/share/opto/castnode.cpp line 240: > 238: phase->C->record_for_post_loop_opts_igvn(this); > 239: } > 240: if (!_type->is_int()->empty()) { Can you also explain this change, please? src/hotspot/share/opto/compile.cpp line 3898: > 3896: // as a precedence edge, so they can't float above the cast in case that cast's narrowed type helped eliminated a > 3897: // range check or a null divisor check. > 3898: assert(cast->in(0) != nullptr, ""); Suggestion: assert(cast->in(0) != nullptr, "All RangeCheck CastII must have a control dependency"); test/hotspot/jtreg/compiler/rangechecks/TestArrayAccessAboveRCAfterRCCastIIEliminated.java line 39: > 37: * -XX:+StressIGVN -XX:StressSeed=94546681 TestArrayAccessAboveRCAfterRCCastIIEliminated > 38: * > 39: */ Can you please add a "vanilla" run like this: `@run main TestArrayAccessAboveRCAfterRCCastIIEliminated` That would allow us to run the test with any flag combination from the outside.
test/hotspot/jtreg/compiler/rangechecks/TestArrayAccessAboveRCAfterRCCastIIEliminated.java line 41: > 39: */ > 40: > 41: public class TestArrayAccessAboveRCAfterRCCastIIEliminated { It seems this test is not in a package. Is this on purpose? ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18377#pullrequestreview-1965886193 PR Review Comment: https://git.openjdk.org/jdk/pull/18377#discussion_r1542784125 PR Review Comment: https://git.openjdk.org/jdk/pull/18377#discussion_r1542784375 PR Review Comment: https://git.openjdk.org/jdk/pull/18377#discussion_r1542774195 PR Review Comment: https://git.openjdk.org/jdk/pull/18377#discussion_r1542786077 PR Review Comment: https://git.openjdk.org/jdk/pull/18377#discussion_r1542786478 From epeter at openjdk.org Thu Mar 28 12:00:37 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 28 Mar 2024 12:00:37 GMT Subject: RFR: 8324517: C2: crash in compiled code because of dependency on removed range check CastIIs In-Reply-To: References: Message-ID: <89wFzkpUcY3PKi_ypzouWAXDEa1iV35rq_nOqsOS62o=.b9af444c-e2b2-4621-b015-ae276c921542@github.com> On Thu, 28 Mar 2024 11:54:47 GMT, Emanuel Peter wrote: >> Range check `CastII` nodes are removed once loop opts are over. The >> test case for this change includes 3 cases where elimination of a >> range check `CastII` causes a crash in compiled code because either a >> out of bounds array load or a division by zero happen. >> >> In `test1`: >> >> - the range checks for the `array[otherArray.length]` loads constant >> fold: `otherArray.length` is a `CastII` of i at the `otherArray` >> allocation. `i` is less than 9. The `CastII` at the allocation >> narrows the type down further to `[0-9]`. >> >> - the `array[otherArray.length]` loads are control dependent on the >> unrelated: >> >> >> if (flag == 0) { >> >> >> test. There's an identical dominating test which replaces that one. 
As >> a consequence, the `array[otherArray.length]` loads become control >> dependent on the dominating test. >> >> - The `CastII` nodes at the `otherArray` allocations are replaced by >> dominating range check `CastII` nodes for: >> >> >> newArray[i] = 42; >> >> >> - After loop opts, the range check `CastII` nodes are removed and the >> 2 `array[otherArray.length]` loads common at the first: >> >> >> if (flag == 0) { >> >> >> test before the: >> >> >> float[] otherArray = new float[i]; >> >> >> and >> >> >> newArray[i] = 42; >> >> >> that guarantee `i` is positive. >> >> - `test1` is called with `i = -1`, the array load proceeds with an out >> of bounds index and the crash occurs. >> >> >> `test2` and `test3` are mostly identical except for the check that's >> eliminated (a null divisor check) and the instruction that causes a >> fault (an integer division). >> >> The fix I propose is to not eliminate range check `CastII` nodes after >> loop opts. When range check `CastII` nodes were introduced, performance >> was observed to regress. Removing them after loop opts was found to >> preserve both correctness and performance. Today, the performance >> regression still exists when `CastII` nodes are left in. So I propose >> we keep them until the end of optimizations (so the 2 array loads >> above don't lose a dependency and wrongly common) but remove them at >> the end of all optimizations. >> >> In the case of the array loads, they are dependent on a range check >> for another array through a range check `CastII` and we must not lose >> that dependency, otherwise the array loads could float above the range >> check at gcm time. I propose we deal with that problem the way it's >> handled for `CastPP` nodes: add the dependency to the load (or >> division) nodes ...
> > test/hotspot/jtreg/compiler/rangechecks/TestArrayAccessAboveRCAfterRCCastIIEliminated.java line 39: > >> 37: * -XX:+StressIGVN -XX:StressSeed=94546681 TestArrayAccessAboveRCAfterRCCastIIEliminated >> 38: * >> 39: */ > > Can you please add a "vanilla" run like this: > `@run main TestArrayAccessAboveRCAfterRCCastIIEliminated` > That would allow us to run the test with any flag combination from the outside. And maybe one that only does `-XX:CompileCommand=dontinline,TestArrayAccessAboveRCAfterRCCastIIEliminated::notInlined`, since that seems important for reproducing the interesting patterns. > test/hotspot/jtreg/compiler/rangechecks/TestArrayAccessAboveRCAfterRCCastIIEliminated.java line 41: > >> 39: */ >> 40: >> 41: public class TestArrayAccessAboveRCAfterRCCastIIEliminated { > > It seems this test is not in a package. Is this on purpose? Not sure if that is important, but it seems most other tests are in a package ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18377#discussion_r1542789546 PR Review Comment: https://git.openjdk.org/jdk/pull/18377#discussion_r1542786857 From simonis at openjdk.org Thu Mar 28 12:08:01 2024 From: simonis at openjdk.org (Volker Simonis) Date: Thu, 28 Mar 2024 12:08:01 GMT Subject: RFR: 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 [v4] In-Reply-To: References: Message-ID: > Since [JDK-8251462: Simplify compilation policy](https://bugs.openjdk.org/browse/JDK-8251462), introduced in JDK 17, no native wrappers are generated any more if running with `-XX:-TieredCompilation` (i.e. native methods are not compiled any more). 
> > The attached JMH benchmark demonstrates that native method calls became twice as expensive with JDK 17: > > public static native void emptyStaticNativeMethod(); > > @Benchmark > public static void baseline() { > } > > @Benchmark > public static void staticMethodCallingStatic() { > emptyStaticMethod(); > } > > @Benchmark > public static void staticMethodCallingStaticNative() { > emptyStaticNativeMethod(); > } > > @Benchmark > @Fork(jvmArgsAppend = "-XX:-TieredCompilation") > public static void staticMethodCallingStaticNativeNoTiered() { > emptyStaticNativeMethod(); > } > > @Benchmark > @Fork(jvmArgsAppend = "-XX:+PreferInterpreterNativeStubs") > public static void staticMethodCallingStaticNativeIntStub() { > emptyStaticNativeMethod(); > } > > > JDK 11 > ====== > > Benchmark Mode Cnt Score Error Units > NativeCall.baseline avgt 5 0.390 ± 0.016 ns/op > NativeCall.staticMethodCallingStatic avgt 5 1.693 ± 0.053 ns/op > NativeCall.staticMethodCallingStaticNative avgt 5 10.287 ± 0.754 ns/op > NativeCall.staticMethodCallingStaticNativeNoTiered avgt 5 9.966 ± 0.248 ns/op > NativeCall.staticMethodCallingStaticNativeIntStub avgt 5 20.384 ± 0.444 ns/op > > > JDK 17 & 21 > =========== > > Benchmark Mode Cnt Score Error Units > NativeCall.baseline avgt 5 0.390 ± 0.017 ns/op > NativeCall.staticMethodCallingStatic avgt 5 1.852 ± 0.272 ns/op > NativeCall.staticMethodCallingStaticNative avgt 5 10.648 ± 0.661 ns/op > NativeCall.staticMethodCallingStaticNativeNoTiered avgt 5 20.657 ± 1.084 ns/op > NativeCall.staticMethodCallingStaticNativeIntStub avgt 5 22.429 ± 0.991 ns/op > > > The issue can be seen if we run with `-XX:+PrintCompilation -XX:+PrintInlining`. With JDK 11 we get the following output for `-XX:+TieredCompilation`: > > 172 111 b 3 io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes) > @ 0 io.simonis.NativeCall::emptyStaticNativeMethod (0 bytes) native method > 172 112 n 0 io.simonis.NativeCall::emptyStaticNativeMethod (native...
Volker Simonis has updated the pull request incrementally with one additional commit since the last revision: Compile native methods like trivial methods and added JTreg test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18496/files - new: https://git.openjdk.org/jdk/pull/18496/files/5b017d59..3324731a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18496&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18496&range=02-03 Stats: 119 lines in 3 files changed: 114 ins; 4 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18496.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18496/head:pull/18496 PR: https://git.openjdk.org/jdk/pull/18496 From simonis at openjdk.org Thu Mar 28 12:08:01 2024 From: simonis at openjdk.org (Volker Simonis) Date: Thu, 28 Mar 2024 12:08:01 GMT Subject: RFR: 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 [v3] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 09:59:51 GMT, Volker Simonis wrote: >> Since [JDK-8251462: Simplify compilation policy](https://bugs.openjdk.org/browse/JDK-8251462), introduced in JDK 17, no native wrappers are generated any more if running with `-XX:-TieredCompilation` (i.e. native methods are not compiled any more). 
>> >> The attached JMH benchmark demonstrates that native method calls became twice as expensive with JDK 17: >> >> public static native void emptyStaticNativeMethod(); >> >> @Benchmark >> public static void baseline() { >> } >> >> @Benchmark >> public static void staticMethodCallingStatic() { >> emptyStaticMethod(); >> } >> >> @Benchmark >> public static void staticMethodCallingStaticNative() { >> emptyStaticNativeMethod(); >> } >> >> @Benchmark >> @Fork(jvmArgsAppend = "-XX:-TieredCompilation") >> public static void staticMethodCallingStaticNativeNoTiered() { >> emptyStaticNativeMethod(); >> } >> >> @Benchmark >> @Fork(jvmArgsAppend = "-XX:+PreferInterpreterNativeStubs") >> public static void staticMethodCallingStaticNativeIntStub() { >> emptyStaticNativeMethod(); >> } >> >> >> JDK 11 >> ====== >> >> Benchmark Mode Cnt Score Error Units >> NativeCall.baseline avgt 5 0.390 ± 0.016 ns/op >> NativeCall.staticMethodCallingStatic avgt 5 1.693 ± 0.053 ns/op >> NativeCall.staticMethodCallingStaticNative avgt 5 10.287 ± 0.754 ns/op >> NativeCall.staticMethodCallingStaticNativeNoTiered avgt 5 9.966 ± 0.248 ns/op >> NativeCall.staticMethodCallingStaticNativeIntStub avgt 5 20.384 ± 0.444 ns/op >> >> >> JDK 17 & 21 >> =========== >> >> Benchmark Mode Cnt Score Error Units >> NativeCall.baseline avgt 5 0.390 ± 0.017 ns/op >> NativeCall.staticMethodCallingStatic avgt 5 1.852 ± 0.272 ns/op >> NativeCall.staticMethodCallingStaticNative avgt 5 10.648 ± 0.661 ns/op >> NativeCall.staticMethodCallingStaticNativeNoTiered avgt 5 20.657 ± 1.084 ns/op >> NativeCall.staticMethodCallingStaticNativeIntStub avgt 5 22.429 ± 0.991 ns/op >> >> >> The issue can be seen if we run with `-XX:+PrintCompilation -XX:+PrintInlining`. With JDK 11 we get the following output for `-XX:+TieredCompilation`: >> >> 172 111 b 3 io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes) >> @ 0 io.simonis.NativeCall::emptyStaticNa...
> > Volker Simonis has updated the pull request incrementally with one additional commit since the last revision: > > Fix indentation > > I think it will work, but wouldn't that instantly create a native wrapper for _every_ native method? Shouldn't we only create native wrappers for hot native methods? > > To get here it will have to be already relatively hot. And since we don't need to profile it I think it would be fair to treat it exactly the same way we treat trivial methods. OK, then better early than never :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/18496#issuecomment-2025024101 From chagedorn at openjdk.org Thu Mar 28 12:19:34 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 28 Mar 2024 12:19:34 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If [v2] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 16:13:35 GMT, Emanuel Peter wrote: >> Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: >> >> - Change from DFS to 2xBFS >> - Review Emanuel first part >> - Merge branch 'master' into JDK-8327110 >> - 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix pure cloning cases used for Loop Unswitching and Split If > > test/hotspot/jtreg/compiler/predicates/TestCloningWithManyDiamondsInExpression.java line 28: > >> 26: * @test >> 27: * @bug 8327110 >> 28: * @requires vm.compiler2.enabled > > Is this required? > Maybe we can move the ` * @run main compiler.predicates.TestCloningWithManyDiamondsInExpression` run to a new `@test` block that does not have this `@requires`? Then other compilers also get tested. 
As discussed offline, I'm planning to do that but the test case would currently fail and is only gonna work once the next PR is in (JDK-8327111). I will update the test with that PR. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1542851376 From chagedorn at openjdk.org Thu Mar 28 12:25:32 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 28 Mar 2024 12:25:32 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If [v2] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 16:05:10 GMT, Emanuel Peter wrote: > What scenario is there where we collect no nodes? Probably there is some, where we don't have to clone anything.. but it's a bit strange. Would be nice to have a quick comment here about that. Sometimes, we want to assert that a node does not represent a Template Assertion Predicate anymore but is actually an initialized Assertion Predicate (without `OpaqueLoop*Nodes`). Example: https://github.com/openjdk/jdk/blob/2af0312c958e693b1377f4c014ae8f84cabf6b83/src/hotspot/share/opto/loopTransform.cpp#L3044-L3047 Therefore, I'm planning to re-use this class to also verify the absence of `OpaqueLoop*Nodes` in a later PR. > For simplicity, you could always call backtrack (it just does nothing anyway). Then you can just make the assert a bit smarter: assert(_collected_nodes.size() == 0 || _collected_nodes.member(start_node), "must find start node again when backtracking"); Good point, I will do that. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18293#discussion_r1542862135 From chagedorn at openjdk.org Thu Mar 28 12:32:59 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 28 Mar 2024 12:32:59 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If [v3] In-Reply-To: References: Message-ID: <8UdoeQB0Qz7Lzb-SZeOpf8V9IyXcmeKKyOHzQz0E5GE=.9550067d-93fa-4915-a06c-cbba220f2893@github.com> > This is a follow-up to the previous refactoring done in https://github.com/openjdk/jdk/pull/18080. The patch starts to replace the usages of `create_bool_from_template_assertion_predicate()` by providing a refactored and fixed cloning algorithm. > > #### How `create_bool_from_template_assertion_predicate()` Works > Currently, the algorithm in `create_bool_from_template_assertion_predicate()` uses an iterative DFS walk to find all nodes of a Template Assertion Predicate Expression in order to clone them. We do the following: > 1. Follow all inputs if they could be a node that's part of a Template Assertion Predicate (compares opcodes): > https://github.com/openjdk/jdk/blob/326c91e1a28ec70822ef927ee9ab17f79aa6d35c/src/hotspot/share/opto/loopTransform.cpp#L1513 > > 2. Once we find an `OpaqueLoopInit` or `OpaqueLoopStride` node, we start backtracking in the DFS. While doing so, we start to clone all nodes on the path from the `OpaqueLoop*Nodes` node to the start node and already update the graph. This logic is quite complex and difficult to understand since we do everything simultaneously. This was one of the reasons, I've originally tried to refactor this method in https://github.com/openjdk/jdk/pull/16877 because I needed to extend it for the full fix of Assertion Predicates in JDK-8288981. > > #### Missing Visited Set > The current implementation of `create_bool_from_template_assertion_predicate()` does not use a visited set. 
This means that whenever we find a diamond shape, we could visit a node twice and re-discover all paths above this diamond again: > > > ... > | > E > | > D > / \ > B C > \ / > A > > DFS walk: A -> B -> D -> E -> ... -> C -> D -> E -> ... > > With each diamond, the number of revisits of each node above doubles. > > #### Endless DFS in Edge-Cases > In most cases, we would normally just stop quite quickly once we follow a data node that is not part of a Template Assertion Predicate Expression because the node opcode is different. However, in the test cases, we create a long chain of data nodes with many diamonds that could all be part of a Template Assertion Predicate Expression (i.e. `is_part_of_template_assertion_predicate_bool()` would return true to follow the inputs in a DFS walk). As a result, the DFS revisits a lot of nodes, especially higher up in the graph, exponentially many times and compilation is stuck for a long time (running the test cases result in a test timeout because background compilation is disabled). > > #### New DFS Implem... 
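The doubling per diamond described above is easy to demonstrate in isolation. A small self-contained sketch (illustrative only, not the C2 data structures) counts node visits on a chain of 10 stacked diamonds, with and without a visited set:

```java
// DFS over a chain of diamonds: without a visited set the visit count
// grows exponentially in the number of diamonds; with one it is linear.
import java.util.*;

public class DiamondDFS {
    static Map<String, List<String>> inputs = new HashMap<>();
    static long visitsNoSet = 0;

    static void dfsNoVisited(String n) {
        visitsNoSet++;
        for (String in : inputs.getOrDefault(n, List.<String>of())) dfsNoVisited(in);
    }

    static long dfsVisited(String n, Set<String> seen) {
        if (!seen.add(n)) return 0;
        long c = 1;
        for (String in : inputs.getOrDefault(n, List.<String>of())) c += dfsVisited(in, seen);
        return c;
    }

    public static void main(String[] args) {
        int k = 10;   // k stacked diamonds: Ai -> (Bi, Ci) -> A(i+1)
        for (int i = 0; i < k; i++) {
            inputs.put("A" + i, List.of("B" + i, "C" + i));
            inputs.put("B" + i, List.of("A" + (i + 1)));
            inputs.put("C" + i, List.of("A" + (i + 1)));
        }
        dfsNoVisited("A0");
        long withSet = dfsVisited("A0", new HashSet<>());
        System.out.println(visitsNoSet + " " + withSet);  // 4093 visits vs 31 nodes
    }
}
```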
Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Moved comment + better assert ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18293/files - new: https://git.openjdk.org/jdk/pull/18293/files/dbd0caba..ffabaa4d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18293&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18293&range=01-02 Stats: 8 lines in 1 file changed: 2 ins; 1 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/18293.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18293/head:pull/18293 PR: https://git.openjdk.org/jdk/pull/18293 From roland at openjdk.org Thu Mar 28 13:37:32 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 28 Mar 2024 13:37:32 GMT Subject: RFR: 8324517: C2: crash in compiled code because of dependency on removed range check CastIIs In-Reply-To: References: Message-ID: On Thu, 28 Mar 2024 11:53:02 GMT, Emanuel Peter wrote: >> Range check `CastII` nodes are removed once loop opts are over. The >> test case for this change includes 3 cases where elimination of a >> range check `CastII` causes a crash in compiled code because either a >> out of bounds array load or a division by zero happen. >> >> In `test1`: >> >> - the range checks for the `array[otherArray.length]` loads constant >> fold: `otherArray.length` is a `CastII` of i at the `otherArray` >> allocation. `i` is less than 9. The `CastII` at the allocation >> narrows the type down further to `[0-9]`. >> >> - the `array[otherArray.length]` loads are control dependent on the >> unrelated: >> >> >> if (flag == 0) { >> >> >> test. There's an identical dominating test which replaces that one. As >> a consequence, the `array[otherArray.length]` loads become control >> dependent on the dominating test. 
>> >> - The `CastII` nodes at the `otherArray` allocations are replaced by a >> dominating range check `CastII` nodes for: >> >> >> newArray[i] = 42; >> >> >> - After loop opts, the range check `CastII` nodes are removed and the >> 2 `array[otherArray.length]` loads common at the first: >> >> >> if (flag == 0) { >> >> >> test before the: >> >> >> float[] otherArray = new float[i]; >> >> >> and >> >> >> newArray[i] = 42; >> >> >> that guarantee `i` is positive. >> >> - `test1` is called with `i = -1`, the array load proceeds with an out >> of bounds index and the crash occurs. >> >> >> `test2` and `test3` are mostly identical except for the check that's >> eliminated (a null divisor check) and the instruction that causes a >> fault (an integer division). >> >> The fix I propose is to not eliminate range check `CastII` nodes after >> loop opts. When range check`CastII` nodes were introduced, performance >> was observed to regress. Removing them after loop opts was found to >> preserve both correctness and performance. Today, the performance >> regression still exists when `CastII` nodes are left in. So I propose >> we keep them until the end of optimizations (so the 2 array loads >> above don't lose a dependency and wrongly common) but remove them at >> the end of all optimizations. >> >> In the case of the array loads, they are dependent on a range check >> for another array through a range check `CastII` and we must not lose >> that dependency otherwise the array loads could float above the range >> check at gcm time. I propose we deal with that problem the way it's >> handled for `CastPP` nodes: add the dependency to the load (or >> division)nodes ... > > src/hotspot/share/opto/castnode.cpp line 222: > >> 220: if (!_range_check_dependency) { >> 221: res = widen_type(phase, res, T_INT); >> 222: } > > Can you explain why you changed this, and why it is ok? `ConvI2L` has a similar transformation. 
Let's say we have 2 `ConvI2L` nodes with identical inputs but different types: (ConvI2L _ input [2..max_jint]) (ConvI2L _ input [1..max_jint]) They are transformed to: (ConvI2L _ input [0..max_jint]) (ConvI2L _ input [0..max_jint]) so they can common. With range checks, the pattern is: (ConvI2L _ (CastII control input [2..max_jint]) [2..max_jint]) (ConvI2L _ (CastII control input [1..max_jint]) [1..max_jint]) Without this patch, the range check `CastII` nodes are removed after loop opts, so having the transformation be done only on `ConvI2L` nodes is sufficient. With this change, the `CastII` nodes are left in the IR, so they need to be transformed the same way: (ConvI2L _ (CastII control input [0..max_jint]) [0..max_jint]) (ConvI2L _ (CastII control input [0..max_jint]) [0..max_jint]) so they can common and the `ConvI2L` nodes then common. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18377#discussion_r1542993546 From roland at openjdk.org Thu Mar 28 13:41:33 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 28 Mar 2024 13:41:33 GMT Subject: RFR: 8324517: C2: crash in compiled code because of dependency on removed range check CastIIs In-Reply-To: References: Message-ID: On Thu, 28 Mar 2024 11:53:18 GMT, Emanuel Peter wrote: >> Range check `CastII` nodes are removed once loop opts are over. The >> test case for this change includes 3 cases where elimination of a >> range check `CastII` causes a crash in compiled code because either a >> out of bounds array load or a division by zero happen. >> >> In `test1`: >> >> - the range checks for the `array[otherArray.length]` loads constant >> fold: `otherArray.length` is a `CastII` of i at the `otherArray` >> allocation. `i` is less than 9. The `CastII` at the allocation >> narrows the type down further to `[0-9]`. >> >> - the `array[otherArray.length]` loads are control dependent on the >> unrelated: >> >> >> if (flag == 0) { >> >> >> test. There's an identical dominating test which replaces that one.
As >> a consequence, the `array[otherArray.length]` loads become control >> dependent on the dominating test. >> >> - The `CastII` nodes at the `otherArray` allocations are replaced by a >> dominating range check `CastII` nodes for: >> >> >> newArray[i] = 42; >> >> >> - After loop opts, the range check `CastII` nodes are removed and the >> 2 `array[otherArray.length]` loads common at the first: >> >> >> if (flag == 0) { >> >> >> test before the: >> >> >> float[] otherArray = new float[i]; >> >> >> and >> >> >> newArray[i] = 42; >> >> >> that guarantee `i` is positive. >> >> - `test1` is called with `i = -1`, the array load proceeds with an out >> of bounds index and the crash occurs. >> >> >> `test2` and `test3` are mostly identical except for the check that's >> eliminated (a null divisor check) and the instruction that causes a >> fault (an integer division). >> >> The fix I propose is to not eliminate range check `CastII` nodes after >> loop opts. When range check`CastII` nodes were introduced, performance >> was observed to regress. Removing them after loop opts was found to >> preserve both correctness and performance. Today, the performance >> regression still exists when `CastII` nodes are left in. So I propose >> we keep them until the end of optimizations (so the 2 array loads >> above don't lose a dependency and wrongly common) but remove them at >> the end of all optimizations. >> >> In the case of the array loads, they are dependent on a range check >> for another array through a range check `CastII` and we must not lose >> that dependency otherwise the array loads could float above the range >> check at gcm time. I propose we deal with that problem the way it's >> handled for `CastPP` nodes: add the dependency to the load (or >> division)nodes ... > > src/hotspot/share/opto/castnode.cpp line 240: > >> 238: phase->C->record_for_post_loop_opts_igvn(this); >> 239: } >> 240: if (!_type->is_int()->empty()) { > > Can you also explain this change, please? 
It's similar to the case above. `ConvI2L` has a similar transformation (`Add` node pushed through `ConvI2L`). For range checks, the `CastII` gets in the way, but it wasn't an issue without the patch I propose (the `CastII` ends up going away). Leaving the `CastII` in requires that the transformation happens on the `CastII` as well. The `!_type->is_int()->empty()` test doesn't replace the `!_range_check_dependency` test. I removed the second one, ran some testing, and then had a failure where `ConstraintCastNode::optimize_integer_cast()` crashes because it gets called with an empty type, so I added the test for an empty type. I don't remember the details of the failure other than it seemed like a corner case. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18377#discussion_r1543001759 From roland at openjdk.org Thu Mar 28 13:45:33 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 28 Mar 2024 13:45:33 GMT Subject: RFR: 8324517: C2: crash in compiled code because of dependency on removed range check CastIIs In-Reply-To: <89wFzkpUcY3PKi_ypzouWAXDEa1iV35rq_nOqsOS62o=.b9af444c-e2b2-4621-b015-ae276c921542@github.com> References: <89wFzkpUcY3PKi_ypzouWAXDEa1iV35rq_nOqsOS62o=.b9af444c-e2b2-4621-b015-ae276c921542@github.com> Message-ID: On Thu, 28 Mar 2024 11:55:28 GMT, Emanuel Peter wrote: >> test/hotspot/jtreg/compiler/rangechecks/TestArrayAccessAboveRCAfterRCCastIIEliminated.java line 41: >> >>> 39: */ >>> 40: >>> 41: public class TestArrayAccessAboveRCAfterRCCastIIEliminated { >> >> It seems this test is not in a package. Is this on purpose? > > Not sure if that is important, but it seems most other tests are in a package. I never put tests in a package. So if there's an issue with that, then there are many more tests to fix.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18377#discussion_r1543006212 From chagedorn at openjdk.org Thu Mar 28 13:48:32 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 28 Mar 2024 13:48:32 GMT Subject: RFR: 8328480: C2: SubTypeCheckNode in checkcast should use the klass constant of a unique concrete sub class In-Reply-To: References: <0y-TzdF1qxDiFz2ZyTYon6GVd_F5wN_bPdi4Evt8CPg=.b390ea96-81e5-49a4-8d21-9d983a7fb892@github.com> Message-ID: On Wed, 27 Mar 2024 16:47:05 GMT, Roland Westrelin wrote: >> I was surprised by that as well when I've run testing but `gen_checkcast()` is actually also called in `Parse::array_store_check()` which passes a `LoadKlassNode` which is not a constant: >> >> https://github.com/openjdk/jdk/blob/05854fd704cba6ebd73007d9547a064891d49587/src/hotspot/share/opto/parseHelper.cpp#L233-L238 >> >> It states that it ignores the result and just calls it for the CFG effects - does not sound like a very clean solution. > > I'm confused by this. Looking at `TypeInstKlassPtr::try_improve()` returns a constant only if it finds a unique subklass and that subklass is final. So it should not be uncommon to have an `improved_klass_ptr_type` that's not constant. That's true. I've missed that we could also improve non-constants like `LoadKlass`, for example. I'm not sure if we could also have other improved types here apart from constants or `LoadKlass` (from `array_store_check()`). The only other non-constant `superklass` is passed by `LibraryCallKit::inline_Class_cast()`. From there, we either get a constant or a `CastPPNode` from which I'm not sure if it can really be improved. Either way, I think we could improve the code like that to get an improved type regardless of the node, what do you think? 
if (improved_klass_ptr_type != klass_ptr_type) { if (improved_klass_ptr_type->singleton()) { improved_superklass = makecon(improved_klass_ptr_type); } else { superklass->raise_bottom_type(improved_klass_ptr_type); _gvn.set_type(superklass, improved_klass_ptr_type); } } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18515#discussion_r1543011083 From roland at openjdk.org Thu Mar 28 13:55:32 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 28 Mar 2024 13:55:32 GMT Subject: RFR: 8328480: C2: SubTypeCheckNode in checkcast should use the klass constant of a unique concrete sub class In-Reply-To: References: <0y-TzdF1qxDiFz2ZyTYon6GVd_F5wN_bPdi4Evt8CPg=.b390ea96-81e5-49a4-8d21-9d983a7fb892@github.com> Message-ID: On Thu, 28 Mar 2024 13:45:56 GMT, Christian Hagedorn wrote: >> I'm confused by this. Looking at `TypeInstKlassPtr::try_improve()` returns a constant only if it finds a unique subklass and that subklass is final. So it should not be uncommon to have an `improved_klass_ptr_type` that's not constant. > > That's true. I've missed that we could also improve non-constants like `LoadKlass`, for example. I'm not sure if we could also have other improved types here apart from constants or `LoadKlass` (from `array_store_check()`). The only other non-constant `superklass` is passed by `LibraryCallKit::inline_Class_cast()`. From there, we either get a constant or a `CastPPNode` from which I'm not sure if it can really be improved. > > Either way, I think we could improve the code like that to get an improved type regardless of the node, what do you think? > > > if (improved_klass_ptr_type != klass_ptr_type) { > if (improved_klass_ptr_type->singleton()) { > improved_superklass = makecon(improved_klass_ptr_type); > } else { > superklass->raise_bottom_type(improved_klass_ptr_type); > _gvn.set_type(superklass, improved_klass_ptr_type); > } > } That's what I was wondering too. 
Wouldn't casting `superklass` to `improved_klass_ptr_type` do all of this in a cleaner way? gvn would constant fold the result if it can. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18515#discussion_r1543022425 From epeter at openjdk.org Thu Mar 28 13:57:45 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 28 Mar 2024 13:57:45 GMT Subject: RFR: 8328938: C2 SuperWord: disable vectorization for large stride and scale [v2] In-Reply-To: <9oXO4yuvZbpAxofIUBGVwJ2WyBLPWcP2IHxqZg5nQNQ=.f8f9365c-56c5-4fa9-8075-880f432ac214@github.com> References: <9oXO4yuvZbpAxofIUBGVwJ2WyBLPWcP2IHxqZg5nQNQ=.f8f9365c-56c5-4fa9-8075-880f432ac214@github.com> Message-ID: <9fehmV7ApBB0zWUi5j9fXWGXBVo-0R_EugkCRJ1q2Sw=.1453b091-2231-41ab-9e11-095d3b73a103@github.com> > **Problem** > In [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190) / https://git.openjdk.org/jdk/pull/14785 I fixed the alignment with `AlignVector`. For that, I had to compute `abs(scale)` and `abs(stride)`, as well as `scale * stride`. The issue is that all of these values can overflow the int range (e.g. `abs(min_int) = min_int`). > > We hit asserts like: > > `# assert(is_power_of_2(value)) failed: value must be a power of 2: 0xffffffff80000000` > Happens because we take `abs(min_int)`, which is `min_int = 0x80000000`, and assuming this was a positive (unsigned) number is a power of 2 `2^31`. We then expand it to `long`, get `0xffffffff80000000`, which is not a power of 2 anymore. This violates the implicit assumptions, and we hit the assert. > > `# assert(q >= 1) failed: modulo value must be large enough` > We have `scale = 2^30` and `stride = 4 = 2^2`. For the alignment calculation we compute `scale * stride = 2^32`, which overflows the int range and becomes zero. 
> > Before [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190) we could get similar issues with the (old) code in `SuperWord::ref_is_alignable`, if `AlignVector` is enabled: > > > int span = preloop_stride * p.scale_in_bytes(); > ... > if (vw % span == 0) { > > > if `span == 0` because of overflow, then the `idiv` from the modulo gets a division by zero -> `SIGFPE`. > > But it seems the bug is possibly a regression from JDK20 b2 [JDK-8286197](https://bugs.openjdk.org/browse/JDK-8286197). Here we enabled certain Unsafe memory access address patterns, and it is such patterns that the reproducer requires. > > **Solution** > I could either patch up all the code that works with `scale` and `stride`, and make sure no overflows ever happen. But that is quite involved and error-prone. > > I now just disable vectorization for large `scale` and `stride`. This should not have any performance impact, because such large `scale` and `stride` would lead to highly inefficient memory accesses, since they are spaced very far apart. 
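The int overflows described in the quoted problem statement are easy to reproduce in plain Java (a standalone sketch for illustration, not HotSpot code; class and variable names are made up):

```java
public class OverflowDemo {
    public static void main(String[] args) {
        // abs(min_int) overflows: there is no positive int representation of -2^31,
        // so Math.abs returns Integer.MIN_VALUE unchanged.
        System.out.println(Math.abs(Integer.MIN_VALUE) == Integer.MIN_VALUE); // true

        // Widening the "absolute value" to long afterwards does not help:
        // the sign extension yields 0xffffffff80000000, which is not a power of two,
        // which is how the is_power_of_2 assert fires.
        long widened = (long) Math.abs(Integer.MIN_VALUE);
        System.out.println(Long.toHexString(widened)); // ffffffff80000000

        // scale * stride with scale = 2^30 and stride = 4 wraps to zero in int
        // arithmetic; a later "vw % span" with span == 0 then divides by zero.
        int scale = 1 << 30;
        int stride = 4;
        System.out.println(scale * stride); // 0
    }
}
```

This is why bounding `scale` and `stride` sidesteps the whole class of overflow cases instead of patching each computation individually.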
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: improve comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18485/files - new: https://git.openjdk.org/jdk/pull/18485/files/693ce84e..4d9e05d2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18485&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18485&range=00-01 Stats: 9 lines in 1 file changed: 6 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18485.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18485/head:pull/18485 PR: https://git.openjdk.org/jdk/pull/18485 From epeter at openjdk.org Thu Mar 28 13:59:31 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 28 Mar 2024 13:59:31 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If [v3] In-Reply-To: <8UdoeQB0Qz7Lzb-SZeOpf8V9IyXcmeKKyOHzQz0E5GE=.9550067d-93fa-4915-a06c-cbba220f2893@github.com> References: <8UdoeQB0Qz7Lzb-SZeOpf8V9IyXcmeKKyOHzQz0E5GE=.9550067d-93fa-4915-a06c-cbba220f2893@github.com> Message-ID: On Thu, 28 Mar 2024 12:32:59 GMT, Christian Hagedorn wrote: >> This is a follow-up to the previous refactoring done in https://github.com/openjdk/jdk/pull/18080. The patch starts to replace the usages of `create_bool_from_template_assertion_predicate()` by providing a refactored and fixed cloning algorithm. >> >> #### How `create_bool_from_template_assertion_predicate()` Works >> Currently, the algorithm in `create_bool_from_template_assertion_predicate()` uses an iterative DFS walk to find all nodes of a Template Assertion Predicate Expression in order to clone them. We do the following: >> 1. 
Follow all inputs if they could be a node that's part of a Template Assertion Predicate (compares opcodes): >> https://github.com/openjdk/jdk/blob/326c91e1a28ec70822ef927ee9ab17f79aa6d35c/src/hotspot/share/opto/loopTransform.cpp#L1513 >> >> 2. Once we find an `OpaqueLoopInit` or `OpaqueLoopStride` node, we start backtracking in the DFS. While doing so, we start to clone all nodes on the path from the `OpaqueLoop*Nodes` node to the start node and already update the graph. This logic is quite complex and difficult to understand since we do everything simultaneously. This was one of the reasons, I've originally tried to refactor this method in https://github.com/openjdk/jdk/pull/16877 because I needed to extend it for the full fix of Assertion Predicates in JDK-8288981. >> >> #### Missing Visited Set >> The current implementation of `create_bool_from_template_assertion_predicate()` does not use a visited set. This means that whenever we find a diamond shape, we could visit a node twice and re-discover all paths above this diamond again: >> >> >> ... >> | >> E >> | >> D >> / \ >> B C >> \ / >> A >> >> DFS walk: A -> B -> D -> E -> ... -> C -> D -> E -> ... >> >> With each diamond, the number of revisits of each node above doubles. >> >> #### Endless DFS in Edge-Cases >> In most cases, we would normally just stop quite quickly once we follow a data node that is not part of a Template Assertion Predicate Expression because the node opcode is different. However, in the test cases, we create a long chain of data nodes with many diamonds that could all be part of a Template Assertion Predicate Expression (i.e. `is_part_of_template_assertion_predicate_bool()` would return true to follow the inputs in a DFS walk). As a result, the DFS revisits a lot of nodes, especially higher up in the graph, exponentially many times and compilation is stuck for a long time (running the test cases result in a test timeout because... 
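The exponential blow-up from the missing visited set described above can be illustrated with a small standalone sketch (not the HotSpot code; node ids and method names are made up): a chain of diamonds where a DFS over the inputs without a visited set pops the topmost node once per bottom-to-top path, i.e. 2^k times for k diamonds, while a visited set keeps the walk linear.

```java
import java.util.*;

public class DiamondDfs {
    // node -> its inputs ("above" in the graph); builds a chain of k diamonds
    static Map<Integer, List<Integer>> diamonds(int k) {
        Map<Integer, List<Integer>> inputs = new HashMap<>();
        int id = 0; // node 0 is the bottom of the first diamond
        for (int i = 0; i < k; i++) {
            int a = id, b = id + 1, c = id + 2, d = id + 3;
            inputs.put(a, List.of(b, c)); // A -> B, C
            inputs.put(b, List.of(d));    // B -> D
            inputs.put(c, List.of(d));    // C -> D
            id = d;                       // D is the bottom of the next diamond
        }
        inputs.put(id, List.of());        // topmost node has no inputs
        return inputs;
    }

    // Iterative DFS from node 0; counts how often the top node is reached.
    static int visitsOfTop(Map<Integer, List<Integer>> inputs, int top, boolean useVisited) {
        Deque<Integer> stack = new ArrayDeque<>();
        Set<Integer> visited = new HashSet<>();
        stack.push(0);
        int topVisits = 0;
        while (!stack.isEmpty()) {
            int n = stack.pop();
            if (useVisited && !visited.add(n)) continue; // skip already-visited nodes
            if (n == top) topVisits++;
            for (int in : inputs.get(n)) stack.push(in);
        }
        return topVisits;
    }

    public static void main(String[] args) {
        int k = 10;
        Map<Integer, List<Integer>> g = diamonds(k);
        int top = 3 * k; // 3 new nodes per diamond, so the top node has id 3k
        System.out.println(visitsOfTop(g, top, false)); // 1024 = 2^10 revisits
        System.out.println(visitsOfTop(g, top, true));  // 1 visit
    }
}
```

With each added diamond the revisit count doubles, which matches the "number of revisits of each node above doubles" observation and explains the compile-time blow-up in the test cases.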
> > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Moved comment + better assert Thanks for the updates and explanation, looks good to me now ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18293#pullrequestreview-1966304759 From roland at openjdk.org Thu Mar 28 14:14:57 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 28 Mar 2024 14:14:57 GMT Subject: RFR: 8324517: C2: crash in compiled code because of dependency on removed range check CastIIs [v2] In-Reply-To: References: Message-ID: On Thu, 28 Mar 2024 11:57:59 GMT, Emanuel Peter wrote: > Thanks @rwestrel ! Generally makes sense, I have a few suggestions and questions. Thanks for reviewing this. > src/hotspot/share/opto/compile.cpp line 3898: > >> 3896: // as a precedence edge, so they can't float above the cast in case that cast's narrowed type helped eliminated a >> 3897: // range check or a null divisor check. >> 3898: assert(cast->in(0) != nullptr, ""); > > Suggestion: > > assert(cast->in(0) != nullptr, "All RangeCheck CastII must have a control dependency"); Done in new commit. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18377#issuecomment-2025282505 PR Review Comment: https://git.openjdk.org/jdk/pull/18377#discussion_r1543052231 From roland at openjdk.org Thu Mar 28 14:14:57 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 28 Mar 2024 14:14:57 GMT Subject: RFR: 8324517: C2: crash in compiled code because of dependency on removed range check CastIIs [v2] In-Reply-To: <89wFzkpUcY3PKi_ypzouWAXDEa1iV35rq_nOqsOS62o=.b9af444c-e2b2-4621-b015-ae276c921542@github.com> References: <89wFzkpUcY3PKi_ypzouWAXDEa1iV35rq_nOqsOS62o=.b9af444c-e2b2-4621-b015-ae276c921542@github.com> Message-ID: On Thu, 28 Mar 2024 11:56:53 GMT, Emanuel Peter wrote: >> test/hotspot/jtreg/compiler/rangechecks/TestArrayAccessAboveRCAfterRCCastIIEliminated.java line 39: >> >>> 37: * -XX:+StressIGVN -XX:StressSeed=94546681 TestArrayAccessAboveRCAfterRCCastIIEliminated >>> 38: * >>> 39: */ >> >> Can you please add a "vanilla" run like this: >> `@run main TestArrayAccessAboveRCAfterRCCastIIEliminated` >> That would allow us to run the test with any flag combination from the outside. > > And maybe one that only does `-XX:CompileCommand=dontinline,TestArrayAccessAboveRCAfterRCCastIIEliminated::notInlined`, since that seems important for reproducing the interesting patterns. Done in new commit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18377#discussion_r1543051958 From roland at openjdk.org Thu Mar 28 14:14:57 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 28 Mar 2024 14:14:57 GMT Subject: RFR: 8324517: C2: crash in compiled code because of dependency on removed range check CastIIs [v2] In-Reply-To: References: Message-ID: > Range check `CastII` nodes are removed once loop opts are over. The > test case for this change includes 3 cases where elimination of a > range check `CastII` causes a crash in compiled code because either a > out of bounds array load or a division by zero happen. 
> > In `test1`: > > - the range checks for the `array[otherArray.length]` loads constant > fold: `otherArray.length` is a `CastII` of i at the `otherArray` > allocation. `i` is less than 9. The `CastII` at the allocation > narrows the type down further to `[0-9]`. > > - the `array[otherArray.length]` loads are control dependent on the > unrelated: > > > if (flag == 0) { > > > test. There's an identical dominating test which replaces that one. As > a consequence, the `array[otherArray.length]` loads become control > dependent on the dominating test. > > - The `CastII` nodes at the `otherArray` allocations are replaced by a > dominating range check `CastII` nodes for: > > > newArray[i] = 42; > > > - After loop opts, the range check `CastII` nodes are removed and the > 2 `array[otherArray.length]` loads common at the first: > > > if (flag == 0) { > > > test before the: > > > float[] otherArray = new float[i]; > > > and > > > newArray[i] = 42; > > > that guarantee `i` is positive. > > - `test1` is called with `i = -1`, the array load proceeds with an out > of bounds index and the crash occurs. > > > `test2` and `test3` are mostly identical except for the check that's > eliminated (a null divisor check) and the instruction that causes a > fault (an integer division). > > The fix I propose is to not eliminate range check `CastII` nodes after > loop opts. When range check`CastII` nodes were introduced, performance > was observed to regress. Removing them after loop opts was found to > preserve both correctness and performance. Today, the performance > regression still exists when `CastII` nodes are left in. So I propose > we keep them until the end of optimizations (so the 2 array loads > above don't lose a dependency and wrongly common) but remove them at > the end of all optimizations. 
> > In the case of the array loads, they are dependent on a range check > for another array through a range check `CastII` and we must not lose > that dependency otherwise the array loads could float above the range > check at gcm time. I propose we deal with that problem the way it's > handled for `CastPP` nodes: add the dependency to the load (or > division)nodes as a precedence edge when the cast is removed. > > @TobiHartmann ran performance testing for that patch (Thanks!) and reported > no regression. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - review - Merge branch 'master' into JDK-8324517 - test and fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18377/files - new: https://git.openjdk.org/jdk/pull/18377/files/182cd4fc..0de61cbc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18377&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18377&range=00-01 Stats: 373272 lines in 3153 files changed: 27823 ins; 20294 del; 325155 mod Patch: https://git.openjdk.org/jdk/pull/18377.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18377/head:pull/18377 PR: https://git.openjdk.org/jdk/pull/18377 From chagedorn at openjdk.org Thu Mar 28 15:06:33 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 28 Mar 2024 15:06:33 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If [v3] In-Reply-To: <8UdoeQB0Qz7Lzb-SZeOpf8V9IyXcmeKKyOHzQz0E5GE=.9550067d-93fa-4915-a06c-cbba220f2893@github.com> References: <8UdoeQB0Qz7Lzb-SZeOpf8V9IyXcmeKKyOHzQz0E5GE=.9550067d-93fa-4915-a06c-cbba220f2893@github.com> Message-ID: On Thu, 28 Mar 2024 12:32:59 GMT, Christian Hagedorn wrote: >> This is a follow-up to 
the previous refactoring done in https://github.com/openjdk/jdk/pull/18080. The patch starts to replace the usages of `create_bool_from_template_assertion_predicate()` by providing a refactored and fixed cloning algorithm. >> >> #### How `create_bool_from_template_assertion_predicate()` Works >> Currently, the algorithm in `create_bool_from_template_assertion_predicate()` uses an iterative DFS walk to find all nodes of a Template Assertion Predicate Expression in order to clone them. We do the following: >> 1. Follow all inputs if they could be a node that's part of a Template Assertion Predicate (compares opcodes): >> https://github.com/openjdk/jdk/blob/326c91e1a28ec70822ef927ee9ab17f79aa6d35c/src/hotspot/share/opto/loopTransform.cpp#L1513 >> >> 2. Once we find an `OpaqueLoopInit` or `OpaqueLoopStride` node, we start backtracking in the DFS. While doing so, we start to clone all nodes on the path from the `OpaqueLoop*Nodes` node to the start node and already update the graph. This logic is quite complex and difficult to understand since we do everything simultaneously. This was one of the reasons, I've originally tried to refactor this method in https://github.com/openjdk/jdk/pull/16877 because I needed to extend it for the full fix of Assertion Predicates in JDK-8288981. >> >> #### Missing Visited Set >> The current implementation of `create_bool_from_template_assertion_predicate()` does not use a visited set. This means that whenever we find a diamond shape, we could visit a node twice and re-discover all paths above this diamond again: >> >> >> ... >> | >> E >> | >> D >> / \ >> B C >> \ / >> A >> >> DFS walk: A -> B -> D -> E -> ... -> C -> D -> E -> ... >> >> With each diamond, the number of revisits of each node above doubles. >> >> #### Endless DFS in Edge-Cases >> In most cases, we would normally just stop quite quickly once we follow a data node that is not part of a Template Assertion Predicate Expression because the node opcode is different. 
However, in the test cases, we create a long chain of data nodes with many diamonds that could all be part of a Template Assertion Predicate Expression (i.e. `is_part_of_template_assertion_predicate_bool()` would return true to follow the inputs in a DFS walk). As a result, the DFS revisits a lot of nodes, especially higher up in the graph, exponentially many times and compilation is stuck for a long time (running the test cases result in a test timeout because... > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Moved comment + better assert Thanks a lot Emanuel for your careful review and the offline discussions! :-) ------------- PR Comment: https://git.openjdk.org/jdk/pull/18293#issuecomment-2025447673 From chagedorn at openjdk.org Thu Mar 28 15:11:01 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 28 Mar 2024 15:11:01 GMT Subject: RFR: 8328480: C2: SubTypeCheckNode in checkcast should use the klass constant of a unique concrete sub class [v2] In-Reply-To: References: <0y-TzdF1qxDiFz2ZyTYon6GVd_F5wN_bPdi4Evt8CPg=.b390ea96-81e5-49a4-8d21-9d983a7fb892@github.com> Message-ID: On Thu, 28 Mar 2024 13:53:24 GMT, Roland Westrelin wrote: >> That's true. I've missed that we could also improve non-constants like `LoadKlass`, for example. I'm not sure if we could also have other improved types here apart from constants or `LoadKlass` (from `array_store_check()`). The only other non-constant `superklass` is passed by `LibraryCallKit::inline_Class_cast()`. From there, we either get a constant or a `CastPPNode` from which I'm not sure if it can really be improved. >> >> Either way, I think we could improve the code like that to get an improved type regardless of the node, what do you think? 
>> >> >> if (improved_klass_ptr_type != klass_ptr_type) { >> if (improved_klass_ptr_type->singleton()) { >> improved_superklass = makecon(improved_klass_ptr_type); >> } else { >> superklass->raise_bottom_type(improved_klass_ptr_type); >> _gvn.set_type(superklass, improved_klass_ptr_type); >> } >> } > > That's what I was wondering too. Wouldn't casting `superklass` to `improved_klass_ptr_type` do all of this in a cleaner way? gvn would constant fold the result if it can. That's even better and easier. I've pushed an update. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18515#discussion_r1543141166 From chagedorn at openjdk.org Thu Mar 28 15:11:01 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 28 Mar 2024 15:11:01 GMT Subject: RFR: 8328480: C2: SubTypeCheckNode in checkcast should use the klass constant of a unique concrete sub class [v2] In-Reply-To: References: Message-ID: > While working on a [Valhalla bug](https://bugs.openjdk.org/browse/JDK-8321734), I've noticed that a `SubTypeCheckNode` for a `checkcast` does not take a unique concrete sub class `X` of an abstract class `A` as klass constant in the sub type check. Instead, it uses the abstract klass constant: > > > abstract class A {} > class X extends A {} > > A x = (A)object; // Emits SubTypeCheckNode(object, A), but could have used X instead of A. > > However, the `CheckCastPP` result already uses the improved instance type ptr `X` (i.e. `toop` which was improved from `A` by calling `try_improve()` to get the unique concrete sub class): > https://github.com/openjdk/jdk/blob/614db2ea9e10346475eef34629eab54878aa482d/src/hotspot/share/opto/graphKit.cpp#L3257-L3261 > https://github.com/openjdk/jdk/blob/614db2ea9e10346475eef34629eab54878aa482d/src/hotspot/share/opto/graphKit.cpp#L3363 > > We should also plug in a unique concrete sub class constant in the `SubTypeCheckNode` which could be beneficial to fold away redundant sub type checks (see test cases). 
> > This fix is required to completely fix the bug in Valhalla (this is only one of the broken cases). In Valhalla, the graph ends up being broken because a `CheckCastPP` node is folded because of an impossible type but the `SubTypeCheckNode` is not due to not using the improved unique concrete sub class constant for the `checkcast`. I don't think that there is currently a bug in mainline because of this limitation - it just blocks some optimizations. I'm therefore upstreaming this fix to mainline since it can be beneficial to have this fix here as well (see test cases). > > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: using improved type for non-constants ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18515/files - new: https://git.openjdk.org/jdk/pull/18515/files/26a717db..660c0dec Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18515&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18515&range=00-01 Stats: 6 lines in 1 file changed: 4 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18515.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18515/head:pull/18515 PR: https://git.openjdk.org/jdk/pull/18515 From roland at openjdk.org Thu Mar 28 15:16:33 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 28 Mar 2024 15:16:33 GMT Subject: RFR: 8328480: C2: SubTypeCheckNode in checkcast should use the klass constant of a unique concrete sub class [v2] In-Reply-To: References: Message-ID: On Thu, 28 Mar 2024 15:11:01 GMT, Christian Hagedorn wrote: >> While working on a [Valhalla bug](https://bugs.openjdk.org/browse/JDK-8321734), I've noticed that a `SubTypeCheckNode` for a `checkcast` does not take a unique concrete sub class `X` of an abstract class `A` as klass constant in the sub type check. 
Instead, it uses the abstract klass constant: >> >> >> abstract class A {} >> class X extends A {} >> >> A x = (A)object; // Emits SubTypeCheckNode(object, A), but could have used X instead of A. >> >> However, the `CheckCastPP` result already uses the improved instance type ptr `X` (i.e. `toop` which was improved from `A` by calling `try_improve()` to get the unique concrete sub class): >> https://github.com/openjdk/jdk/blob/614db2ea9e10346475eef34629eab54878aa482d/src/hotspot/share/opto/graphKit.cpp#L3257-L3261 >> https://github.com/openjdk/jdk/blob/614db2ea9e10346475eef34629eab54878aa482d/src/hotspot/share/opto/graphKit.cpp#L3363 >> >> We should also plug in a unique concrete sub class constant in the `SubTypeCheckNode` which could be beneficial to fold away redundant sub type checks (see test cases). >> >> This fix is required to completely fix the bug in Valhalla (this is only one of the broken cases). In Valhalla, the graph ends up being broken because a `CheckCastPP` node is folded because of an impossible type but the `SubTypeCheckNode` is not due to not using the improved unique concrete sub class constant for the `checkcast`. I don't think that there is currently a bug in mainline because of this limitation - it just blocks some optimizations. I'm therefore upstreaming this fix to mainline since it can be beneficial to have this fix here as well (see test cases). >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > using improved type for non-constants src/hotspot/share/opto/graphKit.cpp line 3364: > 3362: if (improved_klass_ptr_type != klass_ptr_type) { > 3363: if (improved_klass_ptr_type->singleton()) { > 3364: improved_superklass = makecon(improved_klass_ptr_type); Do you really need to special case that one? Wouldn't the CastPP constant fold if `improved_klass_ptr_type` is a singleton? 
Also there is the question of whether a `CastPP` or a `CheckCastPP` should be used here and if it matters. They don't differ that much anymore so I suppose `CastPP` is ok.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18515#discussion_r1543151122 From jkarthikeyan at openjdk.org Thu Mar 28 15:18:40 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 28 Mar 2024 15:18:40 GMT Subject: RFR: 8329194: Cleanup Type::cmp definition and usage Message-ID: 

Hi all, this patch aims to clean up `Type::cmp` by changing it from returning `0` when types are equal and `1` when they are not, to returning a boolean denoting equality. This makes its usages at various call sites more intuitive. However, as it is passed to the type dictionary as a comparator, a lambda is needed to map the boolean to a comparison value. I was also considering changing the name to `Type::equals` as it's not really returning a comparison value anymore, but I felt it would be too similar to `Type::eq`. If this would be preferred though, I can change it. Tier 1 testing passes on my machine. Reviews and thoughts would be appreciated!
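The shape of the change described above can be sketched outside of HotSpot. Below is a minimal Java analogy with illustrative names (`Ty`, `cmp`, `dictCmp` are stand-ins, not the actual C2 `Type` class, which is C++): the equality check becomes a boolean predicate, and a small lambda adapts it back to the 0/1 comparator the type dictionary expects.

```java
import java.util.function.ToIntBiFunction;

public class TypeCmpSketch {
    // Illustrative stand-in for a C2 type; not the real HotSpot Type class.
    record Ty(int base, int hash) {}

    // Before: a comparator-style cmp returning 0 for "equal" and 1 otherwise.
    static int cmpOld(Ty t1, Ty t2) {
        return t1.equals(t2) ? 0 : 1;
    }

    // After: a boolean predicate that reads naturally at call sites.
    static boolean cmp(Ty t1, Ty t2) {
        return t1.equals(t2);
    }

    public static void main(String[] args) {
        Ty a = new Ty(1, 7), b = new Ty(1, 7), c = new Ty(2, 7);

        // Call sites become "if (cmp(a, b))" instead of "if (cmp(a, b) == 0)".
        if (!cmp(a, b) || cmp(a, c)) throw new AssertionError();

        // A dictionary that still expects a 0/1 comparator gets a small
        // lambda adapter mapping the boolean back to a comparison value.
        ToIntBiFunction<Ty, Ty> dictCmp = (x, y) -> cmp(x, y) ? 0 : 1;
        if (dictCmp.applyAsInt(a, b) != 0) throw new AssertionError();
        if (dictCmp.applyAsInt(a, c) != 1) throw new AssertionError();
        System.out.println("ok");
    }
}
```

The lambda keeps the dictionary interface unchanged while the rest of the call sites get the more intuitive boolean form.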
------------- Commit messages: - Cleanup Type::cmp Changes: https://git.openjdk.org/jdk/pull/18533/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18533&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8329194 Stats: 34 lines in 9 files changed: 6 ins; 0 del; 28 mod Patch: https://git.openjdk.org/jdk/pull/18533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18533/head:pull/18533 PR: https://git.openjdk.org/jdk/pull/18533 From chagedorn at openjdk.org Thu Mar 28 15:19:33 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 28 Mar 2024 15:19:33 GMT Subject: RFR: 8328938: C2 SuperWord: disable vectorization for large stride and scale [v2] In-Reply-To: <9fehmV7ApBB0zWUi5j9fXWGXBVo-0R_EugkCRJ1q2Sw=.1453b091-2231-41ab-9e11-095d3b73a103@github.com> References: <9oXO4yuvZbpAxofIUBGVwJ2WyBLPWcP2IHxqZg5nQNQ=.f8f9365c-56c5-4fa9-8075-880f432ac214@github.com> <9fehmV7ApBB0zWUi5j9fXWGXBVo-0R_EugkCRJ1q2Sw=.1453b091-2231-41ab-9e11-095d3b73a103@github.com> Message-ID: <2h8u7s091pfLxdAO6lM560QeQ5dnBannHGzZAb2Rcl8=.be7461a4-9a75-46c7-8e42-6c4ee53dc500@github.com> On Thu, 28 Mar 2024 13:57:45 GMT, Emanuel Peter wrote: >> **Problem** >> In [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190) / https://git.openjdk.org/jdk/pull/14785 I fixed the alignment with `AlignVector`. For that, I had to compute `abs(scale)` and `abs(stride)`, as well as `scale * stride`. The issue is that all of these values can overflow the int range (e.g. `abs(min_int) = min_int`). >> >> We hit asserts like: >> >> `# assert(is_power_of_2(value)) failed: value must be a power of 2: 0xffffffff80000000` >> Happens because we take `abs(min_int)`, which is `min_int = 0x80000000`, and assuming this was a positive (unsigned) number is a power of 2 `2^31`. We then expand it to `long`, get `0xffffffff80000000`, which is not a power of 2 anymore. This violates the implicit assumptions, and we hit the assert. 
>> >> `# assert(q >= 1) failed: modulo value must be large enough` >> We have `scale = 2^30` and `stride = 4 = 2^2`. For the alignment calculation we compute `scale * stride = 2^32`, which overflows the int range and becomes zero. >> >> Before [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190) we could get similar issues with the (old) code in `SuperWord::ref_is_alignable`, if `AlignVector` is enabled: >> >> >> int span = preloop_stride * p.scale_in_bytes(); >> ... >> if (vw % span == 0) { >> >> >> if `span == 0` because of overflow, then the `idiv` from the modulo gets a division by zero -> `SIGFPE`. >> >> But it seems the bug is possibly a regression from JDK20 b2 [JDK-8286197](https://bugs.openjdk.org/browse/JDK-8286197). Here we enabled certaint Unsafe memory access address patterns, and it is such patterns that the reproducer requires. >> >> **Solution** >> I could either patch up all the code that works with `scale` and `stride`, and make sure no overflows ever happen. But that is quite involved and error prone. >> >> I now just disable vectorization for large `scale` and `stride`. This should not have any performance impact, because such large `scale` and `stride` would lead to highly inefficient memory accesses, since they are spaced very far apart. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > improve comments I agree with that. That's a reasonable, safe and easy solution. src/hotspot/share/opto/vectorization.cpp line 396: > 394: NOT_PRODUCT(_tracer.ctor_6(mem);) > 395: > 396: // In the pointer analysis, and especially the AlignVector analysis we assume that Suggestion: // In the pointer analysis, and especially the AlignVector, analysis we assume that src/hotspot/share/opto/vectorization.cpp line 402: > 400: // to at least allow small and moderately large stride and scale. 
Therefore, we > 401: // allow values up to 2^30, which is only a factor 2 smaller than the max/min int > 402: // values. Normal performance relevant code will have much lower. And the restriction Suggestion: // allow values up to 2^30, which is only a factor 2 smaller than the max/min int. // Normal performance relevant code will have much lower values. And the restriction ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18485#pullrequestreview-1966505218 PR Review Comment: https://git.openjdk.org/jdk/pull/18485#discussion_r1543149553 PR Review Comment: https://git.openjdk.org/jdk/pull/18485#discussion_r1543154067 From epeter at openjdk.org Thu Mar 28 15:37:56 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 28 Mar 2024 15:37:56 GMT Subject: RFR: 8328938: C2 SuperWord: disable vectorization for large stride and scale [v3] In-Reply-To: <9oXO4yuvZbpAxofIUBGVwJ2WyBLPWcP2IHxqZg5nQNQ=.f8f9365c-56c5-4fa9-8075-880f432ac214@github.com> References: <9oXO4yuvZbpAxofIUBGVwJ2WyBLPWcP2IHxqZg5nQNQ=.f8f9365c-56c5-4fa9-8075-880f432ac214@github.com> Message-ID: > **Problem** > In [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190) / https://git.openjdk.org/jdk/pull/14785 I fixed the alignment with `AlignVector`. For that, I had to compute `abs(scale)` and `abs(stride)`, as well as `scale * stride`. The issue is that all of these values can overflow the int range (e.g. `abs(min_int) = min_int`). > > We hit asserts like: > > `# assert(is_power_of_2(value)) failed: value must be a power of 2: 0xffffffff80000000` > Happens because we take `abs(min_int)`, which is `min_int = 0x80000000`, and assuming this was a positive (unsigned) number is a power of 2 `2^31`. We then expand it to `long`, get `0xffffffff80000000`, which is not a power of 2 anymore. This violates the implicit assumptions, and we hit the assert. 
> 
> `# assert(q >= 1) failed: modulo value must be large enough`
> We have `scale = 2^30` and `stride = 4 = 2^2`. For the alignment calculation we compute `scale * stride = 2^32`, which overflows the int range and becomes zero.
> 
> Before [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190) we could get similar issues with the (old) code in `SuperWord::ref_is_alignable`, if `AlignVector` is enabled:
> 
> int span = preloop_stride * p.scale_in_bytes();
> ...
> if (vw % span == 0) {
> 
> if `span == 0` because of overflow, then the `idiv` from the modulo gets a division by zero -> `SIGFPE`.
> 
> But it seems the bug is possibly a regression from JDK20 b2 [JDK-8286197](https://bugs.openjdk.org/browse/JDK-8286197). Here we enabled certain Unsafe memory access address patterns, and it is such patterns that the reproducer requires.
> 
> **Solution**
> I could either patch up all the code that works with `scale` and `stride`, and make sure no overflows ever happen. But that is quite involved and error prone.
> 
> I now just disable vectorization for large `scale` and `stride`. This should not have any performance impact, because such large `scale` and `stride` would lead to highly inefficient memory accesses, since they are spaced very far apart.
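Both overflow modes quoted above are easy to reproduce in plain int arithmetic. The following standalone Java sketch (illustrative, not HotSpot code) shows `abs(min_int)` wrapping back to `min_int`, the resulting 64-bit value failing a power-of-two check, and `scale * stride` overflowing to zero:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        // abs(min_int) cannot be represented in 32 bits: it wraps back to min_int.
        int minInt = Integer.MIN_VALUE;                 // 0x80000000
        if (Math.abs(minInt) != minInt) throw new AssertionError();

        // Widening that "absolute value" to 64 bits sign-extends it to
        // 0xffffffff80000000, which is no longer a power of two.
        long widened = (long) Math.abs(minInt);
        if (widened != 0xffffffff80000000L) throw new AssertionError();
        boolean isPow2 = widened != 0 && (widened & (widened - 1)) == 0;
        if (isPow2) throw new AssertionError();         // is_power_of_2 would assert here

        // scale * stride overflows int: 2^30 * 4 == 2^32 == 0 in 32-bit arithmetic,
        // so a later "vw % span" would be a division by zero.
        int scale = 1 << 30, stride = 4;
        int span = scale * stride;
        if (span != 0) throw new AssertionError();
        System.out.println("ok");
    }
}
```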
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Apply suggestions from code review Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18485/files - new: https://git.openjdk.org/jdk/pull/18485/files/4d9e05d2..9f8ed495 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18485&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18485&range=01-02 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18485.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18485/head:pull/18485 PR: https://git.openjdk.org/jdk/pull/18485 From iveresov at openjdk.org Thu Mar 28 15:48:33 2024 From: iveresov at openjdk.org (Igor Veresov) Date: Thu, 28 Mar 2024 15:48:33 GMT Subject: RFR: 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 [v4] In-Reply-To: References: Message-ID: On Thu, 28 Mar 2024 12:08:01 GMT, Volker Simonis wrote: >> Since [JDK-8251462: Simplify compilation policy](https://bugs.openjdk.org/browse/JDK-8251462), introduced in JDK 17, no native wrappers are generated any more if running with `-XX:-TieredCompilation` (i.e. native methods are not compiled any more). 
>> 
>> The attached JMH benchmark demonstrates that native method calls became twice as expensive with JDK 17:
>> 
>> public static native void emptyStaticNativeMethod();
>> 
>> @Benchmark
>> public static void baseline() {
>> }
>> 
>> @Benchmark
>> public static void staticMethodCallingStatic() {
>> emptyStaticMethod();
>> }
>> 
>> @Benchmark
>> public static void staticMethodCallingStaticNative() {
>> emptyStaticNativeMethod();
>> }
>> 
>> @Benchmark
>> @Fork(jvmArgsAppend = "-XX:-TieredCompilation")
>> public static void staticMethodCallingStaticNativeNoTiered() {
>> emptyStaticNativeMethod();
>> }
>> 
>> @Benchmark
>> @Fork(jvmArgsAppend = "-XX:+PreferInterpreterNativeStubs")
>> public static void staticMethodCallingStaticNativeIntStub() {
>> emptyStaticNativeMethod();
>> }
>> 
>> JDK 11
>> ======
>> 
>> Benchmark                                           Mode  Cnt   Score   Error  Units
>> NativeCall.baseline                                 avgt    5   0.390 ± 0.016  ns/op
>> NativeCall.staticMethodCallingStatic                avgt    5   1.693 ± 0.053  ns/op
>> NativeCall.staticMethodCallingStaticNative          avgt    5  10.287 ± 0.754  ns/op
>> NativeCall.staticMethodCallingStaticNativeNoTiered  avgt    5   9.966 ± 0.248  ns/op
>> NativeCall.staticMethodCallingStaticNativeIntStub   avgt    5  20.384 ± 0.444  ns/op
>> 
>> JDK 17 & 21
>> ===========
>> 
>> Benchmark                                           Mode  Cnt   Score   Error  Units
>> NativeCall.baseline                                 avgt    5   0.390 ± 0.017  ns/op
>> NativeCall.staticMethodCallingStatic                avgt    5   1.852 ± 0.272  ns/op
>> NativeCall.staticMethodCallingStaticNative          avgt    5  10.648 ± 0.661  ns/op
>> NativeCall.staticMethodCallingStaticNativeNoTiered  avgt    5  20.657 ± 1.084  ns/op
>> NativeCall.staticMethodCallingStaticNativeIntStub   avgt    5  22.429 ± 0.991  ns/op
>> 
>> The issue can be seen if we run with `-XX:+PrintCompilation -XX:+PrintInlining`. With JDK 11 we get the following output for `-XX:+TieredCompilation`:
>> 
>> 172  111  b  3  io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes)
>>      @ 0  io.simonis.NativeCall::emptyStaticNa... 
> Volker Simonis has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Compile native methods like trivial methods and added JTreg test

LGTM

------------- Marked as reviewed by iveresov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18496#pullrequestreview-1966588410 From chagedorn at openjdk.org Thu Mar 28 16:16:33 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 28 Mar 2024 16:16:33 GMT Subject: RFR: 8328480: C2: SubTypeCheckNode in checkcast should use the klass constant of a unique concrete sub class [v2] In-Reply-To: References: Message-ID: 

On Thu, 28 Mar 2024 15:14:01 GMT, Roland Westrelin wrote:

> Do you really need to special case that one? Wouldn't the CastPP constant fold if improved_klass_ptr_type is a singleton?

For `Y <: abstract X`, when `superklass` is a precise constant `X` and `improved_klass_ptr_type` is a precise constant `Y`, then `CastPP(nullptr, superklass, improved_klass_ptr_type)` will be replaced by `top` and assigned to `improved_superklass`, which is probably not what we want. That's why I special-cased the singleton case. This could happen, for example, with:

obj = (X)o; // superklass = ConP #precise X, improved_klass_ptr_type = precise Y

> Also there is the question of whether a CastPP or a CheckCastPP should be used here and if it matters. They don't differ that much anymore so I suppose CastPP is ok.

Okay, sounds good.
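The `Y <: abstract X` shape being discussed can be pictured with a plain-Java sketch (class and method names here are illustrative, not from the JDK tests): any object that passes the `(X)` checkcast must in fact be a `Y`, which is why the sub type check may be strengthened to the unique concrete subclass.

```java
public class UniqueSubclassDemo {
    // Illustrative names: X is the abstract super class, Y its unique
    // concrete subclass (not HotSpot code).
    abstract static class X {}
    static final class Y extends X {}

    static X cast(Object o) {
        // The checkcast nominally checks against X, but since Y is the only
        // concrete subclass, the type check may be strengthened to Y.
        return (X) o;
    }

    public static void main(String[] args) {
        X x = cast(new Y());
        if (!(x instanceof Y)) throw new AssertionError();

        // Anything that is not a Y can never pass the (X) cast:
        boolean threw = false;
        try {
            cast("not an X");
        } catch (ClassCastException e) {
            threw = true;
        }
        if (!threw) throw new AssertionError();
        System.out.println("ok");
    }
}
```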
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18515#discussion_r1543245301 From chagedorn at openjdk.org Thu Mar 28 16:17:34 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 28 Mar 2024 16:17:34 GMT Subject: RFR: 8328938: C2 SuperWord: disable vectorization for large stride and scale [v3] In-Reply-To: References: <9oXO4yuvZbpAxofIUBGVwJ2WyBLPWcP2IHxqZg5nQNQ=.f8f9365c-56c5-4fa9-8075-880f432ac214@github.com> Message-ID: On Thu, 28 Mar 2024 15:37:56 GMT, Emanuel Peter wrote: >> **Problem** >> In [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190) / https://git.openjdk.org/jdk/pull/14785 I fixed the alignment with `AlignVector`. For that, I had to compute `abs(scale)` and `abs(stride)`, as well as `scale * stride`. The issue is that all of these values can overflow the int range (e.g. `abs(min_int) = min_int`). >> >> We hit asserts like: >> >> `# assert(is_power_of_2(value)) failed: value must be a power of 2: 0xffffffff80000000` >> Happens because we take `abs(min_int)`, which is `min_int = 0x80000000`, and assuming this was a positive (unsigned) number is a power of 2 `2^31`. We then expand it to `long`, get `0xffffffff80000000`, which is not a power of 2 anymore. This violates the implicit assumptions, and we hit the assert. >> >> `# assert(q >= 1) failed: modulo value must be large enough` >> We have `scale = 2^30` and `stride = 4 = 2^2`. For the alignment calculation we compute `scale * stride = 2^32`, which overflows the int range and becomes zero. >> >> Before [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190) we could get similar issues with the (old) code in `SuperWord::ref_is_alignable`, if `AlignVector` is enabled: >> >> >> int span = preloop_stride * p.scale_in_bytes(); >> ... >> if (vw % span == 0) { >> >> >> if `span == 0` because of overflow, then the `idiv` from the modulo gets a division by zero -> `SIGFPE`. 
>> 
>> But it seems the bug is possibly a regression from JDK20 b2 [JDK-8286197](https://bugs.openjdk.org/browse/JDK-8286197). Here we enabled certain Unsafe memory access address patterns, and it is such patterns that the reproducer requires.
>> 
>> **Solution**
>> I could either patch up all the code that works with `scale` and `stride`, and make sure no overflows ever happen. But that is quite involved and error prone.
>> 
>> I now just disable vectorization for large `scale` and `stride`. This should not have any performance impact, because such large `scale` and `stride` would lead to highly inefficient memory accesses, since they are spaced very far apart.
> 
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Apply suggestions from code review
>   
>   Co-authored-by: Christian Hagedorn

Marked as reviewed by chagedorn (Reviewer).

------------- PR Review: https://git.openjdk.org/jdk/pull/18485#pullrequestreview-1966660490 From roland at openjdk.org Thu Mar 28 16:27:32 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 28 Mar 2024 16:27:32 GMT Subject: RFR: 8328480: C2: SubTypeCheckNode in checkcast should use the klass constant of a unique concrete sub class [v2] In-Reply-To: References: Message-ID: 

On Thu, 28 Mar 2024 16:14:03 GMT, Christian Hagedorn wrote:

>> src/hotspot/share/opto/graphKit.cpp line 3364:
>> 
>>> 3362: if (improved_klass_ptr_type != klass_ptr_type) {
>>> 3363: if (improved_klass_ptr_type->singleton()) {
>>> 3364: improved_superklass = makecon(improved_klass_ptr_type);
>> 
>> Do you really need to special case that one? Wouldn't the CastPP constant fold if `improved_klass_ptr_type` is a singleton?
>> Also there is the question of whether a `CastPP` or a `CheckCastPP` should be used here and if it matters. They don't differ that much anymore so I suppose `CastPP` is ok.
> 
>> Do you really need to special case that one? 
Wouldn't the CastPP constant fold if improved_klass_ptr_type is a singleton? > > For `Y <: abstract X`, when `superklass` is a precise constant `X` and `improved_klass_ptr_type` is a precise constant `Y`, then `CastPP(nullptr, superklass, improved_klass_ptr_type)` will be replaced by `top` and assigned to `improved_superklass` which is probably not what we want. That's why I special cased the singleton case. This could happen, for example, with: > > obj = (X)o; // superklass = ConP #precise X, improved_klass_ptr_type = precise Y > > >> Also there is the question of whether a CastPP or a CheckCastPP should be used here and if it matters. They don't differ that much anymore so I suppose CastPP is ok. > > Okay, sound good. Ok. Makes sense. Thanks for the explanation. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18515#discussion_r1543265259 From roland at openjdk.org Thu Mar 28 16:46:32 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 28 Mar 2024 16:46:32 GMT Subject: RFR: 8328480: C2: SubTypeCheckNode in checkcast should use the klass constant of a unique concrete sub class [v2] In-Reply-To: References: Message-ID: On Thu, 28 Mar 2024 16:24:48 GMT, Roland Westrelin wrote: >>> Do you really need to special case that one? Wouldn't the CastPP constant fold if improved_klass_ptr_type is a singleton? >> >> For `Y <: abstract X`, when `superklass` is a precise constant `X` and `improved_klass_ptr_type` is a precise constant `Y`, then `CastPP(nullptr, superklass, improved_klass_ptr_type)` will be replaced by `top` and assigned to `improved_superklass` which is probably not what we want. That's why I special cased the singleton case. This could happen, for example, with: >> >> obj = (X)o; // superklass = ConP #precise X, improved_klass_ptr_type = precise Y >> >> >>> Also there is the question of whether a CastPP or a CheckCastPP should be used here and if it matters. They don't differ that much anymore so I suppose CastPP is ok. 
>> 
>> Okay, sounds good.

Ok. Makes sense. Thanks for the explanation.

Then isn't there a risk that after some transformation the `CastPP` ends up with an input that's a constant superklass which would cause the `CastPP` to transform to top?

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18515#discussion_r1543293569 From qamai at openjdk.org Thu Mar 28 17:30:46 2024 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 28 Mar 2024 17:30:46 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v33] In-Reply-To: References: <0D9E-3Nj0VvCYUmIXKgMoRI7W3xioc6n5phQ_TGNHRE=.80f0ef3a-243d-4eea-9351-c407ed92b6b8@github.com> Message-ID: 

On Thu, 25 Jan 2024 12:00:30 GMT, Emanuel Peter wrote:

>> @rgiulietti Thanks very much for your reviews
>> @vnkozlov @eme64 Could you do another round of reviews, please? There has not been much change, though.
> 
> @merykitty I see there are some proofs now, great! I'll have a look soon :)

@eme64 Gentle ping on this.

------------- PR Comment: https://git.openjdk.org/jdk/pull/9947#issuecomment-2025750010 From sviswanathan at openjdk.org Thu Mar 28 18:02:32 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 28 Mar 2024 18:02:32 GMT Subject: RFR: 8323116: [REDO] Computational test more than 2x slower when AVX instructions are used [v4] In-Reply-To: <-niLtlGA1SlJaGnF7lPakY0eXX1njpTFpuTuWslhMMc=.846d48d4-74ee-43cf-9b95-9665f5fe0816@github.com> References: <8HUy9c75ZHxW1CyRw6J-xXBPNbtB7wrxoM6rha0ftNU=.d25e5130-84c2-4046-b357-eacfe6caedc0@github.com> <-niLtlGA1SlJaGnF7lPakY0eXX1njpTFpuTuWslhMMc=.846d48d4-74ee-43cf-9b95-9665f5fe0816@github.com> Message-ID: 

On Thu, 28 Mar 2024 00:45:33 GMT, Srinivas Vamsi Parasa wrote:

>> The goal of this PR is to improve the performance of convert instructions and address the slowdown when AVX>0 is used. 
>> 
>> The performance data using the ComputePI.java benchmark (part of this PR) is as follows:
>> 
>> Benchmark (ns/op) | Stock JDK | This PR (AVX=3) | Speedup
>> -- | -- | -- | --
>> ComputePI.compute_pi_dbl_flt | 511.34 | 510.989 | 1.0
>> ComputePI.compute_pi_flt_dbl | 2024.06 | 518.695 | 3.9
>> ComputePI.compute_pi_int_dbl | 695.482 | 453.054 | 1.5
>> ComputePI.compute_pi_int_flt | 799.268 | 449.83 | 1.8
>> ComputePI.compute_pi_long_dbl | 802.992 | 454.891 | 1.8
>> ComputePI.compute_pi_long_flt | 628.62 | 463.617 | 1.4
>> 
>> Benchmark (ns/op) | Stock JDK | This PR (AVX=0) | Speedup
>> -- | -- | -- | --
>> ComputePI.compute_pi_dbl_flt | 473.778 | 472.529 | 1.0
>> ComputePI.compute_pi_flt_dbl | 536.004 | 538.418 | 1.0
>> ComputePI.compute_pi_int_dbl | 458.08 | 460.245 | 1.0
>> ComputePI.compute_pi_int_flt | 477.305 | 476.975 | 1.0
>> ComputePI.compute_pi_long_dbl | 455.132 | 455.064 | 1.0
>> ComputePI.compute_pi_long_flt | 474.734 | 476.571 | 1.0
>> 
>> >> >> >> >> > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > fix L2F cvtsi2ssq Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18503#pullrequestreview-1966904622 From qamai at openjdk.org Thu Mar 28 18:13:45 2024 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 28 Mar 2024 18:13:45 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v49] In-Reply-To: References: Message-ID: On Wed, 20 Mar 2024 12:02:06 GMT, Raffaello Giulietti wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> fix tests > > src/hotspot/share/opto/divnode.cpp line 245: > >> 243: // to rounding down, now it is guaranteed to be correct, according to >> 244: // N-Bit Unsigned Division Via N-Bit Multiply-Add by Arch D. Robison >> 245: magic_divide_constants_round_down(divisor, magic_const, shift_const); > > I think there's no need for `magic_divide_constants_round_down()`. > > Firstly, we recover the previous value of `magic_const` and `shift_const`, just before the overflow: > > magic_const = magic_const + 1 >> 1 | 0x8000_0000; > shift_const -= 1; > > Then we decrement `magic_const` by one: > > magic_const -= 1; > > That's it. > If desired, we can additionally reduce `magic_const` to an odd value by right-shifting it by the number of trailing zero bits, and updating `shift_const` accordingly. But for the usage here, I think it makes no real difference. > > We can thus avoid the division in `magic_divide_constants_round_down()`, and can get rid of the method altogether, as it seems to be used only here. > This might even open similar code for the `julong` case. @rgiulietti I think that is a brilliant observation. 
However, I think that doing so would not reduce the code complexity, and it also puts another burden of proof on us for the mathematical lemma.

Regarding the `julong` case, rounding down is interesting, as we can emit code like this:

addq dividend, 1
je overflow
... (the remaining calculation)

And if `dividend` is known to be different from `max_julong` then the `je` can be removed. As a result, I think it would be possible to address your idea in a follow-up RFE, as this patch is already dense enough in mathematical proofs. Thanks a lot.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1543415382 From kvn at openjdk.org Thu Mar 28 18:45:30 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 28 Mar 2024 18:45:30 GMT Subject: RFR: 8329174: update CodeBuffer layout in comment after constants section moved In-Reply-To: References: Message-ID: 

On Thu, 28 Mar 2024 02:50:40 GMT, lusou-zhangquan wrote:

> Enhancement [JDK-6961697](https://bugs.openjdk.org/browse/JDK-6961697) moved the nmethod constants section before the instruction section, but the layout scheme in codeBuffer.cpp was not changed correspondingly. The mismatch between the layout scheme in the source code and the actual layout is misleading, so we'd better fix it.

Trivial.

------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18529#pullrequestreview-1966996552 From kvn at openjdk.org Thu Mar 28 19:16:32 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 28 Mar 2024 19:16:32 GMT Subject: RFR: 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 [v4] In-Reply-To: References: Message-ID: 

On Thu, 28 Mar 2024 12:08:01 GMT, Volker Simonis wrote:

>> Since [JDK-8251462: Simplify compilation policy](https://bugs.openjdk.org/browse/JDK-8251462), introduced in JDK 17, no native wrappers are generated any more if running with `-XX:-TieredCompilation` (i.e. native methods are not compiled any more).
>> 
>> The attached JMH benchmark demonstrates that native method calls became twice as expensive with JDK 17:
>> 
>> public static native void emptyStaticNativeMethod();
>> 
>> @Benchmark
>> public static void baseline() {
>> }
>> 
>> @Benchmark
>> public static void staticMethodCallingStatic() {
>> emptyStaticMethod();
>> }
>> 
>> @Benchmark
>> public static void staticMethodCallingStaticNative() {
>> emptyStaticNativeMethod();
>> }
>> 
>> @Benchmark
>> @Fork(jvmArgsAppend = "-XX:-TieredCompilation")
>> public static void staticMethodCallingStaticNativeNoTiered() {
>> emptyStaticNativeMethod();
>> }
>> 
>> @Benchmark
>> @Fork(jvmArgsAppend = "-XX:+PreferInterpreterNativeStubs")
>> public static void staticMethodCallingStaticNativeIntStub() {
>> emptyStaticNativeMethod();
>> }
>> 
>> JDK 11
>> ======
>> 
>> Benchmark                                           Mode  Cnt   Score   Error  Units
>> NativeCall.baseline                                 avgt    5   0.390 ± 0.016  ns/op
>> NativeCall.staticMethodCallingStatic                avgt    5   1.693 ± 0.053  ns/op
>> NativeCall.staticMethodCallingStaticNative          avgt    5  10.287 ± 0.754  ns/op
>> NativeCall.staticMethodCallingStaticNativeNoTiered  avgt    5   9.966 ± 0.248  ns/op
>> NativeCall.staticMethodCallingStaticNativeIntStub   avgt    5  20.384 ± 0.444  ns/op
>> 
>> JDK 17 & 21
>> ===========
>> 
>> Benchmark                                           Mode  Cnt   Score   Error  Units
>> NativeCall.baseline                                 avgt    5   0.390 ± 0.017  ns/op
>> NativeCall.staticMethodCallingStatic                avgt    5   1.852 ± 0.272  ns/op
>> NativeCall.staticMethodCallingStaticNative          avgt    5  10.648 ± 0.661  ns/op
>> NativeCall.staticMethodCallingStaticNativeNoTiered  avgt    5  20.657 ± 1.084  ns/op
>> NativeCall.staticMethodCallingStaticNativeIntStub   avgt    5  22.429 ± 0.991  ns/op
>> 
>> The issue can be seen if we run with `-XX:+PrintCompilation -XX:+PrintInlining`. With JDK 11 we get the following output for `-XX:+TieredCompilation`:
>> 
>> 172  111  b  3  io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes)
>>      @ 0  io.simonis.NativeCall::emptyStaticNa... 
> Volker Simonis has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Compile native methods like trivial methods and added JTreg test

I submitted our testing for the v03 version.

------------- PR Comment: https://git.openjdk.org/jdk/pull/18496#issuecomment-2025928218 From dlong at openjdk.org Thu Mar 28 19:20:32 2024 From: dlong at openjdk.org (Dean Long) Date: Thu, 28 Mar 2024 19:20:32 GMT Subject: RFR: 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode [v2] In-Reply-To: References: Message-ID: <-oedsnRFZWnYi1gaAx92K6dsKuxTDeqIZ_Kvt5Fr6sw=.8c74b016-1a1f-45a3-9224-cc0c56d56906@github.com> 

On Wed, 27 Mar 2024 09:32:29 GMT, Emanuel Peter wrote:

>> SUN Guoyun has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:
>> 
>>  - 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode
>>  - 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode
> 
> A possible counter-example:
> 
> x1 = something
> y1 = someCall
> 
> for (int i = 0; i < a.length; i++) {
>   a[i] = ((x + 1) + y) + ((x + 2) + y) + ((x + 3) + y) + ((x + 4) + y)
> }
> 
> The call is outside the loop, so folding would not be costly at all. And I fear that the 4 terms would not common up, and so be slower after your change. And I think there are probably other examples. But I have not benchmarked anything, so I could be quite wrong.
> 
> What exactly is it that gives you the speedup in your benchmark? Spilling? Fewer add instructions? Would be nice to understand that better, and see what are potential examples where we would have regressions with your patch.

Why only (x+1)+y and not also x+(y+1)? I agree with @eme64 about breaking other optimizations if we don't move constants to the right. (x+1)+(y+1) no longer becomes (x+y)+2. To reduce spills, I would think we would want to move Calls to the left. 
(x+1)+y and x+(y+1) both become y+x+1. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18482#issuecomment-2025935270 From dlong at openjdk.org Thu Mar 28 20:43:31 2024 From: dlong at openjdk.org (Dean Long) Date: Thu, 28 Mar 2024 20:43:31 GMT Subject: RFR: 8300148: Consider using a StoreStore barrier instead of Release barrier on ctor exit [v3] In-Reply-To: References: Message-ID: On Wed, 27 Mar 2024 22:28:26 GMT, Chen Liang wrote: >> Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: >> >> EA tests, static test classes, add @requires, fix comment > > I heard rumors that storeStore is only safe for the scenarios where the constructor doesn't read its already assigned final fields; so if we have something like > > class Sample { > final int a, b; > Sample(int v) { > this.a = v; > this.b = this.a + 1; // performs read instance field before publication > } > } > > then we still need a regular release barrier. > > Am I correct here? @liach, in your example, I don't see an issue, unless "this" somehow escaped and allowed another thread to write to this.a, which would be difficult considering it is "final". This earlier discussion might help: https://mail.openjdk.org/pipermail/jmm-dev/2016-November/000381.html ------------- PR Comment: https://git.openjdk.org/jdk/pull/18505#issuecomment-2026077050 From dlong at openjdk.org Thu Mar 28 22:57:32 2024 From: dlong at openjdk.org (Dean Long) Date: Thu, 28 Mar 2024 22:57:32 GMT Subject: RFR: 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode [v2] In-Reply-To: References: Message-ID: <7BQsktD5eUnjkE1Ujadh1eiyiMPCNhWDBgzU3xrdmnk=.2c593179-c41b-4dd5-8344-c9a6fb3e9ee2@github.com> On Wed, 27 Mar 2024 08:45:55 GMT, SUN Guoyun wrote: >> This patch prohibits the conversion from "(x+1)+y" into "(x+y)+1" when y is a CallNode to reduce unnecessary spillcode and ADDNode. >> >> Testing: tier1-3 in x86_64 and LoongArch64 >> >> JMH in x86_64: >>
>> before:
>> Benchmark           Mode  Cnt      Score   Error  Units
>> CallNode.test      thrpt    2  26397.733          ops/s
>> 
>> after:
>> Benchmark           Mode  Cnt      Score   Error  Units
>> CallNode.test      thrpt    2  27839.337          ops/s
>> 
> > SUN Guoyun has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: > > - 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode > - 8328865: [c2] No need to convert "(x+1)+y" into "(x+y)+1" when y is a CallNode The difference in generated code above seems very x86-specific, depending on differences between 2-operand ADD vs 3-operand LEA. It's not obvious to me why this change would necessarily generate better code, even in some cases. And the requirement should be leaning towards better or no worse code in all cases on all platforms. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18482#issuecomment-2026274309 From kvn at openjdk.org Thu Mar 28 23:21:31 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 28 Mar 2024 23:21:31 GMT Subject: RFR: 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 [v4] In-Reply-To: References: Message-ID: On Thu, 28 Mar 2024 12:08:01 GMT, Volker Simonis wrote: >> Since [JDK-8251462: Simplify compilation policy](https://bugs.openjdk.org/browse/JDK-8251462), introduced in JDK 17, no native wrappers are generated any more if running with `-XX:-TieredCompilation` (i.e. native methods are not compiled any more).
>> >> The attached JMH benchmark demonstrates that native method calls became twice as expensive with JDK 17: >> >> public static native void emptyStaticNativeMethod(); >> >> @Benchmark >> public static void baseline() { >> } >> >> @Benchmark >> public static void staticMethodCallingStatic() { >> emptyStaticMethod(); >> } >> >> @Benchmark >> public static void staticMethodCallingStaticNative() { >> emptyStaticNativeMethod(); >> } >> >> @Benchmark >> @Fork(jvmArgsAppend = "-XX:-TieredCompilation") >> public static void staticMethodCallingStaticNativeNoTiered() { >> emptyStaticNativeMethod(); >> } >> >> @Benchmark >> @Fork(jvmArgsAppend = "-XX:+PreferInterpreterNativeStubs") >> public static void staticMethodCallingStaticNativeIntStub() { >> emptyStaticNativeMethod(); >> } >> >> >> JDK 11 >> ====== >> >> Benchmark Mode Cnt Score Error Units >> NativeCall.baseline avgt 5 0.390 ± 0.016 ns/op >> NativeCall.staticMethodCallingStatic avgt 5 1.693 ± 0.053 ns/op >> NativeCall.staticMethodCallingStaticNative avgt 5 10.287 ± 0.754 ns/op >> NativeCall.staticMethodCallingStaticNativeNoTiered avgt 5 9.966 ± 0.248 ns/op >> NativeCall.staticMethodCallingStaticNativeIntStub avgt 5 20.384 ± 0.444 ns/op >> >> >> JDK 17 & 21 >> =========== >> >> Benchmark Mode Cnt Score Error Units >> NativeCall.baseline avgt 5 0.390 ± 0.017 ns/op >> NativeCall.staticMethodCallingStatic avgt 5 1.852 ± 0.272 ns/op >> NativeCall.staticMethodCallingStaticNative avgt 5 10.648 ± 0.661 ns/op >> NativeCall.staticMethodCallingStaticNativeNoTiered avgt 5 20.657 ± 1.084 ns/op >> NativeCall.staticMethodCallingStaticNativeIntStub avgt 5 22.429 ± 0.991 ns/op >> >> >> The issue can be seen if we run with `-XX:+PrintCompilation -XX:+PrintInlining`. With JDK 11 we get the following output for `-XX:+TieredCompilation`: >> >> 172 111 b 3 io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes) >> @ 0 io.simonis.NativeCall::emptyStaticNa...
> > Volker Simonis has updated the pull request incrementally with one additional commit since the last revision: > > Compile native methods like trivial methods and added JTreg test My tier1-3,xcomp,stress testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18496#pullrequestreview-1967526534 From fyang at openjdk.org Fri Mar 29 02:21:31 2024 From: fyang at openjdk.org (Fei Yang) Date: Fri, 29 Mar 2024 02:21:31 GMT Subject: RFR: 8317720: RISC-V: Implement Adler32 intrinsic In-Reply-To: References: Message-ID: <6VkrGNhwJb0yXsm9qgOicZ8aiHkHnX7dynR1TrXOp5A=.f2c9d34b-1e1c-4670-b0a0-33d46c2781fd@github.com> On Tue, 19 Mar 2024 17:03:26 GMT, ArsenyBochkarev wrote: > Hello everyone! Please review this non-vectorized implementation of the `_updateBytesAdler32` intrinsic. The reference implementation for AArch64 can be found [here](https://github.com/openjdk/jdk9/blob/master/hotspot/src/cpu/aarch64/vm/stubGenerator_aarch64.cpp#L3281). > > ### Correctness checks > > Test `test/hotspot/jtreg/compiler/intrinsics/zip/TestAdler32.java` is ok. All tier1 tests also passed.
> > ### Performance results on T-Head board > > Enabled intrinsic: > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | ------------------------------------- | ----------- | ------ | --------- | ------ | --------- | ---------- | > | Adler32.TestAdler32.testAdler32Update | 64 | thrpt | 25 | 5522.693 | 23.387 | ops/ms | > | Adler32.TestAdler32.testAdler32Update | 128 | thrpt | 25 | 3430.761 | 9.210 | ops/ms | > | Adler32.TestAdler32.testAdler32Update | 256 | thrpt | 25 | 1962.888 | 5.323 | ops/ms | > | Adler32.TestAdler32.testAdler32Update | 512 | thrpt | 25 | 1050.938 | 0.144 | ops/ms | > | Adler32.TestAdler32.testAdler32Update | 1024 | thrpt | 25 | 549.227 | 0.375 | ops/ms | > | Adler32.TestAdler32.testAdler32Update | 2048 | thrpt | 25 | 280.829 | 0.170 | ops/ms | > | Adler32.TestAdler32.testAdler32Update | 5012 | thrpt | 25 | 116.333 | 0.057 | ops/ms | > | Adler32.TestAdler32.testAdler32Update | 8192 | thrpt | 25 | 71.392 | 0.060 | ops/ms | > | Adler32.TestAdler32.testAdler32Update | 16384 | thrpt | 25 | 35.784 | 0.019 | ops/ms | > | Adler32.TestAdler32.testAdler32Update | 32768 | thrpt | 25 | 17.924 | 0.010 | ops/ms | > | Adler32.TestAdler32.testAdler32Update | 65536 | thrpt | 25 | 8.940 | 0.003 | ops/ms | > > Disabled intrinsic: > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | ------------------------------------- | ----------- | ------ | --------- | ------ | --------- | ---------- | > |Adler32.TestAdler32.testAdler32Update|64|thrpt|25|655.633|5.845|ops/ms| > |Adler32.TestAdler32.testAdler32Update|128|thrpt|25|587.418|10.062|ops/ms| > |Adler32.TestAdler32.testAdler32Update|256|thrpt|25|546.675|11.598|ops/ms| > |Adler32.TestAdler32.testAdler32Update|512|thrpt|25|432.328|11.517|ops/ms| > |Adler32.TestAdler32.testAdler32Update|1024|thrpt|25|311.771|4.238|ops/ms| > |Adler32.TestAdler32.testAdler32Update|2048|thrpt|25|202.648|2.486|ops/ms| > |Adler32.TestAdler32.testAdler32Update|5012|thrpt|25|100.246|1.119|ops/ms| > 
|Adler32.TestAdler32.testAdler32Update|8192|thr... Sorry for the late reply. Some comments after a brief look. BTW: Could you please merge master to retrigger the GHA? Thanks. src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5086: > 5084: const uint64_t BASE = 0xfff1; > 5085: const uint64_t NMAX = 0x15B0; > 5086: I think it's better to start a new frame on stub enter and exit with `__ enter()` and `__ leave()` respectively for proper stackwalking of RuntimeStub frame. src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5160: > 5158: __ srli(temp2, temp0, 56); > 5159: __ add(s1, s1, temp2); > 5160: __ add(s2, s2, s1); I see a lot of duplicate logic in this function. Can we factor out some common logic as separate functions? Like generate_updateBytesAdler32_accum_16, generate_updateBytesAdler32_accum_8, etc. ------------- PR Review: https://git.openjdk.org/jdk/pull/18382#pullrequestreview-1967697752 PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1543964079 PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1543958995 From vlivanov at openjdk.org Fri Mar 29 02:51:34 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 29 Mar 2024 02:51:34 GMT Subject: RFR: 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 [v4] In-Reply-To: References: Message-ID: On Thu, 28 Mar 2024 12:08:01 GMT, Volker Simonis wrote: >> Since [JDK-8251462: Simplify compilation policy](https://bugs.openjdk.org/browse/JDK-8251462), introduced in JDK 17, no native wrappers are generated any more if running with `-XX:-TieredCompilation` (i.e. native methods are not compiled any more). 
>> >> The attached JMH benchmark demonstrates that native method calls became twice as expensive with JDK 17: >> >> public static native void emptyStaticNativeMethod(); >> >> @Benchmark >> public static void baseline() { >> } >> >> @Benchmark >> public static void staticMethodCallingStatic() { >> emptyStaticMethod(); >> } >> >> @Benchmark >> public static void staticMethodCallingStaticNative() { >> emptyStaticNativeMethod(); >> } >> >> @Benchmark >> @Fork(jvmArgsAppend = "-XX:-TieredCompilation") >> public static void staticMethodCallingStaticNativeNoTiered() { >> emptyStaticNativeMethod(); >> } >> >> @Benchmark >> @Fork(jvmArgsAppend = "-XX:+PreferInterpreterNativeStubs") >> public static void staticMethodCallingStaticNativeIntStub() { >> emptyStaticNativeMethod(); >> } >> >> >> JDK 11 >> ====== >> >> Benchmark Mode Cnt Score Error Units >> NativeCall.baseline avgt 5 0.390 ± 0.016 ns/op >> NativeCall.staticMethodCallingStatic avgt 5 1.693 ± 0.053 ns/op >> NativeCall.staticMethodCallingStaticNative avgt 5 10.287 ± 0.754 ns/op >> NativeCall.staticMethodCallingStaticNativeNoTiered avgt 5 9.966 ± 0.248 ns/op >> NativeCall.staticMethodCallingStaticNativeIntStub avgt 5 20.384 ± 0.444 ns/op >> >> >> JDK 17 & 21 >> =========== >> >> Benchmark Mode Cnt Score Error Units >> NativeCall.baseline avgt 5 0.390 ± 0.017 ns/op >> NativeCall.staticMethodCallingStatic avgt 5 1.852 ± 0.272 ns/op >> NativeCall.staticMethodCallingStaticNative avgt 5 10.648 ± 0.661 ns/op >> NativeCall.staticMethodCallingStaticNativeNoTiered avgt 5 20.657 ± 1.084 ns/op >> NativeCall.staticMethodCallingStaticNativeIntStub avgt 5 22.429 ± 0.991 ns/op >> >> >> The issue can be seen if we run with `-XX:+PrintCompilation -XX:+PrintInlining`. With JDK 11 we get the following output for `-XX:+TieredCompilation`: >> >> 172 111 b 3 io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes) >> @ 0 io.simonis.NativeCall::emptyStaticNa...
> > Volker Simonis has updated the pull request incrementally with one additional commit since the last revision: > > Compile native methods like trivial methods and added JTreg test Good catch! Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18496#pullrequestreview-1967734109 From kvn at openjdk.org Fri Mar 29 19:51:32 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 29 Mar 2024 19:51:32 GMT Subject: RFR: 8329174: update CodeBuffer layout in comment after constants section moved In-Reply-To: References: Message-ID: On Thu, 28 Mar 2024 02:50:40 GMT, lusou-zhangquan wrote: > Enhancement [JDK-6961697](https://bugs.openjdk.org/browse/JDK-6961697) moved the nmethod constants section before the instruction section, but the layout scheme in codeBuffer.cpp was not changed correspondingly. The mismatch between the layout scheme in the source code and the actual layout is misleading, so we'd better fix it. Can we wait until #18554 is pushed? I don't want to introduce a conflict. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18529#issuecomment-2027673630 From fyang at openjdk.org Sat Mar 30 08:53:42 2024 From: fyang at openjdk.org (Fei Yang) Date: Sat, 30 Mar 2024 08:53:42 GMT Subject: RFR: 8329355: Test compiler/c2/irTests/TestIfMinMax.java fails on RISC-V Message-ID: <6ray-riEC6nAkcgIDes3YrwdJOhNVxm4NY5RXzYzwaE=.a57c4f75-3ac9-456e-a39e-58f7adcd4cb3@github.com> Please review this small change fixing an IR matching failure on the linux-riscv platform. JDK-8324655 tries to identify min/max patterns in CMoves and transform them into Min and Max nodes. But architectures like RISC-V don't have support for conditional moves at the ISA level for now. So we set the ConditionalMoveLimit parameter to 0 for this platform, and conditional moves are emulated with normal compare and branch instructions instead [1]. This is why the IR matching test added by JDK-8324655 fails on this platform.
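For readers following along, the source-level shape that JDK-8324655 recognizes is just the min/max ternary. A small illustrative sketch (method names are hypothetical, and whether C2 actually forms a CMove first depends on the platform's ConditionalMoveLimit):

```java
public class MinMaxShape {
    // Where C2 converts this compare-and-branch into a CMoveI,
    // JDK-8324655 can rewrite the CMoveI into a MinI node. On RISC-V,
    // with ConditionalMoveLimit set to 0, the branch form is kept, so
    // no MinI/MaxI nodes appear for the IR test to match.
    static int minTernary(int a, int b) {
        return a < b ? a : b;
    }

    static int maxTernary(int a, int b) {
        return a > b ? a : b;
    }

    public static void main(String[] args) {
        System.out.println(minTernary(3, 7) == Math.min(3, 7)); // true
        System.out.println(maxTernary(3, 7) == Math.max(3, 7)); // true
    }
}
```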
A simple way to fix this would be to skip this test for this case. [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/riscv.ad#L9775 ------------- Commit messages: - fix Changes: https://git.openjdk.org/jdk/pull/18558/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18558&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8329355 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18558.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18558/head:pull/18558 PR: https://git.openjdk.org/jdk/pull/18558 From simonis at openjdk.org Sat Mar 30 12:50:37 2024 From: simonis at openjdk.org (Volker Simonis) Date: Sat, 30 Mar 2024 12:50:37 GMT Subject: Integrated: 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 In-Reply-To: References: Message-ID: On Tue, 26 Mar 2024 19:14:23 GMT, Volker Simonis wrote: > Since [JDK-8251462: Simplify compilation policy](https://bugs.openjdk.org/browse/JDK-8251462), introduced in JDK 17, no native wrappers are generated any more if running with `-XX:-TieredCompilation` (i.e. native methods are not compiled any more).
> > The attached JMH benchmark demonstrates that native method calls became twice as expensive with JDK 17: > > public static native void emptyStaticNativeMethod(); > > @Benchmark > public static void baseline() { > } > > @Benchmark > public static void staticMethodCallingStatic() { > emptyStaticMethod(); > } > > @Benchmark > public static void staticMethodCallingStaticNative() { > emptyStaticNativeMethod(); > } > > @Benchmark > @Fork(jvmArgsAppend = "-XX:-TieredCompilation") > public static void staticMethodCallingStaticNativeNoTiered() { > emptyStaticNativeMethod(); > } > > @Benchmark > @Fork(jvmArgsAppend = "-XX:+PreferInterpreterNativeStubs") > public static void staticMethodCallingStaticNativeIntStub() { > emptyStaticNativeMethod(); > } > > > JDK 11 > ====== > > Benchmark Mode Cnt Score Error Units > NativeCall.baseline avgt 5 0.390 ± 0.016 ns/op > NativeCall.staticMethodCallingStatic avgt 5 1.693 ± 0.053 ns/op > NativeCall.staticMethodCallingStaticNative avgt 5 10.287 ± 0.754 ns/op > NativeCall.staticMethodCallingStaticNativeNoTiered avgt 5 9.966 ± 0.248 ns/op > NativeCall.staticMethodCallingStaticNativeIntStub avgt 5 20.384 ± 0.444 ns/op > > > JDK 17 & 21 > =========== > > Benchmark Mode Cnt Score Error Units > NativeCall.baseline avgt 5 0.390 ± 0.017 ns/op > NativeCall.staticMethodCallingStatic avgt 5 1.852 ± 0.272 ns/op > NativeCall.staticMethodCallingStaticNative avgt 5 10.648 ± 0.661 ns/op > NativeCall.staticMethodCallingStaticNativeNoTiered avgt 5 20.657 ± 1.084 ns/op > NativeCall.staticMethodCallingStaticNativeIntStub avgt 5 22.429 ± 0.991 ns/op > > > The issue can be seen if we run with `-XX:+PrintCompilation -XX:+PrintInlining`. With JDK 11 we get the following output for `-XX:+TieredCompilation`: > > 172 111 b 3 io.simonis.NativeCall::staticMethodCallingStaticNative (4 bytes) > @ 0 io.simonis.NativeCall::emptyStaticNativeMethod (0 bytes) native method > 172 112 n 0 io.simonis.NativeCall::emptyStaticNativeMethod (native...
This pull request has now been integrated. Changeset: f2e5808b Author: Volker Simonis URL: https://git.openjdk.org/jdk/commit/f2e5808b46a3da6920dd56688c877ee0e2795de6 Stats: 115 lines in 3 files changed: 114 ins; 0 del; 1 mod 8329126: No native wrappers generated anymore with -XX:-TieredCompilation after JDK-8251462 Reviewed-by: kvn, iveresov, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/18496 From jkarthikeyan at openjdk.org Sat Mar 30 14:55:30 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Sat, 30 Mar 2024 14:55:30 GMT Subject: RFR: 8329355: Test compiler/c2/irTests/TestIfMinMax.java fails on RISC-V In-Reply-To: <6ray-riEC6nAkcgIDes3YrwdJOhNVxm4NY5RXzYzwaE=.a57c4f75-3ac9-456e-a39e-58f7adcd4cb3@github.com> References: <6ray-riEC6nAkcgIDes3YrwdJOhNVxm4NY5RXzYzwaE=.a57c4f75-3ac9-456e-a39e-58f7adcd4cb3@github.com> Message-ID: On Sat, 30 Mar 2024 08:49:00 GMT, Fei Yang wrote: > Please review this small change fixing an IR matching failure on the linux-riscv platform. > > JDK-8324655 tries to identify min/max patterns in CMoves and transform them into Min and Max nodes. > But architectures like RISC-V don't have support for conditional moves at the ISA level for now. > So we set the ConditionalMoveLimit parameter to 0 for this platform, and conditional moves are emulated > with normal compare and branch instructions instead [1]. This is why the IR matching test added by > JDK-8324655 fails on this platform. A simple way to fix this would be to skip this test for this case. > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/riscv.ad#L9775 Thank you for fixing this! It looks good to me. ------------- Marked as reviewed by jkarthikeyan (Author).
PR Review: https://git.openjdk.org/jdk/pull/18558#pullrequestreview-1969966657 From duke at openjdk.org Sun Mar 31 08:15:12 2024 From: duke at openjdk.org (Joshua Cao) Date: Sun, 31 Mar 2024 08:15:12 GMT Subject: RFR: 8325674: Constant fold across compares [v4] In-Reply-To: References: Message-ID: > For example, `x + 1 < 2` -> `x < 2 - 1` iff we can prove that `x + 1` does not overflow and `2 - 1` does not overflow. We can always fold if it is an `==` or `!=` since overflow will not affect the result of the comparison. > > Consider this more practical example: > > > public void foo(int[] arr) { > for (int i = arr.length - 1; i >= 0; --i) { > blackhole(arr[i]); > } > } > > > C2 emits a loop guard that looks like `arr.length - 1 < 0`. We know `arr.length - 1` does not overflow because `arr.length` is positive. We can fold the comparison into `arr.length < 1`. We have to compute `arr.length - 1` if we enter the loop anyway, but we can avoid the subtraction computation if we never enter the loop. I believe the simplification can also help with stronger integer range analysis in https://bugs.openjdk.org/browse/JDK-8275202. > > Some additional notes: > * there is various overflow checking code across `src/hotspot/share/opto`. I separated some of the functions from convertnode.cpp into `type.hpp`. Maybe the functions belong somewhere else? > * there is a change in Parse::do_if() to repeatedly apply GVN until the test is canonical. We need multiple iterations in the case of `C1 > C2 - X` -> `C2 - X < C1` -> `C2 < X` -> `X > C2`. This fails the assertion that `BoolTest(btest).is_canonical()`. We can avoid this by applying GVN one more time to get `C2 < X`. > * we should not transform loop backedge conditions. For example, if we have `for (i = 0; i < 10; ++i) {}`, the backedge condition is `i + 1 < 10`.
If we transform it into `i < 9`, it messes with CountedLoop's recognition of induction variables and strides. > * this change optimizes some of the equality checks in `TestUnsignedComparison.java` and breaks the IR checks. I removed those tests. Joshua Cao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: - Combine add/sub folding code together. Assertions for invalid opcodes. - Merge branch 'master' into cmpconstantfold - Merge branch 'master' into cmpconstantfold - comments with explanations and style changes - Modify tests to work with -XX:-TieredCompilation - Merge branch 'master' into cmpconstantfold - 8325674: Constant fold across compares ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17853/files - new: https://git.openjdk.org/jdk/pull/17853/files/f1b540eb..fe2c95fc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17853&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17853&range=02-03 Stats: 476048 lines in 5044 files changed: 45238 ins; 99997 del; 330813 mod Patch: https://git.openjdk.org/jdk/pull/17853.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17853/head:pull/17853 PR: https://git.openjdk.org/jdk/pull/17853 From duke at openjdk.org Sun Mar 31 08:15:13 2024 From: duke at openjdk.org (Joshua Cao) Date: Sun, 31 Mar 2024 08:15:13 GMT Subject: RFR: 8325674: Constant fold across compares [v3] In-Reply-To: References: Message-ID: <1ERbdsMnJfv1CF5nPhsf4pFxhGcU7afzxXgqOLBC_eA=.ea0663ee-3f48-4f86-8eb6-9a3a32cac3ab@github.com> On Fri, 8 Mar 2024 18:33:53 GMT, Joshua Cao wrote: >> src/hotspot/share/opto/subnode.cpp line 1586: >> >>> 1584: } >>> 1585: } >>> 1586: } >> >> This looks like heavy code duplication. Can you refactor this? Maybe a helper method? > > I can post a version of this so we can see what it looks like.
I actually did this first, but the code got quite ugly. Pushed changes that combine this. I don't think it's better or worse, just different. Open to suggestions. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17853#discussion_r1545576059 From duke at openjdk.org Sun Mar 31 08:15:13 2024 From: duke at openjdk.org (Joshua Cao) Date: Sun, 31 Mar 2024 08:15:13 GMT Subject: RFR: 8325674: Constant fold across compares [v3] In-Reply-To: References: Message-ID: On Thu, 7 Mar 2024 08:29:10 GMT, Emanuel Peter wrote: >> Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: >> >> comments with explanations and style changes > > src/hotspot/share/opto/type.cpp line 1761: > >> 1759: } >> 1760: return true; >> 1761: } > > Do you maybe want to assert that no other opcode comes in? Or is there a need for non add/sub opcodes to be passed in? Added assertions ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17853#discussion_r1545575785 From duke at openjdk.org Sun Mar 31 15:47:51 2024 From: duke at openjdk.org (ArsenyBochkarev) Date: Sun, 31 Mar 2024 15:47:51 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v10] In-Reply-To: References: Message-ID: <8FMdThon_sV2dJ6P5AnRzvQ8eXqfEumA2sX2TY4Z_GA=.eb31062e-8bcf-45cf-9769-8080df213ff6@github.com> > Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. > > ### Correctness checks > > Tier 1/2 tests are ok.
> > ### Performance results on T-Head board > > #### Results for enabled intrinsic: > > Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --- | ---- | ----- | --- | ---- | --- | ---- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | > > #### Results for disabled intrinsic: > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | ArsenyBochkarev has updated the pull request incrementally with one additional commit since the last revision: Use srliw to clear upper bits for 'lower' cases ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17046/files - new: https://git.openjdk.org/jdk/pull/17046/files/654b25b7..64abc7b4 
Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17046&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17046&range=08-09 Stats: 12 lines in 1 file changed: 3 ins; 5 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/17046.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17046/head:pull/17046 PR: https://git.openjdk.org/jdk/pull/17046 From duke at openjdk.org Sun Mar 31 15:47:52 2024 From: duke at openjdk.org (ArsenyBochkarev) Date: Sun, 31 Mar 2024 15:47:52 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v9] In-Reply-To: References: Message-ID: <2FdwQ54ts6nOj8JCcEgEr2OeXwTyGqdz0HSgcqKJe0M=.15e72a0d-eb39-4fe1-96b2-f5cea1ba85bb@github.com> On Tue, 19 Mar 2024 11:16:42 GMT, ArsenyBochkarev wrote: >> Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. >> >> ### Correctness checks >> >> Tier 1/2 tests are ok. 
>> >> ### Performance results on T-Head board >> >> #### Results for enabled intrinsic: >> >> Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` >> >> | Benchmark | (count) | Mode | Cnt | Score | Error | Units | >> | --- | ---- | ----- | --- | ---- | --- | ---- | >> | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | >> >> #### Results for disabled intrinsic: >> >> | Benchmark | (count) | Mode | Cnt | Score | Error | Units | >> | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | >> | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | > > ArsenyBochkarev has updated the pull request incrementally with one additional commit since the last revision: > > Optimize last 'upper' load in update_word_crc32 More accel for `-XX:+UseZba` case (on StarFive VisionFive2): | Benchmark | (count) | Mode | Cnt | Score | 
Error | Units | | ------------------------------ | ----------- | ---------- | ----- | --------- | -------- | --------- | | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 4415.416 | 2.822 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 2756.321 | 0.769 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 1450.461 | 4.115 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 750.851 | 0.496 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 192.352 | 0.599 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 24.209 | 0.044 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 5.670 | 0.015 | ops/ms | ------------- PR Comment: https://git.openjdk.org/jdk/pull/17046#issuecomment-2028802151
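As a reference for the Adler32 intrinsic thread above, here is a plain-Java sketch of the checksum the stub computes. BASE is the 65521 (0xfff1) constant quoted from stubGenerator_riscv.cpp; the stub's NMAX (0x15B0, i.e. 5552) is the largest block length whose sums cannot overflow 32 bits before a deferred modulo, whereas this scalar sketch simply reduces after every byte. The class name is hypothetical:

```java
import java.util.zip.Adler32;

public class Adler32Sketch {
    private static final int BASE = 65521; // 0xfff1, largest prime below 2^16

    // Adler-32: s1 sums the bytes (starting at 1), s2 sums the running
    // values of s1; the checksum is (s2 << 16) | s1.
    static long adler32(byte[] data) {
        long s1 = 1;
        long s2 = 0;
        for (byte b : data) {
            s1 = (s1 + (b & 0xff)) % BASE;
            s2 = (s2 + s1) % BASE;
        }
        return (s2 << 16) | s1;
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes();
        Adler32 ref = new Adler32();
        ref.update(data, 0, data.length);
        System.out.println(adler32(data) == ref.getValue()); // true
    }
}
```

The intrinsic versions gain their speed by batching up to NMAX bytes between modulo reductions and accumulating several bytes per load, which is also the duplicated accumulation logic the review suggests factoring into helpers.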