From duke at openjdk.org Fri Mar 1 00:27:53 2024
From: duke at openjdk.org (Joshua Cao)
Date: Fri, 1 Mar 2024 00:27:53 GMT
Subject: RFR: 8323220: Reassociate loop invariants involved in Cmps and Add/Subs [v5]
In-Reply-To: 
References: <0mSC33e8Dm1pwOo_xlx48AwfkB1C9ZNIVqD8UdSW07U=.866a7c2a-59cf-4bab-8bda-dcd8a3f337de@github.com>
Message-ID: 

On Thu, 29 Feb 2024 07:26:52 GMT, Emanuel Peter wrote:

> One more concern I just had: do we have tests for the pre-existing Add/Sub reassociations?

Not that I know of. A bunch of reassociation was added in https://github.com/openjdk/jdk/commit/23ed3a9e91ac57295d274fefdf6c0a322b1e87b7, which does not have any tests.

I ran `make CONF=linux-x86_64-server-fastdebug test TEST=all TEST_VM_OPTS=-XX:-TieredCompilation` on my Linux machine. I have 4 failures in `SctpChannel` and 3 failures in `CAInterop.java`, but they also fail on the master branch, so they should not be caused by this patch. Hopefully this adds a little more confidence.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17375#issuecomment-1972204471

From duke at openjdk.org Fri Mar 1 05:38:53 2024
From: duke at openjdk.org (Joshua Cao)
Date: Fri, 1 Mar 2024 05:38:53 GMT
Subject: Integrated: 8324790: ifnode::fold_compares_helper cleanup
In-Reply-To: 
References: 
Message-ID: 

On Fri, 26 Jan 2024 23:31:00 GMT, Joshua Cao wrote:

> I hope my assumptions in `filtered_int_type` are correct here:
>
> * we assert that `if_proj` is an `IfTrue` or `IfFalse`, so it is safe to assume `if_proj->_in` is an `IfNode`
> * the 1'th input of a `BoolNode` is a `CmpNode`
> * the 1'th input of an `IfNode` is **not always a BoolNode**; it can be a constant. We need to leave this check in.
>
> We also remove some of the if-checks in `compare_folds_cleanup` which seem unnecessary.
>
> Passes tier1 locally.

This pull request has now been integrated.
Changeset: 12404a5e Author: Joshua Cao Committer: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/12404a5efb3c45f72f54fda3238c72d5d15a30ee Stats: 64 lines in 1 file changed: 21 ins; 27 del; 16 mod 8324790: ifnode::fold_compares_helper cleanup Reviewed-by: chagedorn, epeter ------------- PR: https://git.openjdk.org/jdk/pull/17601 From epeter at openjdk.org Fri Mar 1 05:43:53 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 05:43:53 GMT Subject: RFR: 8323220: Reassociate loop invariants involved in Cmps and Add/Subs [v5] In-Reply-To: References: <0mSC33e8Dm1pwOo_xlx48AwfkB1C9ZNIVqD8UdSW07U=.866a7c2a-59cf-4bab-8bda-dcd8a3f337de@github.com> Message-ID: On Fri, 1 Mar 2024 00:24:45 GMT, Joshua Cao wrote: >> @caojoshua it looks really good now! >> I'm running our internal testing again, will report back. >> >> One more concern I just had: do we have tests for the pre-existing Add/Sub reassociations? Because you now touched the logic around there we should make sure there are at least correctness tests. IR tests are probably basically impossible because it is the same number of Add/Sub nodes before and after the optimization. >> >> I'd like to have another Reviewer look over this as well, therefore: > >> One more concern I just had: do we have tests for the pre-existing Add/Sub reassociations? > > Not that I know of. A bunch of reassociation was added in https://github.com/openjdk/jdk/commit/23ed3a9e91ac57295d274fefdf6c0a322b1e87b7, which does not have any tests. > > I ran `make CONF=linux-x86_64-server-fastdebug test TEST=all TEST_VM_OPTS=-XX:-TieredCompilation` on my Linux machine. I have 4 failures in `SctpChannel` and 3 failures in `CAInterop.java`, but they also fail on master branch so they should not be caused by this patch. Hopefully this adds a little more confidence. @caojoshua I also ran our internal testing and it looks ok (only unrelated failures). 
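[Editor's note] For readers following along: the pre-existing reassociation being discussed rewrites shapes such as `inv1 + (x + inv2)` into `(inv1 + inv2) + x`, so the invariant part can be hoisted out of the loop. A minimal, hedged sketch of the kind of result-verification test the reviewers are asking for — all names are illustrative, this is not the actual jtreg test, and this is only one of the reassociated shapes:

```java
// Compute the same expression in the "as written" shape and in the
// reassociated/hoisted shape, then check that the results agree.
// Two's-complement int addition is associative, so they must match
// even when intermediate values overflow.
public class ReassociateCheck {
    // Shape the programmer writes: inv1 + (x + inv2), evaluated in the loop.
    static int asWritten(int inv1, int inv2, int n) {
        int sum = 0;
        for (int x = 0; x < n; x++) {
            sum += inv1 + (x + inv2);
        }
        return sum;
    }

    // Shape after reassociation: (inv1 + inv2) is loop invariant and is
    // computed once outside the loop.
    static int reassociated(int inv1, int inv2, int n) {
        int hoisted = inv1 + inv2; // hoisted invariant part
        int sum = 0;
        for (int x = 0; x < n; x++) {
            sum += hoisted + x;
        }
        return sum;
    }

    public static void main(String[] args) {
        for (int inv1 = -3; inv1 <= 3; inv1++) {
            for (int inv2 = -3; inv2 <= 3; inv2++) {
                if (asWritten(inv1, inv2, 100) != reassociated(inv1, inv2, 100)) {
                    throw new AssertionError("mismatch for " + inv1 + ", " + inv2);
                }
            }
        }
        System.out.println("ok");
    }
}
```

An actual test would repeat this comparison per reassociation case (Add, Sub, and the mixed forms) with enough iterations to trigger compilation.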
But of course that is only on tests that we have, and if the other reassociations are not tested, then that helps little ;) > Not that I know of. A bunch of reassociation was added in https://github.com/openjdk/jdk/commit/23ed3a9e91ac57295d274fefdf6c0a322b1e87b7, which does not have any tests. Could you please add a result verification test per case of pre-existing reassociation? Otherwise I'm afraid it is hard to be sure you did not break those cases. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17375#issuecomment-1972548903 From duke at openjdk.org Fri Mar 1 05:51:02 2024 From: duke at openjdk.org (kuaiwei) Date: Fri, 1 Mar 2024 05:51:02 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 Message-ID: Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. ------------- Commit messages: - 8326983: Unused operands reported after JDK-8326135 Changes: https://git.openjdk.org/jdk/pull/18075/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18075&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8326983 Stats: 553 lines in 2 files changed: 0 ins; 553 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18075/head:pull/18075 PR: https://git.openjdk.org/jdk/pull/18075 From jbhateja at openjdk.org Fri Mar 1 06:01:45 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 1 Mar 2024 06:01:45 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> Message-ID: On Tue, 27 Feb 2024 21:13:07 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. 
>>
>> This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm)
>>
>> This PR shows upto 19x speedup on buffer sizes of 1MB.
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   Update description of Poly1305 algo

src/hotspot/cpu/x86/assembler_x86.cpp line 5146:

> 5144:
> 5145: void Assembler::vpmadd52luq(XMMRegister dst, XMMRegister src1, Address src2, int vector_len) {
> 5146:   assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), "");

What if vector length is 128 bit and the target does not support AVX_IFMA? AVX512_IFMA + AVX512_VL should still be sufficient to execute 52-bit MACs.

src/hotspot/cpu/x86/assembler_x86.cpp line 5181:

> 5179:
> 5180: void Assembler::vpmadd52huq(XMMRegister dst, XMMRegister src1, Address src2, int vector_len) {
> 5181:   assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), "");

What if vector length is 128 bit and the target does not support AVX_IFMA? AVX512_IFMA + AVX512_VL should still be sufficient to execute 52-bit MACs.
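[Editor's note] The relaxed support predicate this review comment is asking for — AVX512_IFMA for 512-bit vectors, and for 128/256-bit vectors either AVX_IFMA or the combination of AVX512_IFMA with AVX512_VL — can be sketched as a hedged toy model in plain Java. The boolean flags stand in for HotSpot's `VM_Version::supports_*()` queries; this is not HotSpot code:

```java
// Toy model of the CPU-feature check discussed above.
public class IfmaPredicate {
    static boolean vpmadd52Supported(int vectorBits, boolean avx512ifma,
                                     boolean avx512vl, boolean avxifma) {
        if (vectorBits == 512) {
            // The EVEX 512-bit encoding needs AVX512_IFMA.
            return avx512ifma;
        }
        // 128/256-bit: AVX_IFMA suffices, and so does AVX512_IFMA + AVX512_VL,
        // which the original assert (checking only AVX_IFMA) rejects.
        return avxifma || (avx512ifma && avx512vl);
    }

    public static void main(String[] args) {
        // Machine with AVX512_IFMA + AVX512_VL but no AVX_IFMA:
        // 128-bit 52-bit MACs are still executable.
        System.out.println(vpmadd52Supported(128, true, true, false)); // true
    }
}
```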
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508515255
PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508514777

From jbhateja at openjdk.org Fri Mar 1 06:07:56 2024
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Fri, 1 Mar 2024 06:07:56 GMT
Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10]
In-Reply-To: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com>
References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com>
Message-ID: 

On Tue, 27 Feb 2024 21:13:07 GMT, Srinivas Vamsi Parasa wrote:

>> The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs.
>>
>> This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm)
>>
>> This PR shows upto 19x speedup on buffer sizes of 1MB.
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   Update description of Poly1305 algo

src/hotspot/cpu/x86/assembler_x86.cpp line 5156:

> 5154:
> 5155: void Assembler::vpmadd52luq(XMMRegister dst, XMMRegister src1, XMMRegister src2, int vector_len) {
> 5156:   assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), "");

What if vector length is 128 bit and the target does not support AVX_IFMA? AVX512_IFMA + AVX512_VL should still be sufficient to execute 52-bit MACs. Please add appropriate assertions to explicitly check AVX512VL.

src/hotspot/cpu/x86/assembler_x86.cpp line 5191:

> 5189:
> 5190: void Assembler::vpmadd52huq(XMMRegister dst, XMMRegister src1, XMMRegister src2, int vector_len) {
> 5191:   assert(vector_len == AVX_512bit ?
VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); Same as above. src/hotspot/cpu/x86/assembler_x86.cpp line 9101: > 9099: > 9100: void Assembler::vpunpckhqdq(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) { > 9101: assert(UseAVX > 0, "requires some form of AVX"); Add appropriate AVX512VL assertion. src/hotspot/cpu/x86/assembler_x86.cpp line 9115: > 9113: > 9114: void Assembler::vpunpcklqdq(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) { > 9115: assert(UseAVX > 0, "requires some form of AVX"); Add appropriate AVX512VL assertion ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508516820 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508518721 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508517680 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508517933 From jbhateja at openjdk.org Fri Mar 1 06:11:56 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 1 Mar 2024 06:11:56 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> Message-ID: On Tue, 27 Feb 2024 21:13:07 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. >> >> This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) >> >> This PR shows upto 19x speedup on buffer sizes of 1MB. 
> > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > Update description of Poly1305 algo Changes requested by jbhateja (Reviewer). src/hotspot/cpu/x86/assembler_x86.cpp line 9115: > 9113: > 9114: void Assembler::vpunpcklqdq(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) { > 9115: assert(UseAVX > 0, "requires some form of AVX"); Add appropriate AVX512VL assertion ------------- PR Review: https://git.openjdk.org/jdk/pull/17881#pullrequestreview-1910377861 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508520924 From duke at openjdk.org Fri Mar 1 07:39:16 2024 From: duke at openjdk.org (kuaiwei) Date: Fri, 1 Mar 2024 07:39:16 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: References: Message-ID: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> > Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. 
kuaiwei has updated the pull request incrementally with one additional commit since the last revision: clean for other architecture ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18075/files - new: https://git.openjdk.org/jdk/pull/18075/files/3efe6bb8..faa8f949 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18075&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18075&range=00-01 Stats: 403 lines in 7 files changed: 1 ins; 401 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18075/head:pull/18075 PR: https://git.openjdk.org/jdk/pull/18075 From jbhateja at openjdk.org Fri Mar 1 08:23:54 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 1 Mar 2024 08:23:54 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> Message-ID: On Tue, 27 Feb 2024 21:13:07 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. >> >> This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) >> >> This PR shows upto 19x speedup on buffer sizes of 1MB. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > Update description of Poly1305 algo Hi @vamsi-parasa , apart from above assertion check modifications, patch looks good to me. src/hotspot/cpu/x86/assembler_x86.cpp line 5148: > 5146: assert(vector_len == AVX_512bit ? 
VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); > 5147: InstructionMark im(this); > 5148: InstructionAttr attributes(vector_len, /* rex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true); uses_vl should be false here. src/hotspot/cpu/x86/assembler_x86.cpp line 5157: > 5155: void Assembler::vpmadd52luq(XMMRegister dst, XMMRegister src1, XMMRegister src2, int vector_len) { > 5156: assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); > 5157: InstructionAttr attributes(vector_len, /* rex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true); uses_vl should be false. src/hotspot/cpu/x86/assembler_x86.cpp line 5183: > 5181: assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); > 5182: InstructionMark im(this); > 5183: InstructionAttr attributes(vector_len, /* rex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true); uses_vl should be false. ------------- Changes requested by jbhateja (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17881#pullrequestreview-1910555763 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508637115 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508637945 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1508638146 From epeter at openjdk.org Fri Mar 1 12:43:55 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 12:43:55 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> Message-ID: On Tue, 27 Feb 2024 18:23:41 GMT, Jasmine Karthikeyan wrote: >> @jaskarth >>> I've designed this benchmark >> >> Nice. Can you also post the generated assembly for Baseline/Patch? 
>> I'm just worried that there is some method call, or something else that does not get cleanly inlined and could mess with the benchmark. > > @eme64 Sure, here is the assembly for the baseline: https://gist.github.com/jaskarth/1fe6f00a5b37fe3efb0dd6a2d24840e0 > And after: https://gist.github.com/jaskarth/99c56e2f081f996987b96d7e866aca6c > > I must have missed this originally when evaluating the benchmark, but looking at the assembly it seems like the baseline JDK creates a `CMove` for that ternary already. I made a quick patch to disable where `PhaseIdealLoop::conditional_move` is called, and the performance still stays the same on the benchmark. I've also attached that assembly if it's of interest: https://gist.github.com/jaskarth/7b12b688f82a3b8e854785f1827b0c20 @jaskarth Thanks for trying such a benchmark! I have a few ideas and questions now. 1. I would like to see a benchmark where you get a regression with your patch if you removed the `PROB_UNLIKELY_MAG(2);` check, or at least make it much smaller. I would like to see if there is some breaking-point where branch prediction is actually faster. 2. You seem to have discivered that your last example was already converted to CMove. What cases does your code cover that is not already covered by the `PhaseIdealLoop::conditional_move` logic? 3. I think you want some code on the `a` path that does not require inlining, just some arithmetic. The longer the chain the better, as it creates large latency. But then you also want something after the if/min/max which has a high latency, so that branch speculation can actually make progress on something, whereas max/min would have to wait until it is finished computing. I actually have a r**egression case for the current CMove logic**, but it **would apply to your logic in some way I think as well**. See my `testCostDifference` below. Clean master: `IfMinMax.testCostDifference avgt 15 889118.284 ? 
10638.421 ns/op`
When I disable `PhaseIdealLoop::conditional_move`, without your patch:
`IfMinMax.testCostDifference avgt 15 710629.583 ± 3232.237 ns/op`
Your patch, with `PhaseIdealLoop::conditional_move` disabled:
`IfMinMax.testCostDifference avgt 15 886518.663 ± 1801.308 ns/op`

I think that the CMove logic kicks in for most loops, though maybe not all cases? Would be interesting to know which of your cases were already done by CMove, and which not. And why.

So I suspect you could now take my benchmark, and convert it into non-loop code, and then CMove would not kick in, but your conversion to Max/Min would apply instead. And then you could observe the same regression.

Let me know what you think. Not sure if this regression is important enough, but we need to consider what to do about your patch, as well as the CMove logic that already exists.

    @Benchmark
    public void testCostDifference(Blackhole blackhole, BenchState state) {
        //int hits = 0;
        int x = 0xf0f0f0f0; // maybe instead use a random source that is different with every method call?
        for (int i = 0; i < 10_000; i++) {
            int a = (x ^ 0xffffffff) & 0x07ffffff; // cheap (note: mask affects probability)
            int h = x;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            int b = (h & 0x7fffffff); // expensive hashing sequence (note: mask affects probability)
            int m = (a < b) ? a : b; // The Min/Max
            //hits += (a > b) ? 1 : 0;
            //System.out.println("i: " + i + " hits: " + hits + " m: " + m + " a: " + a + " b: " + b);
            // Note: the hit probability can be adjusted by changing the masks
            // adding or removing the most significant bit has a change of
            // about a factor of 2.
            // The hashing sequences are there to be expensive, and to randomize the values a bit.
            h = m;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            h = (h << 6) + (h << 16) + (h >>> 18) - h;
            x = h; // another expensive hashing sequence
        }
        //System.out.println(10_000 / hits);
        blackhole.consume(x);
    }

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1973121951

From jpai at openjdk.org Fri Mar 1 12:56:11 2024
From: jpai at openjdk.org (Jaikiran Pai)
Date: Fri, 1 Mar 2024 12:56:11 GMT
Subject: RFR: 8327108: compiler.lib.ir_framework.shared.TestFrameworkSocket should listen on loopback address only
Message-ID: 

Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327108?

As noted in the JBS issue, before this proposed change, the internal test framework code in `compiler.lib.ir_framework.shared.TestFrameworkSocket` was binding a `java.net.ServerSocket` to "any address". This can lead to interference from other hosts on the network when the tests are run. The change here proposes to bind this `ServerSocket` to the loopback address and reduce the chances of such interference.

Originally, the interference issues were noticed in CI when `tier3` was run.
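[Editor's note] The fix described here — binding the server socket explicitly to the loopback address instead of the wildcard ("any") address — can be sketched as follows. This is a minimal standalone example, not the actual `TestFrameworkSocket` code:

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class LoopbackBind {
    // Before: new ServerSocket(0) binds to the wildcard address, so any host
    // on the network can connect. After: create an unbound socket and bind it
    // to the loopback address explicitly.
    static ServerSocket openLoopbackServer() throws IOException {
        ServerSocket server = new ServerSocket();
        // Port 0 lets the OS pick a free ephemeral port, as test harnesses do.
        server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
        return server;
    }

    public static void main(String[] args) {
        try (ServerSocket server = openLoopbackServer()) {
            if (!server.getInetAddress().isLoopbackAddress()) {
                throw new AssertionError("not bound to loopback");
            }
            System.out.println("ok");
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```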
With the change proposed in this PR, I've run `tier1`, `tier2` and `tier3` in our CI environment and they all passed. ------------- Commit messages: - 8327108: compiler.lib.ir_framework.shared.TestFrameworkSocket should listen on loopback address only Changes: https://git.openjdk.org/jdk/pull/18078/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18078&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327108 Stats: 7 lines in 1 file changed: 3 ins; 1 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18078.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18078/head:pull/18078 PR: https://git.openjdk.org/jdk/pull/18078 From epeter at openjdk.org Fri Mar 1 13:03:55 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 13:03:55 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> Message-ID: On Tue, 27 Feb 2024 18:23:41 GMT, Jasmine Karthikeyan wrote: >> @jaskarth >>> I've designed this benchmark >> >> Nice. Can you also post the generated assembly for Baseline/Patch? >> I'm just worried that there is some method call, or something else that does not get cleanly inlined and could mess with the benchmark. > > @eme64 Sure, here is the assembly for the baseline: https://gist.github.com/jaskarth/1fe6f00a5b37fe3efb0dd6a2d24840e0 > And after: https://gist.github.com/jaskarth/99c56e2f081f996987b96d7e866aca6c > > I must have missed this originally when evaluating the benchmark, but looking at the assembly it seems like the baseline JDK creates a `CMove` for that ternary already. I made a quick patch to disable where `PhaseIdealLoop::conditional_move` is called, and the performance still stays the same on the benchmark. 
I've also attached that assembly if it's of interest: https://gist.github.com/jaskarth/7b12b688f82a3b8e854785f1827b0c20

@jaskarth The thing with Min/Max-style if-statements is that both the if and else branches are actually empty, since both values are computed before the if. That is why our `PhaseIdealLoop::conditional_move` will always say that it is profitable: it thinks there is zero cost in the if/else branches. So this kind of cost-modeling based on the if/else blocks is really insufficient. Rather, you would have to know how much cost is behind the two inputs to the cmp. As we see in my example, the cost of `b` can basically be hidden by the branch predictor (at least a part of it). But a CMove/Min/Max has to pay the full cost of `b` before it can continue afterwards.

@jaskarth My example is extreme. Feel free to play with my example, and make the `b` part and the "post" part smaller. Maybe there is a regression case that is less extreme. If we could show that only the really extreme examples lead to regressions, then maybe we are willing to bite the bullet on those regressions for the benefit of speedups in other cases.

@jaskarth One more general issue: So far you have only shown that your optimization leads to speedups in conjunction with auto-vectorization. Do you have any examples which get speedups without auto-vectorization? The thing is: I do hope to do if-conversion in auto-vectorization. Hence, it would be nice to know that your optimization has benefits in cases where if-conversion does not apply.
------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1973149078 PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1973153118 PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1973156599 From epeter at openjdk.org Fri Mar 1 13:06:56 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 13:06:56 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> Message-ID: <-VecOge93qSNd6pqheFmyoBhkI0_Kkf8A2HN0aiZbqU=.4c6f56f8-0136-44ee-a880-000e826a0642@github.com> On Tue, 27 Feb 2024 18:23:41 GMT, Jasmine Karthikeyan wrote: >> @jaskarth >>> I've designed this benchmark >> >> Nice. Can you also post the generated assembly for Baseline/Patch? >> I'm just worried that there is some method call, or something else that does not get cleanly inlined and could mess with the benchmark. > > @eme64 Sure, here is the assembly for the baseline: https://gist.github.com/jaskarth/1fe6f00a5b37fe3efb0dd6a2d24840e0 > And after: https://gist.github.com/jaskarth/99c56e2f081f996987b96d7e866aca6c > > I must have missed this originally when evaluating the benchmark, but looking at the assembly it seems like the baseline JDK creates a `CMove` for that ternary already. I made a quick patch to disable where `PhaseIdealLoop::conditional_move` is called, and the performance still stays the same on the benchmark. I've also attached that assembly if it's of interest: https://gist.github.com/jaskarth/7b12b688f82a3b8e854785f1827b0c20 @jaskarth now there are some platforms that have horrible branch predictors. On those the cost model would probably favor CMove and Min/Max in more cases. 
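[Editor's note] The if-statement shape this thread keeps referring to — both inputs computed before the branch, with the if/else blocks empty except for selecting one of the two values — looks as follows. The method is illustrative; the recognition itself happens inside C2, not in Java source:

```java
public class MinMaxShape {
    // The branchy shape: a and b are already computed; the if/else blocks do
    // nothing but pick one of them, which is why the block-based cost model
    // sees "zero cost" branches.
    static int minViaIf(int a, int b) {
        int m;
        if (a < b) {
            m = a;
        } else {
            m = b;
        }
        return m;
    }

    public static void main(String[] args) {
        // Semantically identical to Math.min, which C2 already models as MinI.
        System.out.println(minViaIf(3, 7) == Math.min(3, 7)); // true
    }
}
```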
-------------

PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1973162293

From chagedorn at openjdk.org Fri Mar 1 13:08:52 2024
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Fri, 1 Mar 2024 13:08:52 GMT
Subject: RFR: 8327108: compiler.lib.ir_framework.shared.TestFrameworkSocket should listen on loopback address only
In-Reply-To: 
References: 
Message-ID: 

On Fri, 1 Mar 2024 12:50:58 GMT, Jaikiran Pai wrote:

> Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327108?
>
> As noted in the JBS issue, before this proposed change, the internal test framework code in `compiler.lib.ir_framework.shared.TestFrameworkSocket` was binding a `java.net.ServerSocket` to "any address". This can lead to interference from other hosts on the network, when the tests are run. The change here proposes to bind this `ServerSocket` to loopback address and reduce the chances of such interference.
>
> Originally, the interference issues were noticed in CI when `tier3` was run. With the change proposed in this PR, I've run `tier1`, `tier2` and `tier3` in our CI environment and they all passed.

That looks reasonable, thanks for fixing this!

-------------

Marked as reviewed by chagedorn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/18078#pullrequestreview-1911090301

From jpai at openjdk.org Fri Mar 1 13:12:01 2024
From: jpai at openjdk.org (Jaikiran Pai)
Date: Fri, 1 Mar 2024 13:12:01 GMT
Subject: RFR: 8327105: compiler.compilercontrol.share.scenario.Executor should listen on loopback address only
Message-ID: 

Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327105?

The commit here changes the internal test class `compiler.compilercontrol.share.scenario.Executor` to bind to a loopback address to prevent other hosts on the network from unexpectedly communicating over the `ServerSocket`.
The original interference was noticed in some `tier7` tests which use this `Executor` class. With the change proposed in this PR, `tier1`, `tier2`, `tier3` and `tier7`, `tier8` have been run and that issue hasn't been noticed in this class anymore. ------------- Commit messages: - 8327105: compiler.compilercontrol.share.scenario.Executor should listen on loopback address only Changes: https://git.openjdk.org/jdk/pull/18079/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18079&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327105 Stats: 7 lines in 2 files changed: 3 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/18079.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18079/head:pull/18079 PR: https://git.openjdk.org/jdk/pull/18079 From epeter at openjdk.org Fri Mar 1 13:12:45 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 13:12:45 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> Message-ID: On Tue, 27 Feb 2024 18:23:41 GMT, Jasmine Karthikeyan wrote: >> @jaskarth >>> I've designed this benchmark >> >> Nice. Can you also post the generated assembly for Baseline/Patch? >> I'm just worried that there is some method call, or something else that does not get cleanly inlined and could mess with the benchmark. > > @eme64 Sure, here is the assembly for the baseline: https://gist.github.com/jaskarth/1fe6f00a5b37fe3efb0dd6a2d24840e0 > And after: https://gist.github.com/jaskarth/99c56e2f081f996987b96d7e866aca6c > > I must have missed this originally when evaluating the benchmark, but looking at the assembly it seems like the baseline JDK creates a `CMove` for that ternary already. I made a quick patch to disable where `PhaseIdealLoop::conditional_move` is called, and the performance still stays the same on the benchmark. 
I've also attached that assembly if it's of interest: https://gist.github.com/jaskarth/7b12b688f82a3b8e854785f1827b0c20

@jaskarth When I ran `make test TEST="micro:IfMinMax" CONF=linux-x64 MICRO="OPTIONS=-prof perfasm"` and checked the generated assembly, I did not find any vector instructions. Could it be that `SIZE=300` is too small? I generally use vector sizes in the range of `10_000`, just to make sure it vectorizes. Maybe it is because I have an AVX512 machine with 64-byte registers, compared to 32-byte registers for AVX2? Not sure.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1973170971

From chagedorn at openjdk.org Fri Mar 1 13:33:59 2024
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Fri, 1 Mar 2024 13:33:59 GMT
Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class
Message-ID: 

In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed.

#### Redo refactoring of `create_bool_from_template_assertion_predicate()`

On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111).

#### Share data graph cloning code - start from existing code

This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`.
`clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: 1. Collect data nodes to clone by using a node filter 2. Clone the collected nodes (their data and control inputs still point to the old nodes) 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. #### Shared data graph cloning class Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] `DataNodeGraph` can then later be reused in JDK-8327110 and JDK-8327111 to refactor `create_bool_from_template_assertion_predicate()`. Thanks to @eme64 for the comments in https://github.com/openjdk/jdk/pull/16877 and the joint effort to find a reproducer of the existing bug which was the main motivation to redo the refactoring. 
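To illustrate the two shared steps (clone first, then rewire via the old->new mapping), here is a minimal sketch in plain standard C++. The `SimpleNode` type and the function name are made up for illustration; this is not the HotSpot `Node`/`DataNodeGraph` code:

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a data node: an id plus input edges.
struct SimpleNode {
  int id;
  std::vector<SimpleNode*> inputs;
};

// Step 1: clone every collected node (the clones still point at the old inputs).
// Step 2: rewire each cloned input that was itself cloned, via the old->new map.
// Inputs outside the collected set are deliberately left untouched.
std::unordered_map<SimpleNode*, SimpleNode*>
clone_and_rewire(const std::vector<SimpleNode*>& collected) {
  std::unordered_map<SimpleNode*, SimpleNode*> orig_to_new;
  for (SimpleNode* n : collected) {
    orig_to_new[n] = new SimpleNode(*n);  // shallow copy, inputs not yet fixed
  }
  for (auto& entry : orig_to_new) {
    SimpleNode* clone = entry.second;
    for (SimpleNode*& in : clone->inputs) {
      auto it = orig_to_new.find(in);
      if (it != orig_to_new.end()) {
        in = it->second;  // redirect to the cloned input
      }
    }
  }
  return orig_to_new;
}
```

The control-input rewiring from the old to the new uncommon projection is the part that stays specific to `clone_nodes_with_same_ctrl()`, which is why it is not in this shared sketch.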
Thanks, Christian ------------- Commit messages: - 8327109: Refactor data graph cloning used for in create_new_if_for_predicate() into separate class Changes: https://git.openjdk.org/jdk/pull/18080/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327109 Stats: 135 lines in 3 files changed: 71 ins; 32 del; 32 mod Patch: https://git.openjdk.org/jdk/pull/18080.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18080/head:pull/18080 PR: https://git.openjdk.org/jdk/pull/18080 From chagedorn at openjdk.org Fri Mar 1 13:34:00 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 1 Mar 2024 13:34:00 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 13:27:38 GMT, Christian Hagedorn wrote: > In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. > > #### Redo refactoring of `create_bool_from_template_assertion_predicate()` > On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). > > #### Share data graph cloning code - start from existing code > This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. 
`clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: > > 1. Collect data nodes to clone by using a node filter > 2. Clone the collected nodes (their data and control inputs still point to the old nodes) > 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. > > #### Shared data graph cloning class > Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: > > 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] > 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] > > `... src/hotspot/share/opto/loopPredicate.cpp line 220: > 218: void PhaseIdealLoop::set_ctrl_of_nodes_with_same_ctrl(Node* start_node, ProjNode* old_uncommon_proj, > 219: Node* new_uncommon_proj) { > 220: ResourceMark rm; Added `ResourceMark`s which I think is now safe after [JDK-8325672](https://bugs.openjdk.org/browse/JDK-8325672). 
src/hotspot/share/opto/loopPredicate.cpp line 250: > 248: DEBUG_ONLY(uint last_idx = C->unique();) > 249: Unique_Node_List nodes_with_same_ctrl = find_nodes_with_same_ctrl(node, old_ctrl); > 250: Dict old_new_mapping = clone_nodes(nodes_with_same_ctrl); // Cloned but not rewired, yet Replaced `Dict` with `ResizeableResourceHashtable` which I think is preferable to use. src/hotspot/share/opto/loopnode.hpp line 1353: > 1351: void fix_cloned_data_node_controls( > 1352: const ProjNode* old_uncommon_proj, Node* new_uncommon_proj, > 1353: const ResizeableResourceHashtable& orig_to_new); Mostly some renaming and adding `const`. src/hotspot/share/opto/loopnode.hpp line 1899: > 1897: _data_nodes(data_nodes), > 1898: // Use 107 as best guess which is the first resize value in ResizeableResourceHashtable::large_table_sizes. > 1899: _orig_to_new(107, MaxNodeLimit) I'm not sure if this is the right default value - was just a best guess. We usually only have a small number of data nodes to copy. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509020430 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509011640 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509016113 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509014377 From chagedorn at openjdk.org Fri Mar 1 13:35:52 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 1 Mar 2024 13:35:52 GMT Subject: RFR: 8327105: compiler.compilercontrol.share.scenario.Executor should listen on loopback address only In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 13:06:18 GMT, Jaikiran Pai wrote: > Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327105? 
> > The commit here changes the internal test class `compiler.compilercontrol.share.scenario.Executor` to bind to a loopback address to prevent other hosts on the network from unexpectedly communicating on the `ServerSocket`. > > The original interference was noticed in some `tier7` tests which use this `Executor` class. With the change proposed in this PR, `tier1`, `tier2`, `tier3` and `tier7`, `tier8` have been run and that issue hasn't been noticed in this class anymore. That looks reasonable. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18079#pullrequestreview-1911194233 From epeter at openjdk.org Fri Mar 1 14:14:56 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 14:14:56 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 13:47:40 GMT, Emanuel Peter wrote: >> In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. >> >> #### Redo refactoring of `create_bool_from_template_assertion_predicate()` >> On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111).
>> >> #### Share data graph cloning code - start from existing code >> This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: >> >> 1. Collect data nodes to clone by using a node filter >> 2. Clone the collected nodes (their data and control inputs still point to the old nodes) >> 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. >> >> #### Shared data graph cloning class >> Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: >> >> 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] >> 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to ... > > src/hotspot/share/opto/loopPredicate.cpp line 254: > >> 252: const Unique_Node_List nodes_with_same_ctrl = find_nodes_with_same_ctrl(start_node, old_uncommon_proj); >> 253: DataNodeGraph data_node_graph(nodes_with_same_ctrl, this); >> 254: auto& orig_to_new = data_node_graph.clone(new_uncommon_proj); > > This was a bit confusing. 
At first I thought you were cloning the `data_node_graph`, since the `auto` did not tell me that here we are getting a hash-table back. > I wonder if this cloning should be done in the constructor of `DataNodeGraph`. The beauty of packing it into the constructor is that you have fewer lines here. And that is probably beneficial if you are going to use the class elsewhere -> less code duplication. > src/hotspot/share/opto/loopnode.hpp line 1889: > >> 1887: // 1. Clone the data nodes >> 1888: // 2. Fix the cloned data inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. >> 1889: class DataNodeGraph : public StackObj { > > You could have a typedef for `ResizeableResourceHashtable`. Then you don't need to use `auto` for it elsewhere, and it is clear what it is. > Suggestion: `OrigToNewHashtable`. The name could mention that we are cloning. And maybe you could do the work in the constructor, and just have accessors for the finished products, such as `_orig_to_new`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509063868 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509048691 From epeter at openjdk.org Fri Mar 1 14:14:56 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 14:14:56 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 13:56:52 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/loopnode.hpp line 1889: >> >>> 1887: // 1. Clone the data nodes >>> 1888: // 2. Fix the cloned data inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. >>> 1889: class DataNodeGraph : public StackObj { >> >> You could have a typedef for `ResizeableResourceHashtable`. Then you don't need to use `auto` for it elsewhere, and it is clear what it is. >> Suggestion: `OrigToNewHashtable`.
> > The name could mention that we are cloning. And maybe you could do the work in the constructor, and just have accessors for the finished products, such as `_orig_to_new`. Suggestion for better name `CloneDataNodeGraph`. Do you assert that only data nodes are cloned, and no CFG nodes? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509050141 From epeter at openjdk.org Fri Mar 1 14:14:55 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 14:14:55 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 13:27:38 GMT, Christian Hagedorn wrote: > In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. > > #### Redo refactoring of `create_bool_from_template_assertion_predicate()` > On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). > > #### Share data graph cloning code - start from existing code > This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: > > 1. Collect data nodes to clone by using a node filter > 2. 
Clone the collected nodes (their data and control inputs still point to the old nodes) > 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. > > #### Shared data graph cloning class > Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: > > 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] > 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] > > `... Looks like a nice refactoring! I left a few comments and questions :) src/hotspot/share/opto/loopPredicate.cpp line 254: > 252: const Unique_Node_List nodes_with_same_ctrl = find_nodes_with_same_ctrl(start_node, old_uncommon_proj); > 253: DataNodeGraph data_node_graph(nodes_with_same_ctrl, this); > 254: auto& orig_to_new = data_node_graph.clone(new_uncommon_proj); This was a bit confusing. At first I thought you are cloning the `data_node_graph`, since the `auto` did not tell me that here we are getting a hash-table back. I wonder if this cloning should be done in the constructor of `DataNodeGraph`. 
src/hotspot/share/opto/loopPredicate.cpp line 255: > 253: DataNodeGraph data_node_graph(nodes_with_same_ctrl, this); > 254: auto& orig_to_new = data_node_graph.clone(new_uncommon_proj); > 255: fix_cloned_data_node_controls(old_uncommon_proj, new_uncommon_proj, orig_to_new); And is there a reason why `fix_cloned_data_node_controls` is not part of the `DataNodeGraph` class? Is there any use of the class where we don't have to call `fix_cloned_data_node_controls`? src/hotspot/share/opto/loopPredicate.cpp line 256: > 254: auto& orig_to_new = data_node_graph.clone(new_uncommon_proj); > 255: fix_cloned_data_node_controls(old_uncommon_proj, new_uncommon_proj, orig_to_new); > 256: Node** cloned_node_ptr = orig_to_new.get(start_node); Boah, this `**` is a bit nasty. Would have been nicer if there was a reference pass instead, which checks already that the element exists. src/hotspot/share/opto/loopPredicate.cpp line 265: > 263: void PhaseIdealLoop::fix_cloned_data_node_controls( > 264: const ProjNode* old_uncommon_proj, Node* new_uncommon_proj, > 265: const ResizeableResourceHashtable& orig_to_new) { Suggestion: const ResizeableResourceHashtable& orig_to_new) { This might also help with understanding the indentation. But this is a taste question for sure. src/hotspot/share/opto/loopPredicate.cpp line 271: > 269: set_ctrl(clone, new_uncommon_proj); > 270: } > 271: }); Indentation is suboptimal here. I found it difficult to read. Style guide: Indentation for multi-line lambda: c.do_entries([&] (const X& x) { do_something(x, a); do_something1(x, b); do_something2(x, c); }); src/hotspot/share/opto/loopPredicate.cpp line 291: > 289: for (uint i = 1; i < next->req(); i++) { > 290: Node* in = next->in(i); > 291: if (!in->is_Phi()) { What happened with the `is_Phi`? Is it not needed anymore? src/hotspot/share/opto/loopnode.hpp line 1889: > 1887: // 1. Clone the data nodes > 1888: // 2. 
Fix the cloned data inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. > 1889: class DataNodeGraph : public StackObj { You could have a typedef for `ResizeableResourceHashtable`. Then you don't need to use `auto` for it elsewhere, and it is clear what it is. Suggestion: `OrigToNewHashtable`. src/hotspot/share/opto/loopnode.hpp line 1921: > 1919: rewire_clones_to_cloned_inputs(); > 1920: return _orig_to_new; > 1921: } Currently, it looks like one could call `clone` multiple times. But I think that would be wrong, right? That is why I'd put all the active logic in the constructor, and only the passive stuff is publicly accessible, with `const` to indicate that these don't have any effect. src/hotspot/share/opto/loopopts.cpp line 4519: > 4517: _orig_to_new.iterate_all([&](Node* node, Node* clone) { > 4518: for (uint i = 1; i < node->req(); i++) { > 4519: Node** cloned_input = _orig_to_new.get(node->in(i)); You don't need to check for `is_Phi` on `node->in(i)` anymore? ------------- Changes requested by epeter (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/18080#pullrequestreview-1911220168 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509038222 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509065385 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509040263 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509045154 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509044654 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509060128 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509047305 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509057459 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509060906 From epeter at openjdk.org Fri Mar 1 14:14:56 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 1 Mar 2024 14:14:56 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: <7kG-lqyCbXktVszrDRPmc2SkQYXxnqcG9HmCmR5YSCQ=.1bd20dcb-5285-4746-b579-502029c0d9bd@github.com> On Fri, 1 Mar 2024 13:58:09 GMT, Emanuel Peter wrote: >> The name could mention that we are cloning. And maybe you could do the work in the constructor, and just have accessors for the finished products, such as `_orig_to_new`. > > Suggestion for better name `CloneDataNodeGraph`. Do you assert that only data nodes are cloned, and no CFG nodes? Yes, you do verify it, great! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1509050744 From sviswanathan at openjdk.org Fri Mar 1 17:04:55 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 1 Mar 2024 17:04:55 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> Message-ID: On Fri, 1 Mar 2024 08:15:50 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> Update description of Poly1305 algo > > src/hotspot/cpu/x86/assembler_x86.cpp line 5148: > >> 5146: assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); >> 5147: InstructionMark im(this); >> 5148: InstructionAttr attributes(vector_len, /* rex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true); > > uses_vl should be false here. > > BTW, this assertion looks very fuzzy, you are checking for two target features in one instruction, apparently, instruction is meant to use AVX512_IFMA only for 512 bit vector length, and for narrower vectors its needs AVX_IFMA. > > Lets either keep this strictly for AVX_IFMA for AVX512_IFMA we already have evpmadd52[l/h]uq, if you truly want to make this generic one then split the assertion > > `assert ( (avx_ifma && vector_len <= 256) || (avx512_ifma && (vector_len == 512 || VM_Version::support_vl())); > ` > > And then you may pass uses_vl at true. It would be good to make this instruction generic. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1509271081 From kvn at openjdk.org Fri Mar 1 17:25:42 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 1 Mar 2024 17:25:42 GMT Subject: RFR: 8327105: compiler.compilercontrol.share.scenario.Executor should listen on loopback address only In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 13:06:18 GMT, Jaikiran Pai wrote: > Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327105? > > The commit here changes the internal test class `compiler.compilercontrol.share.scenario.Executor` to bind to a loopback address to prevent other hosts on the network to unexpected communicate on the `ServerSocket`. > > The original interference was noticed in some `tier7` tests which use this `Executor` class. With the change proposed in this PR, `tier1`, `tier2`, `tier3` and `tier7`, `tier8` have been run and that issue hasn't been noticed in this class anymore. Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18079#pullrequestreview-1911666478 From kvn at openjdk.org Fri Mar 1 17:26:52 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 1 Mar 2024 17:26:52 GMT Subject: RFR: 8327108: compiler.lib.ir_framework.shared.TestFrameworkSocket should listen on loopback address only In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 12:50:58 GMT, Jaikiran Pai wrote: > Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327108? > > As noted in the JBS issue, before this proposed change, the internal test framework code in `compiler.lib.ir_framework.shared.TestFrameworkSocket` was binding a `java.net.ServerSocket` to "any address". This can lead to interference from other hosts on the network, when the tests are run. 
The change here proposes to bind this `ServerSocket` to loopback address and reduce the chances of such interference. > > Originally, the interference issues were noticed in CI when `tier3` was run. With the change proposed in this PR, I've run `tier1`, `tier2` and `tier3` in our CI environment and they all passed. Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18078#pullrequestreview-1911667546 From gdub at openjdk.org Fri Mar 1 17:54:01 2024 From: gdub at openjdk.org (Gilles Duboscq) Date: Fri, 1 Mar 2024 17:54:01 GMT Subject: RFR: 8326692: JVMCI Local.endBci is off-by-one Message-ID: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> In class files, in the local variable table, local variables have a start BCI and a length. The local variable has a value from BCI (inclusive) until BCI + length (exclusive). On the other end, JVMCI stores that information in `Local` objects with a start BCI and an end BCI (inclusive). Currently the parser just uses BCI+length to compute the end BCI, leading to an off-by-one error. A simple test checking that the start and end BCIs are within the method's bytecode is added. It fails without the fix. 
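To make the off-by-one concrete, here is a tiny sketch (hypothetical types, not the actual JVMCI parser code): for a local-variable-table entry with `start_bci` and `length`, the inclusive end BCI is `start_bci + length - 1`:

```cpp
// Hypothetical mirror of a class-file LocalVariableTable entry: the local
// is live over the half-open BCI range [start_bci, start_bci + length).
struct LocalVariableTableEntry {
  int start_bci;
  int length;
};

// JVMCI's Local wants an inclusive end BCI, so the parser must subtract one;
// using start_bci + length directly is the off-by-one fixed here.
int inclusive_end_bci(LocalVariableTableEntry e) {
  return e.start_bci + e.length - 1;
}
```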
------------- Commit messages: - Fix JVMCI Local endBCI off-by-one error - Add javadoc and minimal test for Local.getStart/EndBCI Changes: https://git.openjdk.org/jdk/pull/18087/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18087&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8326692 Stats: 31 lines in 3 files changed: 27 ins; 1 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18087.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18087/head:pull/18087 PR: https://git.openjdk.org/jdk/pull/18087 From kvn at openjdk.org Fri Mar 1 18:41:52 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 1 Mar 2024 18:41:52 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: <02i_iMQdWewT8LDPxjnCcJ4EcojEMX763TVw8xGCo5I=.95697bd7-162d-4605-a4f0-b7689ddcbfa4@github.com> On Fri, 1 Mar 2024 07:39:16 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > clean for other architecture For aarch64, do we also need to change the *.m4 files? Anything in the aarch64_vector* files? I will run our testing with the current patch.
------------- PR Review: https://git.openjdk.org/jdk/pull/18075#pullrequestreview-1911800568 From dnsimon at openjdk.org Fri Mar 1 18:54:53 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 1 Mar 2024 18:54:53 GMT Subject: RFR: 8326692: JVMCI Local.endBci is off-by-one In-Reply-To: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> References: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> Message-ID: On Fri, 1 Mar 2024 17:48:51 GMT, Gilles Duboscq wrote: > In class files, in the local variable table, local variables have a start BCI and a length. The local variable has a value from BCI (inclusive) until BCI + length (exclusive). > On the other end, JVMCI stores that information in `Local` objects with a start BCI and an end BCI (inclusive). > Currently the parser just uses BCI+length to compute the end BCI, leading to an off-by-one error. > > A simple test checking that the start and end BCIs are within the method's bytecode is added. It fails without the fix. Thanks for fixing this Gilles. ------------- Marked as reviewed by dnsimon (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18087#pullrequestreview-1911818421 From never at openjdk.org Fri Mar 1 19:02:42 2024 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 1 Mar 2024 19:02:42 GMT Subject: RFR: 8326692: JVMCI Local.endBci is off-by-one In-Reply-To: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> References: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> Message-ID: On Fri, 1 Mar 2024 17:48:51 GMT, Gilles Duboscq wrote: > In class files, in the local variable table, local variables have a start BCI and a length. The local variable has a value from BCI (inclusive) until BCI + length (exclusive). > On the other end, JVMCI stores that information in `Local` objects with a start BCI and an end BCI (inclusive). 
> Currently the parser just uses BCI+length to compute the end BCI, leading to an off-by-one error. > > A simple test checking that the start and end BCIs are within the method's bytecode is added. It fails without the fix. Marked as reviewed by never (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/18087#pullrequestreview-1911829737 From duke at openjdk.org Fri Mar 1 19:17:12 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Fri, 1 Mar 2024 19:17:12 GMT Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs Message-ID: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. Below is the performance data on an Intel Tiger Lake machine. Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup -- | -- | -- | -- MathBench.ceilDouble | 547979 | 2170198 | 3.96 MathBench.floorDouble | 547979 | 2167459 | 3.96 MathBench.rintDouble | 547962 | 2130499 | 3.89 ------------- Commit messages: - 8327147: optimized implementation of round operation for x86_64 CPUs Changes: https://git.openjdk.org/jdk/pull/18089/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18089&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327147 Stats: 14 lines in 1 file changed: 14 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18089/head:pull/18089 PR: https://git.openjdk.org/jdk/pull/18089 From kvn at openjdk.org Fri Mar 1 21:07:52 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 1 Mar 2024 21:07:52 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: 
On Fri, 1 Mar 2024 07:39:16 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > clean for other architecture My testing shows that when we do a **cross compilation** on linux-x64, I got: Warning: unused operand (no_rax_RegP) A normal linux-x64 build passed. The operand is used only in one place in the ZGC barriers code: `src/hotspot/cpu/x86/gc/z/z_x86_64.ad` Maybe it is not included during the cross compilation. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1973917187 From kvn at openjdk.org Fri Mar 1 21:27:43 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 1 Mar 2024 21:27:43 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: On Fri, 1 Mar 2024 07:39:16 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > clean for other architecture I am testing the move of `operand no_rax_RegP` into `z_x86_64.ad`.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1973940841

From kvn at openjdk.org  Fri Mar 1 21:32:56 2024
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Fri, 1 Mar 2024 21:32:56 GMT
Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2]
In-Reply-To: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com>
References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com>
Message-ID: 

On Fri, 1 Mar 2024 07:39:16 GMT, kuaiwei wrote:

>> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1.
>> I tried to clean unused operands for all platforms. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it.
>
> kuaiwei has updated the pull request incrementally with one additional commit since the last revision:
>
>   clean for other architecture

FTR, I also bail out from the adlc parser when we have unused operands, to force a build failure:

+++ b/src/hotspot/share/adlc/archDesc.cpp
@@ -773,8 +774,11 @@ bool ArchDesc::check_usage() {
       cnt++;
     }
   }
-  if (cnt) fprintf(stderr, "\n-------Warning: total %d unused operands\n", cnt);
-
+  if (cnt) {
+    fprintf(stderr, "\n-------Warning: total %d unused operands\n", cnt);
+    _semantic_errs++;
+    return false;
+  }
   return true;
 }

I don't think we need it in these changes but it helped me to catch the missing case.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1973947309

From sviswanathan at openjdk.org  Fri Mar 1 21:51:00 2024
From: sviswanathan at openjdk.org (Sandhya Viswanathan)
Date: Fri, 1 Mar 2024 21:51:00 GMT
Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs
In-Reply-To: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
Message-ID: <-wYl_0Etz21KtNOTz-Q9-hyxIvKJC2ufHC42IOpYcLM=.e5150619-c877-4527-9772-ea552ce4871c@github.com>

On Fri, 1 Mar 2024 19:11:58 GMT, Srinivas Vamsi Parasa wrote:

> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs.
> 
> Below is the performance data on an Intel Tiger Lake machine.
> 
> Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
> -- | -- | -- | --
> MathBench.ceilDouble | 547979 | 2170198 | 3.96
> MathBench.floorDouble | 547979 | 2167459 | 3.96
> MathBench.rintDouble | 547962 | 2130499 | 3.89

src/hotspot/cpu/x86/x86.ad line 3895:

> 3893: 
> 3894: /*
> 3895: instruct roundD_mem(legRegD dst, memory src, immU8 rmode) %{

The roundD_mem instruct could be removed now that it is not used. Also the PR could be titled as "Improve performance of Math ceil, floor, and rint for x86".
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18089#discussion_r1509583662

From dlong at openjdk.org  Fri Mar 1 21:51:00 2024
From: dlong at openjdk.org (Dean Long)
Date: Fri, 1 Mar 2024 21:51:00 GMT
Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs
In-Reply-To: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
Message-ID: 

On Fri, 1 Mar 2024 19:11:58 GMT, Srinivas Vamsi Parasa wrote:

> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs.
> 
> Below is the performance data on an Intel Tiger Lake machine.
> 
> Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
> -- | -- | -- | --
> MathBench.ceilDouble | 547979 | 2170198 | 3.96
> MathBench.floorDouble | 547979 | 2167459 | 3.96
> MathBench.rintDouble | 547962 | 2130499 | 3.89

src/hotspot/cpu/x86/x86.ad line 3895:

> 3893: 
> 3894: /*
> 3895: instruct roundD_mem(legRegD dst, memory src, immU8 rmode) %{

Don't we want roundD_mem enabled, for both roundsd (UseAVX == 0) and vroundsd (UseAVX > 0)?
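[Editorial aside, not part of the original thread: the `Math` javadoc fixes the semantics that the `roundD` instructs above must preserve — `ceil`/`floor` round toward positive/negative infinity, and `rint` rounds half to even, which is also the rounding mode the `rmode` immediate selects in the `roundsd`/`vroundsd` encodings. A quick illustrative check, unrelated to the patch itself:]

```java
public class RoundSemantics {
    public static void main(String[] args) {
        // ceil rounds toward positive infinity, floor toward negative infinity
        System.out.println(Math.ceil(2.1));   // 3.0
        System.out.println(Math.floor(-2.1)); // -3.0
        // rint rounds half to even (round-to-nearest-even)
        System.out.println(Math.rint(2.5));   // 2.0
        System.out.println(Math.rint(3.5));   // 4.0
    }
}
```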
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18089#discussion_r1509586262

From sviswanathan at openjdk.org  Fri Mar 1 22:27:52 2024
From: sviswanathan at openjdk.org (Sandhya Viswanathan)
Date: Fri, 1 Mar 2024 22:27:52 GMT
Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs
In-Reply-To: 
References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
Message-ID: <9y7CJbRaeJdn4lR3mwrYbDmvldllWAeUPmmlXHh2Jg0=.c3167586-3cbb-4243-a7e5-d56320a598af@github.com>

On Fri, 1 Mar 2024 21:46:37 GMT, Dean Long wrote:

>> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs.
>> 
>> Below is the performance data on an Intel Tiger Lake machine.
>> 
>> Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
>> -- | -- | -- | --
>> MathBench.ceilDouble | 547979 | 2170198 | 3.96
>> MathBench.floorDouble | 547979 | 2167459 | 3.96
>> MathBench.rintDouble | 547962 | 2130499 | 3.89

> src/hotspot/cpu/x86/x86.ad line 3895:
>
>> 3893: 
>> 3894: /*
>> 3895: instruct roundD_mem(legRegD dst, memory src, immU8 rmode) %{
>
> Don't we want roundD_mem enabled, for both roundsd (UseAVX == 0) and vroundsd (UseAVX > 0)?

@dean-long the roundD_mem instruct is the cause of slow performance due to a false dependency.
It generates an instruction of the following form, which has a 128-bit result:

    roundsd xmm0, memory_src, mode
    vroundsd xmm0, xmm0, memory_src, mode

xmm0 bits 0:63 are the result of the round operation on memory_src; xmm0 bits 64:127 depend on the old value of xmm0 (false dependency).

Forcing the load of memory_src into a register before the operation, as below, removes the false dependency:

    vmovsd xmm0, memory_src          ; bits 64 and above are cleared by vmovsd
    vroundsd xmm0, xmm0, xmm0, mode

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18089#discussion_r1509635469

From dlong at openjdk.org  Fri Mar 1 23:18:52 2024
From: dlong at openjdk.org (Dean Long)
Date: Fri, 1 Mar 2024 23:18:52 GMT
Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs
In-Reply-To: <-wYl_0Etz21KtNOTz-Q9-hyxIvKJC2ufHC42IOpYcLM=.e5150619-c877-4527-9772-ea552ce4871c@github.com>
References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
 <-wYl_0Etz21KtNOTz-Q9-hyxIvKJC2ufHC42IOpYcLM=.e5150619-c877-4527-9772-ea552ce4871c@github.com>
Message-ID: 

On Fri, 1 Mar 2024 21:44:32 GMT, Sandhya Viswanathan wrote:

>> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs.
>> 
>> Below is the performance data on an Intel Tiger Lake machine.
>> 
>> Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
>> -- | -- | -- | --
>> MathBench.ceilDouble | 547979 | 2170198 | 3.96
>> MathBench.floorDouble | 547979 | 2167459 | 3.96
>> MathBench.rintDouble | 547962 | 2130499 | 3.89

> src/hotspot/cpu/x86/x86.ad line 3895:
>
>> 3893: 
>> 3894: /*
>> 3895: instruct roundD_mem(legRegD dst, memory src, immU8 rmode) %{
>
> The roundD_mem instruct could be removed now that it is not used. Also the PR could be titled as "Improve performance of Math ceil, floor, and rint for x86".

OK, let's remove roundD_mem to avoid confusion.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18089#discussion_r1509660183

From kvn at openjdk.org  Fri Mar 1 23:34:52 2024
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Fri, 1 Mar 2024 23:34:52 GMT
Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2]
In-Reply-To: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com>
References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com>
Message-ID: 

On Fri, 1 Mar 2024 07:39:16 GMT, kuaiwei wrote:

>> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1.
>> I tried to clean unused operands for all platforms. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it.
>
> kuaiwei has updated the pull request incrementally with one additional commit since the last revision:
>
>   clean for other architecture

Testing build with moved `operand no_rax_RegP` passed.
Please update changes with this: diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad index d43929efd3e..aef3453b0b1 100644 --- a/src/hotspot/cpu/x86/x86_64.ad +++ b/src/hotspot/cpu/x86/x86_64.ad @@ -2663,18 +2561,6 @@ operand rRegN() %{ // the RBP is used as a proper frame pointer and is not included in ptr_reg. As a // result, RBP is not included in the output of the instruction either. -operand no_rax_RegP() -%{ - constraint(ALLOC_IN_RC(ptr_no_rax_reg)); - match(RegP); - match(rbx_RegP); - match(rsi_RegP); - match(rdi_RegP); - - format %{ %} - interface(REG_INTER); -%} - // This operand is not allowed to use RBP even if // RBP is not used to hold the frame pointer. operand no_rbp_RegP() diff --git a/src/hotspot/cpu/x86/gc/z/z_x86_64.ad b/src/hotspot/cpu/x86/gc/z/z_x86_64.ad index d178805dfc7..0cc2ea03b35 100644 --- a/src/hotspot/cpu/x86/gc/z/z_x86_64.ad +++ b/src/hotspot/cpu/x86/gc/z/z_x86_64.ad @@ -99,6 +99,18 @@ static void z_store_barrier(MacroAssembler& _masm, const MachNode* node, Address %} +operand no_rax_RegP() +%{ + constraint(ALLOC_IN_RC(ptr_no_rax_reg)); + match(RegP); + match(rbx_RegP); + match(rsi_RegP); + match(rdi_RegP); + + format %{ %} + interface(REG_INTER); +%} + // Load Pointer instruct zLoadP(rRegP dst, memory mem, rFlagsReg cr) %{ ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1974071060 From jpai at openjdk.org Sat Mar 2 01:47:53 2024 From: jpai at openjdk.org (Jaikiran Pai) Date: Sat, 2 Mar 2024 01:47:53 GMT Subject: RFR: 8327108: compiler.lib.ir_framework.shared.TestFrameworkSocket should listen on loopback address only In-Reply-To: References: Message-ID: <0eNqq-XwZUEVM6vEtKi7WloB0hzw70lcwdp08jyOjbM=.d31c9e5f-2c1f-4d3e-8f75-f85c6b8fb54f@github.com> On Fri, 1 Mar 2024 12:50:58 GMT, Jaikiran Pai wrote: > Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327108? 
> > As noted in the JBS issue, before this proposed change, the internal test framework code in `compiler.lib.ir_framework.shared.TestFrameworkSocket` was binding a `java.net.ServerSocket` to "any address". This can lead to interference from other hosts on the network, when the tests are run. The change here proposes to bind this `ServerSocket` to loopback address and reduce the chances of such interference. > > Originally, the interference issues were noticed in CI when `tier3` was run. With the change proposed in this PR, I've run `tier1`, `tier2` and `tier3` in our CI environment and they all passed. Thank you Christian and Vladimir for the reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18078#issuecomment-1974177678 From jpai at openjdk.org Sat Mar 2 01:47:58 2024 From: jpai at openjdk.org (Jaikiran Pai) Date: Sat, 2 Mar 2024 01:47:58 GMT Subject: RFR: 8327105: compiler.compilercontrol.share.scenario.Executor should listen on loopback address only In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 13:06:18 GMT, Jaikiran Pai wrote: > Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327105? > > The commit here changes the internal test class `compiler.compilercontrol.share.scenario.Executor` to bind to a loopback address to prevent other hosts on the network to unexpected communicate on the `ServerSocket`. > > The original interference was noticed in some `tier7` tests which use this `Executor` class. With the change proposed in this PR, `tier1`, `tier2`, `tier3` and `tier7`, `tier8` have been run and that issue hasn't been noticed in this class anymore. Thank you for the reviews, Christian and Vladimir. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18079#issuecomment-1974179133 From jpai at openjdk.org Sat Mar 2 01:47:53 2024 From: jpai at openjdk.org (Jaikiran Pai) Date: Sat, 2 Mar 2024 01:47:53 GMT Subject: Integrated: 8327108: compiler.lib.ir_framework.shared.TestFrameworkSocket should listen on loopback address only In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 12:50:58 GMT, Jaikiran Pai wrote: > Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327108? > > As noted in the JBS issue, before this proposed change, the internal test framework code in `compiler.lib.ir_framework.shared.TestFrameworkSocket` was binding a `java.net.ServerSocket` to "any address". This can lead to interference from other hosts on the network, when the tests are run. The change here proposes to bind this `ServerSocket` to loopback address and reduce the chances of such interference. > > Originally, the interference issues were noticed in CI when `tier3` was run. With the change proposed in this PR, I've run `tier1`, `tier2` and `tier3` in our CI environment and they all passed. This pull request has now been integrated. 
Changeset: a9c17a22
Author: Jaikiran Pai
URL: https://git.openjdk.org/jdk/commit/a9c17a22ca8e64d12e28e272e3f4845297290854
Stats: 7 lines in 1 file changed: 3 ins; 1 del; 3 mod

8327108: compiler.lib.ir_framework.shared.TestFrameworkSocket should listen on loopback address only

Reviewed-by: chagedorn, kvn

------------- PR: https://git.openjdk.org/jdk/pull/18078

From jpai at openjdk.org  Sat Mar 2 01:47:58 2024
From: jpai at openjdk.org (Jaikiran Pai)
Date: Sat, 2 Mar 2024 01:47:58 GMT
Subject: Integrated: 8327105: compiler.compilercontrol.share.scenario.Executor should listen on loopback address only
In-Reply-To: 
References: 
Message-ID: 

On Fri, 1 Mar 2024 13:06:18 GMT, Jaikiran Pai wrote:

> Can I please get a review of this test-only change which proposes to address https://bugs.openjdk.org/browse/JDK-8327105?
> 
> The commit here changes the internal test class `compiler.compilercontrol.share.scenario.Executor` to bind to a loopback address, to prevent other hosts on the network from unexpectedly communicating on the `ServerSocket`.
> 
> The original interference was noticed in some `tier7` tests which use this `Executor` class. With the change proposed in this PR, `tier1`, `tier2`, `tier3` and `tier7`, `tier8` have been run and that issue hasn't been noticed in this class anymore.

This pull request has now been integrated.
Changeset: f68a4b9f
Author: Jaikiran Pai
URL: https://git.openjdk.org/jdk/commit/f68a4b9fc4b0add186754465bbeb908b8362be8d
Stats: 7 lines in 2 files changed: 3 ins; 0 del; 4 mod

8327105: compiler.compilercontrol.share.scenario.Executor should listen on loopback address only

Reviewed-by: chagedorn, kvn

------------- PR: https://git.openjdk.org/jdk/pull/18079

From fyang at openjdk.org  Sat Mar 2 03:08:46 2024
From: fyang at openjdk.org (Fei Yang)
Date: Sat, 2 Mar 2024 03:08:46 GMT
Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2]
In-Reply-To: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com>
References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com>
Message-ID: 

On Fri, 1 Mar 2024 07:39:16 GMT, kuaiwei wrote:

>> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1.
>> I tried to clean unused operands for all platforms. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it.
>
> kuaiwei has updated the pull request incrementally with one additional commit since the last revision:
>
>   clean for other architecture

The RISC-V part of the change looks fine. Note that the GHA failure is infrastructural. Debian sid is broken for now: https://bugs.openjdk.org/browse/JDK-8326960

------------- PR Review: https://git.openjdk.org/jdk/pull/18075#pullrequestreview-1912564165

From epeter at openjdk.org  Sat Mar 2 10:58:44 2024
From: epeter at openjdk.org (Emanuel Peter)
Date: Sat, 2 Mar 2024 10:58:44 GMT
Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3]
In-Reply-To: 
References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com>
Message-ID: 

On Tue, 27 Feb 2024 18:23:41 GMT, Jasmine Karthikeyan wrote:

>> @jaskarth
>>> I've designed this benchmark
>>
>> Nice.
Can you also post the generated assembly for Baseline/Patch?
>> I'm just worried that there is some method call, or something else that does not get cleanly inlined and could mess with the benchmark.
>
> @eme64 Sure, here is the assembly for the baseline: https://gist.github.com/jaskarth/1fe6f00a5b37fe3efb0dd6a2d24840e0
> And after: https://gist.github.com/jaskarth/99c56e2f081f996987b96d7e866aca6c
>
> I must have missed this originally when evaluating the benchmark, but looking at the assembly it seems like the baseline JDK creates a `CMove` for that ternary already. I made a quick patch to disable where `PhaseIdealLoop::conditional_move` is called, and the performance still stays the same on the benchmark. I've also attached that assembly if it's of interest: https://gist.github.com/jaskarth/7b12b688f82a3b8e854785f1827b0c20

@jaskarth It seems we were aware of such issues a long time ago:
https://bugs.openjdk.org/browse/JDK-8039104: Don't use Math.min/max intrinsic on x86

So we may actually have to use `if` for min/max instead of CMove, at least on some platforms. But some platforms may have worse branch predictors, and then we should use CMove more often.

------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1974762928

From gli at openjdk.org  Sat Mar 2 11:31:51 2024
From: gli at openjdk.org (Guoxiong Li)
Date: Sat, 2 Mar 2024 11:31:51 GMT
Subject: RFR: 8326692: JVMCI Local.endBci is off-by-one
In-Reply-To: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com>
References: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com>
Message-ID: 

On Fri, 1 Mar 2024 17:48:51 GMT, Gilles Duboscq wrote:

> In class files, in the local variable table, local variables have a start BCI and a length. The local variable has a value from BCI (inclusive) until BCI + length (exclusive).
> On the other end, JVMCI stores that information in `Local` objects with a start BCI and an end BCI (inclusive). > Currently the parser just uses BCI+length to compute the end BCI, leading to an off-by-one error. > > A simple test checking that the start and end BCIs are within the method's bytecode is added. It fails without the fix. src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotResolvedJavaMethodImpl.java line 635: > 633: for (int i = 0; i < localVariableTableLength; i++) { > 634: final int startBci = UNSAFE.getChar(localVariableTableElement + config.localVariableTableElementStartBciOffset); > 635: final int endBci = startBci + UNSAFE.getChar(localVariableTableElement + config.localVariableTableElementLengthOffset) - 1; Just a question: Can the length of a local variable be 0? **If the code length is 0, the `endBci` here may be less than `startBci`.** ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18087#discussion_r1509950353 From gli at openjdk.org Sat Mar 2 11:44:51 2024 From: gli at openjdk.org (Guoxiong Li) Date: Sat, 2 Mar 2024 11:44:51 GMT Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs In-Reply-To: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: On Fri, 1 Mar 2024 19:11:58 GMT, Srinivas Vamsi Parasa wrote: > The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. > > Below is the performance data on an Intel Tiger Lake machine. 
> 
> Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
> -- | -- | -- | --
> MathBench.ceilDouble | 547979 | 2170198 | 3.96
> MathBench.floorDouble | 547979 | 2167459 | 3.96
> MathBench.rintDouble | 547962 | 2130499 | 3.89

src/hotspot/cpu/x86/x86.ad line 3894:

> 3892: %}
> 3893: 
> 3894: /*

Just a notice: if we don't need some code, we should remove it instead of commenting it out.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18089#discussion_r1509951974

From dnsimon at openjdk.org  Sat Mar 2 12:12:51 2024
From: dnsimon at openjdk.org (Doug Simon)
Date: Sat, 2 Mar 2024 12:12:51 GMT
Subject: RFR: 8326692: JVMCI Local.endBci is off-by-one
In-Reply-To: 
References: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com>
Message-ID: 

On Sat, 2 Mar 2024 11:28:43 GMT, Guoxiong Li wrote:

>> In class files, in the local variable table, local variables have a start BCI and a length. The local variable has a value from BCI (inclusive) until BCI + length (exclusive).
>> On the other end, JVMCI stores that information in `Local` objects with a start BCI and an end BCI (inclusive).
>> Currently the parser just uses BCI+length to compute the end BCI, leading to an off-by-one error.
>> 
>> A simple test checking that the start and end BCIs are within the method's bytecode is added. It fails without the fix.
> > src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotResolvedJavaMethodImpl.java line 635: > >> 633: for (int i = 0; i < localVariableTableLength; i++) { >> 634: final int startBci = UNSAFE.getChar(localVariableTableElement + config.localVariableTableElementStartBciOffset); >> 635: final int endBci = startBci + UNSAFE.getChar(localVariableTableElement + config.localVariableTableElementLengthOffset) - 1; > > Just a question: Can the length of a local variable be 0? > > **If the code length is 0, the `endBci` here may be less than `startBci`.** I don't see anything in [JVMS 4.7.13](https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html#jvms-4.7.13) that says it cannot be 0. It basically means the LVT entry is useless (denotes a local that is never alive) but is otherwise harmless. Maybe add this to the javadoc for `getEndBci()` to make the API user aware of this corner case: If the value returned is less than {@link #getStartBCI}, this object denotes a local that is never live. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18087#discussion_r1509955642 From gli at openjdk.org Sat Mar 2 12:24:52 2024 From: gli at openjdk.org (Guoxiong Li) Date: Sat, 2 Mar 2024 12:24:52 GMT Subject: RFR: 8326692: JVMCI Local.endBci is off-by-one In-Reply-To: References: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> Message-ID: On Sat, 2 Mar 2024 12:10:35 GMT, Doug Simon wrote: >> src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotResolvedJavaMethodImpl.java line 635: >> >>> 633: for (int i = 0; i < localVariableTableLength; i++) { >>> 634: final int startBci = UNSAFE.getChar(localVariableTableElement + config.localVariableTableElementStartBciOffset); >>> 635: final int endBci = startBci + UNSAFE.getChar(localVariableTableElement + config.localVariableTableElementLengthOffset) - 1; >> >> Just a question: Can the length of a local variable be 0? 
>> 
>> **If the code length is 0, the `endBci` here may be less than `startBci`.**
>
> I don't see anything in [JVMS 4.7.13](https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html#jvms-4.7.13) that says it cannot be 0. It basically means the LVT entry is useless (denotes a local that is never alive) but is otherwise harmless.
> Maybe add this to the javadoc for `getEndBci()` to make the API user aware of this corner case:
> 
> If the value returned is less than {@link #getStartBCI}, this object denotes a local that is never live.

The reason for this problem is that `Local::endBci` is inclusive rather than exclusive. But now, we can only fix the javadoc just as you suggested.

>> If the value returned is less than {@link #getStartBCI}, this object denotes a local that is never live.

`a local variable` may read better than `a local` above.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18087#discussion_r1509957764

From jbhateja at openjdk.org  Sat Mar 2 16:22:22 2024
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Sat, 2 Mar 2024 16:22:22 GMT
Subject: RFR: 8318650: Optimized subword gather for x86 targets. [v17]
In-Reply-To: 
References: 
Message-ID: 

> Hi All,
> 
> This patch optimizes the sub-word gather operation for x86 targets with AVX2 and AVX512 features.
> 
> Following is the summary of changes:-
> 
> 1) Intrinsify sub-word gather using a hybrid algorithm which initially partially unrolls the scalar loop to accumulate values from gather indices into a quadword (64-bit) slice, followed by a vector permutation to place the slice into the appropriate vector lanes. This prevents code bloating and generates a compact JIT sequence. Coupled with savings from avoiding the expensive array allocation in the existing java implementation, this translates into significant performance gains of 1.5-10x with the included micro.
> 
> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
> 
> 
> 2) The patch was also compared against a modified java fallback implementation, replacing the temporary array allocation with a zero-initialized vector and a scalar loop which inserts gathered values into the vector. However, a vector insert operation into the higher vector lanes is a three-step process: it first extracts the upper 128-bit vector lane, updates it with the gathered subword value, and then inserts the lane back into its original position. This makes inserts into higher-order lanes costly w.r.t. the proposed solution. In addition, the generated JIT code for the modified fallback implementation was very bulky. This may impact in-lining decisions into caller contexts.
> 
> Kindly review and share your feedback.
> 
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:

  Review resolutions.

------------- Changes:
  - all: https://git.openjdk.org/jdk/pull/16354/files
  - new: https://git.openjdk.org/jdk/pull/16354/files/b971fbb7..0b270d2e

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=16354&range=16
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16354&range=15-16

Stats: 25 lines in 4 files changed: 10 ins; 9 del; 6 mod
Patch: https://git.openjdk.org/jdk/pull/16354.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/16354/head:pull/16354

PR: https://git.openjdk.org/jdk/pull/16354

From jbhateja at openjdk.org  Sat Mar 2 16:36:51 2024
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Sat, 2 Mar 2024 16:36:51 GMT
Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs
In-Reply-To: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
Message-ID: 

On Fri, 1 Mar 2024 19:11:58 GMT, Srinivas Vamsi Parasa wrote:

> The goal of this PR is to
provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs.
> 
> Below is the performance data on an Intel Tiger Lake machine.
> 
> Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
> -- | -- | -- | --
> MathBench.ceilDouble | 547979 | 2170198 | 3.96
> MathBench.floorDouble | 547979 | 2167459 | 3.96
> MathBench.rintDouble | 547962 | 2130499 | 3.89

Changes requested by jbhateja (Reviewer).

src/hotspot/cpu/x86/x86.ad line 3884:

> 3882: 
> 3883: instruct roundD_reg_avx(legRegD dst, legRegD src, immU8 rmode) %{
> 3884:   predicate(UseAVX > 0);

can you push the predicate in instruction encoding block and fold this pattern with roundD_reg.

------------- PR Review: https://git.openjdk.org/jdk/pull/18089#pullrequestreview-1912689349
PR Review Comment: https://git.openjdk.org/jdk/pull/18089#discussion_r1510001900

From gdub at openjdk.org  Sat Mar 2 17:39:52 2024
From: gdub at openjdk.org (Gilles Duboscq)
Date: Sat, 2 Mar 2024 17:39:52 GMT
Subject: RFR: 8326692: JVMCI Local.endBci is off-by-one
In-Reply-To: 
References: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com>
Message-ID: <_JNCAXiYyN2WCHzga4-m0hq4Fy-Na4ssbkegh0e_etE=.b9507756-eb7a-4b81-97d3-78f9824cfa17@github.com>

On Sat, 2 Mar 2024 12:21:51 GMT, Guoxiong Li wrote:

>> I don't see anything in [JVMS 4.7.13](https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html#jvms-4.7.13) that says it cannot be 0. It basically means the LVT entry is useless (denotes a local that is never alive) but is otherwise harmless.
>> Maybe add this to the javadoc for `getEndBci()` to make the API user aware of this corner case:
>> 
>> If the value returned is less than {@link #getStartBCI}, this object denotes a local that is never live.
>
> The reason for this problem is that `Local::endBci` is inclusive rather than exclusive. But now, we can only fix the javadoc just as you suggested.
>
>> If the value returned is less than {@link #getStartBCI}, this object denotes a local that is never live.
>
> `a local variable` may read better than `a local` above.

I had checked the specs on that and came to the same conclusion. I also think the current state is fine in that regard in terms of code, since it just means that there is no BCI where this local would be valid when checking both the start and end BCI. Adding a note about that to the javadoc is a good idea. I'll do that.
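[Editorial aside, not part of the original thread: to make the off-by-one and the zero-length corner case concrete, here is a small illustrative sketch of the conversion being discussed. The helper name is made up for illustration; the real logic lives in `HotSpotResolvedJavaMethodImpl`:]

```java
public class EndBciDemo {
    // Class files encode a local's live range as [startBci, startBci + length),
    // i.e. the length is exclusive; JVMCI's Local stores an inclusive end BCI.
    static int inclusiveEndBci(int startBci, int length) {
        return startBci + length - 1;
    }

    public static void main(String[] args) {
        // A local live for 3 bytecodes starting at BCI 10 ends at BCI 12
        // inclusive; the pre-fix code computed startBci + length = 13 instead.
        System.out.println(inclusiveEndBci(10, 3));      // 12
        // A zero-length LVT entry yields endBci < startBci: a local that is
        // never live, which is what the new javadoc note calls out.
        System.out.println(inclusiveEndBci(10, 0) < 10); // true
    }
}
```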
Gilles Duboscq has updated the pull request incrementally with one additional commit since the last revision: Add note about zero-length locals ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18087/files - new: https://git.openjdk.org/jdk/pull/18087/files/90e96b4e..fe1ee476 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18087&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18087&range=00-01 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18087.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18087/head:pull/18087 PR: https://git.openjdk.org/jdk/pull/18087 From gli at openjdk.org Sat Mar 2 23:44:51 2024 From: gli at openjdk.org (Guoxiong Li) Date: Sat, 2 Mar 2024 23:44:51 GMT Subject: RFR: 8326692: JVMCI Local.endBci is off-by-one [v2] In-Reply-To: <9uTHB3xtVIXw_dhZxFZBx6krgmipxeaa3DGIM52ueLs=.f63bd368-80d2-4fe6-b18c-f0896246957e@github.com> References: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> <9uTHB3xtVIXw_dhZxFZBx6krgmipxeaa3DGIM52ueLs=.f63bd368-80d2-4fe6-b18c-f0896246957e@github.com> Message-ID: On Sat, 2 Mar 2024 17:58:01 GMT, Gilles Duboscq wrote: >> In class files, in the local variable table, local variables have a start BCI and a length. The local variable has a value from BCI (inclusive) until BCI + length (exclusive). >> On the other end, JVMCI stores that information in `Local` objects with a start BCI and an end BCI (inclusive). >> Currently the parser just uses BCI+length to compute the end BCI, leading to an off-by-one error. >> >> A simple test checking that the start and end BCIs are within the method's bytecode is added. It fails without the fix. > > Gilles Duboscq has updated the pull request incrementally with one additional commit since the last revision: > > Add note about zero-length locals Looks good. ------------- Marked as reviewed by gli (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/18087#pullrequestreview-1912780304 From gdub at openjdk.org Sun Mar 3 11:08:55 2024 From: gdub at openjdk.org (Gilles Duboscq) Date: Sun, 3 Mar 2024 11:08:55 GMT Subject: Integrated: 8326692: JVMCI Local.endBci is off-by-one In-Reply-To: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> References: <1p8-1AyLDSsbjUAOhh6Hmp7ds6IwRahghCaa-s_pPcM=.4ef3a24c-0324-4f3e-81f6-9cc90d4fa8da@github.com> Message-ID: On Fri, 1 Mar 2024 17:48:51 GMT, Gilles Duboscq wrote: > In class files, in the local variable table, local variables have a start BCI and a length. The local variable has a value from BCI (inclusive) until BCI + length (exclusive). > On the other end, JVMCI stores that information in `Local` objects with a start BCI and an end BCI (inclusive). > Currently the parser just uses BCI+length to compute the end BCI, leading to an off-by-one error. > > A simple test checking that the start and end BCIs are within the method's bytecode is added. It fails without the fix. This pull request has now been integrated. Changeset: 31ac8714 Author: Gilles Duboscq URL: https://git.openjdk.org/jdk/commit/31ac8714e0593f2feaa8e9ebaf32bab904ba6d11 Stats: 33 lines in 3 files changed: 29 ins; 1 del; 3 mod 8326692: JVMCI Local.endBci is off-by-one Reviewed-by: dnsimon, never, gli ------------- PR: https://git.openjdk.org/jdk/pull/18087 From epeter at openjdk.org Mon Mar 4 08:13:00 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 08:13:00 GMT Subject: RFR: 8318650: Optimized subword gather for x86 targets. [v17] In-Reply-To: References: Message-ID: On Sat, 2 Mar 2024 16:22:22 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features. 
>> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-word gather using a hybrid algorithm which initially partially unrolls a scalar loop to accumulate values from gather indices into a quadword (64-bit) slice, followed by a vector permutation to place the slice into the appropriate vector lanes; this prevents code bloat and generates a compact JIT sequence. This, coupled with the savings from the expensive array allocation in the existing Java implementation, translates into significant performance gains of 1.5-10x with the included micro. >> >> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d) >> >> >> 2) The patch was also compared against a modified Java fallback implementation by replacing the temporary array allocation with a zero-initialized vector and a scalar loop which inserts gathered values into the vector. However, a vector insert operation into the higher vector lanes is a three-step process which first extracts the upper 128-bit vector lane, updates it with the gathered subword value and then inserts the lane back into its original position. This makes inserts into higher-order lanes costly w.r.t. the proposed solution. In addition, the generated JIT code for the modified fallback implementation was very bulky. This may impact inlining decisions in caller contexts. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review resolutions. @jatin-bhateja thanks for all the work, this is a really nice feature! And thanks for bearing with all the comments! Testing up to commit 14 looks good. @PaulSandoz thanks for looking at the Vector API java code! ------------- Marked as reviewed by epeter (Reviewer).
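The quadword-slice accumulation described in point 1 above can be illustrated in plain Java (a hypothetical stand-in; the real implementation is x86 machine code emitted by C2): four gathered 16-bit values are packed into one 64-bit slice, which the intrinsic would then place into a vector lane via a permutation.

```java
// Hypothetical illustration of the "accumulate into a 64-bit slice" idea from
// point 1 above; packSlice is illustrative, not the actual intrinsic.
public class GatherSlice {
    // Pack four gathered shorts (16 bits each) into one quadword slice.
    static long packSlice(short[] src, int[] indices, int from) {
        long slice = 0L;
        for (int i = 0; i < 4; i++) {
            long v = src[indices[from + i]] & 0xFFFFL; // zero-extend the subword
            slice |= v << (16 * i);                    // lane i of the slice
        }
        return slice;
    }

    public static void main(String[] args) {
        short[] src = {100, 200, 300, 400, 500};
        int[] idx = {4, 3, 2, 1};
        long slice = packSlice(src, idx, 0);
        // Lane 0 holds src[4] = 500, lane 3 holds src[1] = 200.
        System.out.println((slice & 0xFFFF) + " " + ((slice >>> 48) & 0xFFFF));
    }
}
```

Unrolling the scalar loop over whole 64-bit slices rather than inserting one subword at a time is what avoids the costly per-lane insert sequence described in point 2.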
PR Review: https://git.openjdk.org/jdk/pull/16354#pullrequestreview-1913610983 From chagedorn at openjdk.org Mon Mar 4 08:13:55 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 08:13:55 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: <85eDvI-w8-zdB4MVfI7I7sZ1M63kw4QqDND_2BqMv5w=.415e84ac-4afb-4c69-b4b6-f045dc67449b@github.com> On Fri, 1 Mar 2024 13:27:38 GMT, Christian Hagedorn wrote: > In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. > > #### Redo refactoring of `create_bool_from_template_assertion_predicate()` > On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). > > #### Share data graph cloning code - start from existing code > This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: > > 1. Collect data nodes to clone by using a node filter > 2. Clone the collected nodes (their data and control inputs still point to the old nodes) > 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. 
In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. > > #### Shared data graph cloning class > Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: > > 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] > 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] > > `... Thanks for the careful review! I could have added some more comments explaining what the follow-up refactoring will do, to help better understand some of the decisions in this patch. I've added replies and will update the PR shortly with the mentioned changes.
------------- PR Review: https://git.openjdk.org/jdk/pull/18080#pullrequestreview-1913506965 From chagedorn at openjdk.org Mon Mar 4 08:13:56 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 08:13:56 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 14:09:58 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/loopPredicate.cpp line 254: >> >>> 252: const Unique_Node_List nodes_with_same_ctrl = find_nodes_with_same_ctrl(start_node, old_uncommon_proj); >>> 253: DataNodeGraph data_node_graph(nodes_with_same_ctrl, this); >>> 254: auto& orig_to_new = data_node_graph.clone(new_uncommon_proj); >> >> This was a bit confusing. At first I thought you are cloning the `data_node_graph`, since the `auto` did not tell me that here we are getting a hash-table back. >> I wonder if this cloning should be done in the constructor of `DataNodeGraph`. > > The beauty of packing it into the constructor is that you have fewer lines here. And that is probably beneficial if you are going to use the class elsewhere -> less code duplication. Generally, I think there is this debate about how much work one should do in the constructor (minimal vs. maximal) and I guess there is no clear consensus. In the compiler code, we seem to tend more towards doing the work in the constructor. I agree that packing it all together to hide it from the user is quite nice. However, in this case here, `DataNodeGraph` is later extended (when refactoring `create_bool_from_template_assertion_predicate()` in JDK-8327110/8327111) to not only clone but also clone+transform opaque loop nodes (offering an additional method). This was the main reason I went with a separation here. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510673858 From chagedorn at openjdk.org Mon Mar 4 08:13:59 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 08:13:59 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 14:11:17 GMT, Emanuel Peter wrote: >> In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. >> >> #### Redo refactoring of `create_bool_from_template_assertion_predicate()` >> On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). >> >> #### Share data graph cloning code - start from existing code >> This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: >> >> 1. Collect data nodes to clone by using a node filter >> 2. Clone the collected nodes (their data and control inputs still point to the old nodes) >> 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. 
In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. >> >> #### Shared data graph cloning class >> Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: >> >> 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] >> 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to ... > > src/hotspot/share/opto/loopPredicate.cpp line 255: > >> 253: DataNodeGraph data_node_graph(nodes_with_same_ctrl, this); >> 254: auto& orig_to_new = data_node_graph.clone(new_uncommon_proj); >> 255: fix_cloned_data_node_controls(old_uncommon_proj, new_uncommon_proj, orig_to_new); > > And is there a reason why `fix_cloned_data_node_controls` is not part of the `DataNodeGraph` class? Is there any use of the class where we don't have to call `fix_cloned_data_node_controls`? The way we fix the control inputs here is very specific to this code (I don't think we'll do something similar elsewhere). `create_bool_from_template_assertion_predicate()` only does cloning of non-pinned nodes and does not need to do rewire controls - I think this code could be reused at other places as well but that could be cleaned up separately. 
If we later refactor other cases which need to rewire the control nodes in a specific way, we could still try to move the code of `fix_cloned_data_node_controls()` inside `DataNodeGraph` and try to share it. > src/hotspot/share/opto/loopPredicate.cpp line 256: > >> 254: auto& orig_to_new = data_node_graph.clone(new_uncommon_proj); >> 255: fix_cloned_data_node_controls(old_uncommon_proj, new_uncommon_proj, orig_to_new); >> 256: Node** cloned_node_ptr = orig_to_new.get(start_node); > > Boah, this `**` is a bit nasty. It would have been nicer if a reference was returned instead, which already checks that the element exists. That's indeed not that great. Maybe the hash table class should provide an extra function to get references/pointers back (I see why returning a pointer is useful when you directly store objects instead of pointers into the hash table) - not sure though if we should squeeze that into this PR. Maybe in a separate RFE? > src/hotspot/share/opto/loopPredicate.cpp line 265: > >> 263: void PhaseIdealLoop::fix_cloned_data_node_controls( >> 264: const ProjNode* old_uncommon_proj, Node* new_uncommon_proj, >> 265: const ResizeableResourceHashtable& orig_to_new) { > > Suggestion: > > const ResizeableResourceHashtable& orig_to_new) > { > > This might also help with understanding the indentation. But this is a taste question for sure. Will change with the indentation fix. > src/hotspot/share/opto/loopPredicate.cpp line 271: > >> 269: set_ctrl(clone, new_uncommon_proj); >> 270: } >> 271: }); > > Indentation is suboptimal here. I found it difficult to read. > Style guide: > > > Indentation for multi-line lambda: > > c.do_entries([&] (const X& x) { > do_something(x, a); > do_something1(x, b); > do_something2(x, c); > }); Good point, I was not aware of this formatting rule. Will fix that.
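The clone-then-rewire scheme that `DataNodeGraph` encapsulates (clone all collected nodes first, then fix their inputs through the orig->new mapping) can be sketched in a few lines. The `Node` class below is a Java stand-in for illustration only, not HotSpot's C++ `Node`:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in sketch of the two shared DataNodeGraph steps discussed above:
// clone the collected nodes, then rewire cloned inputs via the orig->new map.
public class DataGraphCloneSketch {
    static class Node {
        final String name;
        final List<Node> inputs = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    static Map<Node, Node> cloneAndRewire(List<Node> collected) {
        Map<Node, Node> origToNew = new HashMap<>();
        for (Node n : collected) {              // step 1: clone all nodes
            Node clone = new Node(n.name + "'");
            clone.inputs.addAll(n.inputs);      // inputs still point at old nodes
            origToNew.put(n, clone);
        }
        for (Node clone : origToNew.values()) { // step 2: rewire via the mapping
            for (int i = 0; i < clone.inputs.size(); i++) {
                Node replacement = origToNew.get(clone.inputs.get(i));
                if (replacement != null) {      // inputs outside the set are kept
                    clone.inputs.set(i, replacement);
                }
            }
        }
        return origToNew;
    }

    public static void main(String[] args) {
        Node a = new Node("a"), b = new Node("b");
        b.inputs.add(a);
        Map<Node, Node> m = cloneAndRewire(List.of(a, b));
        System.out.println(m.get(b).inputs.get(0) == m.get(a)); // true
    }
}
```

Splitting cloning from rewiring is what makes step 2 safe: by the time inputs are rewired, every clone already exists in the mapping, so iteration order does not matter.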
> src/hotspot/share/opto/loopPredicate.cpp line 291: > >> 289: for (uint i = 1; i < next->req(); i++) { >> 290: Node* in = next->in(i); >> 291: if (!in->is_Phi()) { > > What happened with the `is_Phi`? Is it not needed anymore? See later comment in `DataNodeGraph`. > src/hotspot/share/opto/loopnode.hpp line 1921: > >> 1919: rewire_clones_to_cloned_inputs(); >> 1920: return _orig_to_new; >> 1921: } > > Currently, it looks like one could call `clone` multiple times. But I think that would be wrong, right? > That is why I'd put all the active logic in the constructor, and only the passive stuff is publicly accessible, with `const` to indicate that these don't have any effect. Yes, that would be unexpected, so I agree with you here. But as mentioned earlier, we need to add another method to this class later which does the cloning slightly differently, so we cannot do all the work in the constructor. We probably have multiple options here: - Do nothing (could be reasonable as this class is only used rarely and if it's used it's most likely uncommon to clone twice in a row on the same object - and if one does, one probably has a look at the class anyway to notice what to do). - Add asserts to ensure `clone()` is only called once (adds more code but could be a low overhead option - however, we should think about whether we really want to save the user from itself). - Return a copy of the hash table and clear it afterward (seems too much overhead for having no such use-case). I think option 1 and 2 are both fine. > src/hotspot/share/opto/loopopts.cpp line 4519: > >> 4517: _orig_to_new.iterate_all([&](Node* node, Node* clone) { >> 4518: for (uint i = 1; i < node->req(); i++) { >> 4519: Node** cloned_input = _orig_to_new.get(node->in(i)); > > You don't need to check for `is_Phi` on `node->in(i)` anymore? Could have added a comment here about the `is_Phi()` drop. The `DataNodeGraph` class already takes a node collection to clone. 
We therefore do not need to additionally check for `is_Phi()` here. If an input is a phi, it would not have been cloned in the first place because the node collection does not contain phis (L239): https://github.com/openjdk/jdk/blob/c00cc8ffaee9bf9b3278d84afba0af2ac00134de/src/hotspot/share/opto/loopPredicate.cpp#L231-L245 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510681911 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510686211 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510708206 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510696808 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510699583 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510721511 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510703408 From chagedorn at openjdk.org Mon Mar 4 08:13:59 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 08:13:59 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: <7kG-lqyCbXktVszrDRPmc2SkQYXxnqcG9HmCmR5YSCQ=.1bd20dcb-5285-4746-b579-502029c0d9bd@github.com> References: <7kG-lqyCbXktVszrDRPmc2SkQYXxnqcG9HmCmR5YSCQ=.1bd20dcb-5285-4746-b579-502029c0d9bd@github.com> Message-ID: On Fri, 1 Mar 2024 13:58:42 GMT, Emanuel Peter wrote: >> Suggestion for better name `CloneDataNodeGraph`. Do you assert that only data nodes are cloned, and no CFG nodes? > > Yes, you do verify it, great! > You could have a typedef for `ResizeableResourceHashtable`. Then you don't need to use `auto` for it elsewhere, and it is clear what it is. Suggestion: `OrigToNewHashtable`. Good idea. I'll add one. > The name could mention that we are cloning. And maybe you could do the work in the constructor, and just have accessors for the finished products, such as `_orig_to_new`. 
Suggestion for better name CloneDataNodeGraph. As mentioned earlier, we are later gonna reuse this class when refactoring `create_bool_from_template_assertion_predicate()`. For template assertion predicates we not only need to clone nodes but also need to transform the `OpaqueLoop*Nodes`. Therefore, I went with keeping the name of this class as `DataNodeGraph` and use `_orig_to_new` and not use `_orig_to_clone` since we could be transforming `OpaqueLoop*Nodes` in such a way that we replace it with existing nodes and not clones. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510704146 From epeter at openjdk.org Mon Mar 4 08:19:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 08:19:58 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: <85eDvI-w8-zdB4MVfI7I7sZ1M63kw4QqDND_2BqMv5w=.415e84ac-4afb-4c69-b4b6-f045dc67449b@github.com> References: <85eDvI-w8-zdB4MVfI7I7sZ1M63kw4QqDND_2BqMv5w=.415e84ac-4afb-4c69-b4b6-f045dc67449b@github.com> Message-ID: On Mon, 4 Mar 2024 08:10:17 GMT, Christian Hagedorn wrote: >> In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. >> >> #### Redo refactoring of `create_bool_from_template_assertion_predicate()` >> On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. 
This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). >> >> #### Share data graph cloning code - start from existing code >> This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: >> >> 1. Collect data nodes to clone by using a node filter >> 2. Clone the collected nodes (their data and control inputs still point to the old nodes) >> 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. >> >> #### Shared data graph cloning class >> Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: >> >> 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] >> 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to ... > > Thanks for the careful review! I could have added some more comments explaining, what the follow-up refactoring will do to help to better understand some of the decisions in this patch. I've added replies and will update the PR shortly with the mentionied changes. 
@chhagedorn ah ok, I see. I didn't quite realize how you were going to extend the code later before your comments. In that case you can of course leave the computations outside the constructor. We can still discuss the final shape of the code once you do the next RFE's on the same code :) I'll wait for your code updates to re-review, just ping me ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/18080#issuecomment-1975967965 From jkarthikeyan at openjdk.org Mon Mar 4 08:21:55 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 4 Mar 2024 08:21:55 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> Message-ID: On Sat, 2 Mar 2024 10:55:41 GMT, Emanuel Peter wrote: >> @eme64 Sure, here is the assembly for the baseline: https://gist.github.com/jaskarth/1fe6f00a5b37fe3efb0dd6a2d24840e0 >> And after: https://gist.github.com/jaskarth/99c56e2f081f996987b96d7e866aca6c >> >> I must have missed this originally when evaluating the benchmark, but looking at the assembly it seems like the baseline JDK creates a `CMove` for that ternary already. I made a quick patch to disable where `PhaseIdealLoop::conditional_move` is called, and the performance still stays the same on the benchmark. I've also attached that assembly if it's of interest: https://gist.github.com/jaskarth/7b12b688f82a3b8e854785f1827b0c20 > > @jaskarth It seems we were aware of such issues a long time ago: > https://bugs.openjdk.org/browse/JDK-8039104: Don't use Math.min/max intrinsic on x86 > So we may actually have use `if` for min/max instead of CMove, at least on some platforms. > But some platforms may have worse branch predictors, and then we should use CMove more often. Hey @eme64, first of all, I want to thank you for your detailed analysis, and the added benchmark! I hope to answer your questions below. 
> I would like to see a benchmark where you get a regression with your patch if you removed the `PROB_UNLIKELY_MAG(2);` check, or at least make it much smaller. I would like to see if there is some breaking-point where branch prediction is actually faster. I think this is a good point as well, I'll try to design a benchmark for this. > You seem to have discovered that your last example was already converted to CMove. What cases does your code cover that are not already covered by the `PhaseIdealLoop::conditional_move` logic? In my benchmark, I found that `testSingleInt` wasn't turned into a `CMove`, but after some more investigation I think this is because of a mistake in my benchmark. I mistakenly select the 0th element of the arrays every time, when I should be randomly selecting indices to prevent a side of the branch from being optimized out. When that change is made, it produces a CMove. I also recall finding cases earlier where CMoves in loops weren't created, but I think this must have been before [JDK-8319451](https://bugs.openjdk.org/browse/JDK-8319451) was integrated. I'll keep searching for more cases, but I tried a few examples and couldn't really find any where a minmax was made but the CMove wasn't. > I think that the CMove logic kicks in for most loops, though maybe not all cases? Would be interesting to know which of your cases were already done by CMove, and which not. And why. I think looking at the code in `PhaseIdealLoop::conditional_move`, the primary difference is that the CMove code has an additional cost metric for loops, whereas is_minmax only has the `PROB_UNLIKELY_MAG(2)` check that the CMove logic uses when not in a loop. I think this might potentially lead to minmax transforming cases in loops that `CMove` might not have, but that may not necessarily be desirable. > One more general issue: So far you have only shown that your optimization leads to speedups in conjunction with auto-vectorization.
Do you have any examples which get speedups without auto-vectorization? The thing is: I do hope to do if-conversion in auto-vectorization. Hence, it would be nice to know that your optimization has benefits in cases where if-conversion does not apply. I think the primary benefit of this optimization in straight-line code is the tightened bounds of the Min/Max node as compared to the equivalent Phi or `CMove`. If we have `CMove(0, int_bottom)` then its type would be `int_bottom`, as it does a meet over the operands. But if it were a Max instead, its type would be `[0, int_max]`, which is a sharper type. As an example:

    int b = ...; // int_bottom
    int c = b < 0 ? 0 : b;
    if (c < 0) {
        ...; // dead code
    }

This example is a bit contrived, but previously that branch would not have been pruned. I found this kind of optimization hard to look for, so I added a temporary field to MaxNode that would only be set to true when MaxNodes were created by is_minmax, and dumped the results of `MaxINode::add_ring` when it was called with the field as true. When running the test suite, I saw there were many cases where this transform was able to create a better type for its operands than an equivalent cmove or phi, and in some cases it was even able to statically determine the operation to be a constant value. > When I ran `make test TEST="micro:IfMinMax" CONF=linux-x64 MICRO="OPTIONS=-prof perfasm"` and checked the generated assembly, I did not find any vector instructions. Could it be that `SIZE=300` is too small? I generally use vector sizes in the range of `10_000`, just to make sure it vectorizes. Maybe it is because I have an avx512 machine with 64-byte registers, compared to 32-byte registers for AVX2? Not sure. That is interesting, as when running that command I see vectorization on my machine, at least with `testVector*`. `testReduction*` still needs that patch you linked earlier to work.
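The type-sharpening argument above can be made concrete with a tiny interval model (a deliberately simplified, hypothetical stand-in for C2's `TypeInt`/`add_ring`, not the actual implementation): taking the max of the full int range and the constant interval [0, 0] yields [0, int_max], which is exactly the sharper bound that lets the dead branch be pruned.

```java
// Simplified interval model of the Max type sharpening discussed above;
// Interval and maxRing are illustrative stand-ins for C2's TypeInt/add_ring.
public class MinMaxTypeSketch {
    record Interval(int lo, int hi) {}

    // The range of max(a, b) is [max(a.lo, b.lo), max(a.hi, b.hi)].
    static Interval maxRing(Interval a, Interval b) {
        return new Interval(Math.max(a.lo(), b.lo()), Math.max(a.hi(), b.hi()));
    }

    public static void main(String[] args) {
        Interval bottom = new Interval(Integer.MIN_VALUE, Integer.MAX_VALUE);
        Interval zero = new Interval(0, 0);
        Interval sharpened = maxRing(bottom, zero);
        // c = max(b, 0) is known to be >= 0, so "if (c < 0)" is dead code.
        System.out.println(sharpened.lo() >= 0); // true
    }
}
```

A `CMove(0, int_bottom)`, by contrast, would meet the two operand types and stay at the full int range, so the branch pruning shown above is only possible with the Min/Max form.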
I will increase the iteration count as suggested though, in case that is the cause of the discrepancy. > Let me know what you think. Not sure if this regression is important enough, but we need to consider what to do about your patch, as well as the CMove logic that already exists. I think it is definitely worth considering this regression, as I would ideally like to avoid regressions altogether. With a bit of further reflection on all this, I think it might be best if this patch was changed so that it acts on `CMove` directly, as @merykitty suggested earlier. This would mean we wouldn't need to approximate the `CMove` heuristic in `is_minmax`, and that we would see benefits in tandem with improvements to our `CMove` heuristic. That way if the `CMove` heuristic was changed later to take into account the cost behind the cmp, it would also fix this case. Do you have any thoughts on this @eme64 (and @merykitty)? ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1975971570 From chagedorn at openjdk.org Mon Mar 4 08:41:06 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 08:41:06 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v2] In-Reply-To: References: Message-ID: > In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. > > #### Redo refactoring of `create_bool_from_template_assertion_predicate()` > On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously.
We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). > > #### Share data graph cloning code - start from existing code > This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: > > 1. Collect data nodes to clone by using a node filter > 2. Clone the collected nodes (their data and control inputs still point to the old nodes) > 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. > > #### Shared data graph cloning class > Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: > > 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] > 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] > > `... 
Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Review Emanuel: Add typedef and replace usages, format lambda, some renaming ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18080/files - new: https://git.openjdk.org/jdk/pull/18080/files/c00cc8ff..a569132e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=00-01 Stats: 18 lines in 4 files changed: 3 ins; 2 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/18080.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18080/head:pull/18080 PR: https://git.openjdk.org/jdk/pull/18080 From epeter at openjdk.org Mon Mar 4 09:02:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 09:02:58 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> Message-ID: <_yIQLmJFXOolbLAS8Wcxgl1juRlQwB0OWkKd8ZMcfmg=.9ed4a52d-9ffb-45eb-a0dc-7b3201974882@github.com> On Mon, 4 Mar 2024 08:18:41 GMT, Jasmine Karthikeyan wrote: >> @jaskarth It seems we were aware of such issues a long time ago: >> https://bugs.openjdk.org/browse/JDK-8039104: Don't use Math.min/max intrinsic on x86 >> So we may actually have use `if` for min/max instead of CMove, at least on some platforms. >> But some platforms may have worse branch predictors, and then we should use CMove more often. > > Hey @eme64, first of all, I want to thank you for your detailed analysis, and the added benchmark! I hope to answer your questions below. > >> I would like to see a benchmark where you get a regression with your patch if you removed the `PROB_UNLIKELY_MAG(2);` check, or at least make it much smaller. I would like to see if there is some breaking-point where branch prediction is actually faster. 
> > I think this is a good point as well, I'll try to design a benchmark for this. > >> You seem to have discovered that your last example was already converted to CMove. What cases does your code cover that are not already covered by the `PhaseIdealLoop::conditional_move` logic? > > In my benchmark, I found that `testSingleInt` wasn't turned into a `CMove`, but after some more investigation I think this is because of a mistake in my benchmark. I mistakenly select the 0th element of the arrays every time, when I should be randomly selecting indices to prevent a side of the branch from being optimized out. When that change is made, it produces a CMove. I also recall finding cases earlier where CMoves in loops weren't created, but I think this must have been before [JDK-8319451](https://bugs.openjdk.org/browse/JDK-8319451) was integrated. I'll keep searching for more cases, but I tried a few examples and couldn't really find any where a minmax was made but the CMove wasn't. > >> I think that the CMove logic kicks in for most loops, though maybe not all cases? Would be interesting to know which of your cases were already done by CMove, and which not. And why. > > I think looking at the code in `PhaseIdealLoop::conditional_move`, the primary difference is that the CMove code has an additional cost metric for loops, whereas is_minmax only has the `PROB_UNLIKELY_MAG(2)` check that the CMove logic uses when not in a loop. I think this might potentially lead to minmax transforming cases in loops that `CMove` might not have - but that may not necessarily be desirable. > >> One more general issue: So far you have only shown that your optimization leads to speedups in conjunction with auto-vectorization. Do you have any examples which get speedups without auto-vectorization? > The thing is: I do hope to do if-conversion in auto-vectorization. Hence, it would be nice to know that your optimization has benefits in cases where if-conversion does not apply.
> > I think the primary benefit of this optimization in straight-line code is the tightened bounds of the Min/Max node as compared to the equivalent Ph... @jaskarth > With a bit of further reflection on all this, I think it might be best if this patch was changed so that it acts on CMove directly You mean you would be matching for a `Cmp -> CMove` node pattern that is equivalent for `Min/Max`, rather than matching a `Cmp -> If -> Phi` pattern? I guess that would allow you to get better types, without having to deal with all the CMove-vs-branch-prediction heuristics. BTW, I watched a fascinating talk about branch-predictors / branchless code yesterday: `Branchless Programming in C++ - Fedor Pikus - CppCon 2021` https://www.youtube.com/watch?v=g-WPhYREFjk My conclusion from that: it is really hard to say ahead of time if the branch-predictor is successful. It depends on how predictable a condition is. The branch-predictor can see patterns (like alternating true-false). So even if a probability is 50% on a branch, it may be fully predictable, and branching code is much more efficient than branchless code. But in totally random cases, branchless code may be faster because you will have a large percentage of mispredictions, and mispredictions are expensive. But in both cases you would see `iff->_prob = 0.5`. Really what we would need is profiling that checks how much a branch was `mispredicted`, and not how much it was `taken`. But not sure if we can even get that profiling data. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1976054500 From galder at openjdk.org Mon Mar 4 09:12:12 2024 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 4 Mar 2024 09:12:12 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v6] In-Reply-To: References: Message-ID: > Adding C1 intrinsic for primitive array clone invocations for aarch64 and x86 architectures. 
>
> The intrinsic includes a change to avoid zeroing the newly allocated array because its contents are copied over within the same intrinsic with arraycopy. This means that the performance of primitive array clone exceeds that of primitive array copy. As an example, here are the microbenchmark results on darwin/aarch64:
>
> $ make test TEST="micro:java.lang.ArrayClone" MICRO="JAVA_OPTIONS=-XX:TieredStopAtLevel=1"
> Benchmark                 (size)  Mode  Cnt    Score    Error  Units
> ArrayClone.byteArraycopy       0  avgt   15    3.476 ±  0.018  ns/op
> ArrayClone.byteArraycopy      10  avgt   15    3.740 ±  0.017  ns/op
> ArrayClone.byteArraycopy     100  avgt   15    7.124 ±  0.010  ns/op
> ArrayClone.byteArraycopy    1000  avgt   15   39.301 ±  0.106  ns/op
> ArrayClone.byteClone           0  avgt   15    3.478 ±  0.008  ns/op
> ArrayClone.byteClone          10  avgt   15    3.562 ±  0.007  ns/op
> ArrayClone.byteClone         100  avgt   15    5.888 ±  0.206  ns/op
> ArrayClone.byteClone        1000  avgt   15   25.762 ±  0.203  ns/op
> ArrayClone.intArraycopy        0  avgt   15    3.199 ±  0.016  ns/op
> ArrayClone.intArraycopy       10  avgt   15    4.521 ±  0.008  ns/op
> ArrayClone.intArraycopy      100  avgt   15   17.429 ±  0.039  ns/op
> ArrayClone.intArraycopy     1000  avgt   15  178.432 ±  0.777  ns/op
> ArrayClone.intClone            0  avgt   15    3.406 ±  0.016  ns/op
> ArrayClone.intClone           10  avgt   15    4.272 ±  0.006  ns/op
> ArrayClone.intClone          100  avgt   15   13.110 ±  0.122  ns/op
> ArrayClone.intClone         1000  avgt   15  113.196 ± 13.400  ns/op
>
> It also includes an optimization to avoid instantiating the array copy stub in scenarios like this.
>
> I run hotspot compiler tests successfully limiting them to C1 compilation on darwin/aarch64, linux/x86_64 and linux/686. E.g.
>
> $ make test TEST="hotspot_compiler" JTREG="JAVA_OPTIONS=-XX:TieredStopAtLevel=1"
> ...
> TEST                                        TOTAL  PASS  FAIL  ERROR
> jtreg:test/hotspot/jtreg:hotspot_compiler    1234  1234     0      0
>
> One question I had is what to do about non-primitive object arrays, see my [question](https://bugs.openjdk.org/browse/JDK-8302850?focusedId=14634879&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14634879) on the issue. @cl4es any thoughts?
>
> Thanks @rwestrel for his help shaping this up :)

Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits:

- Merge branch 'master' into topic.0131.c1-array-clone
- Reserve necessary frame map space for clone use cases
- 8302850: C1 primitive array clone intrinsic in graph
  * Combine array length, new type array and arraycopy for clone in c1 graph.
  * Add OmitCheckFlags to skip arraycopy checks.
  * Instantiate ArrayCopyStub only if necessary.
  * Avoid zeroing newly created arrays for clone.
  * Add array null after c1 clone compilation test.
  * Pass force reexecute to intrinsic via value stack. This is needed to be able to deoptimize correctly this intrinsic.
  * When new type array or array copy are used for the clone intrinsic, their state needs to be based on the state before for deoptimization to work as expected.
- Revert "8302850: Primitive array copy C1 intrinsic for aarch64 and x86" This reverts commit fe5d916724614391a685bbef58ea939c84197d07.
- 8302850: Link code emit infos for null check and alloc array
- 8302850: Null check array before getting its length
  * Added a jtreg test to verify the null check works. Without the fix this test fails with a SEGV crash.
- 8302850: Force reexecuting clone in case of a deoptimization
  * Copy state including locals for clone so that reexecution works as expected.
- 8302850: Avoid instantiating array copy stub for clone use cases
- 8302850: Primitive array copy C1 intrinsic for aarch64 and x86
  * Clone calls that involve Phi nodes are not supported.
* Add unimplemented stubs for other platforms. ------------- Changes: https://git.openjdk.org/jdk/pull/17667/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17667&range=05 Stats: 218 lines in 16 files changed: 184 ins; 4 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/17667.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17667/head:pull/17667 PR: https://git.openjdk.org/jdk/pull/17667 From chagedorn at openjdk.org Mon Mar 4 09:24:06 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 09:24:06 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v3] In-Reply-To: References: Message-ID: > In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. > > #### Redo refactoring of `create_bool_from_template_assertion_predicate()` > On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). > > #### Share data graph cloning code - start from existing code > This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: > > 1. Collect data nodes to clone by using a node filter > 2. 
Clone the collected nodes (their data and control inputs still point to the old nodes) > 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. > > #### Shared data graph cloning class > Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: > > 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] > 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] > > `... 
Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Remove dead declaration ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18080/files - new: https://git.openjdk.org/jdk/pull/18080/files/a569132e..79b8b270 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18080.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18080/head:pull/18080 PR: https://git.openjdk.org/jdk/pull/18080 From epeter at openjdk.org Mon Mar 4 09:36:55 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 09:36:55 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v3] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:24:06 GMT, Christian Hagedorn wrote: >> In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. >> >> #### Redo refactoring of `create_bool_from_template_assertion_predicate()` >> On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). 
>> >> #### Share data graph cloning code - start from existing code >> This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: >> >> 1. Collect data nodes to clone by using a node filter >> 2. Clone the collected nodes (their data and control inputs still point to the old nodes) >> 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. >> >> #### Shared data graph cloning class >> Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: >> >> 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] >> 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to ... 
> > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Remove dead declaration Nice, looks better already :) src/hotspot/share/opto/loopnode.hpp line 38: > 36: class BaseCountedLoopEndNode; > 37: class CountedLoopNode; > 38: class DataInputGraph; Suggestion: src/hotspot/share/opto/loopopts.cpp line 4500: > 4498: > 4499: // Clone all nodes in _data_nodes. > 4500: void DataNodeGraph::clone_nodes(Node* new_ctrl) { Suggestion: void DataNodeGraph::clone_data_nodes(Node* new_ctrl) { Then the comment would be obsolete src/hotspot/share/opto/replacednodes.cpp line 211: > 209: } > 210: // Map from current node to cloned/replaced node > 211: OrigToNewHashtable clones(hash_table_size, hash_table_size); Nice. Not your problem here. But should there not be a ResourceMark before this hashtable? There is one at the beginning of the function, but we create many of these hashtables in a loop, without any ResourceMarks in between reclaiming the memory... Ah, but then the hashmaps and stack/to_fix etc would allocate from the ResourceArea, but start at different ResourceMarks... bad idea. Hmm. ------------- Changes requested by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/18080#pullrequestreview-1913808738 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510842256 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510845281 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510859503 From epeter at openjdk.org Mon Mar 4 09:36:55 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 09:36:55 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v3] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:21:31 GMT, Emanuel Peter wrote: >> Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove dead declaration > > src/hotspot/share/opto/loopnode.hpp line 38: > >> 36: class BaseCountedLoopEndNode; >> 37: class CountedLoopNode; >> 38: class DataInputGraph; > > Suggestion: Is also dead ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510852262 From epeter at openjdk.org Mon Mar 4 09:36:56 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 09:36:56 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v3] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 08:00:04 GMT, Christian Hagedorn wrote: >> src/hotspot/share/opto/loopnode.hpp line 1921: >> >>> 1919: rewire_clones_to_cloned_inputs(); >>> 1920: return _orig_to_new; >>> 1921: } >> >> Currently, it looks like one could call `clone` multiple times. But I think that would be wrong, right? >> That is why I'd put all the active logic in the constructor, and only the passive stuff is publicly accessible, with `const` to indicate that these don't have any effect. > > Yes, that would be unexpected, so I agree with you here. 
But as mentioned earlier, we need to add another method to this class later which does the cloning slightly differently, so we cannot do all the work in the constructor. > > We probably have multiple options here: > - Do nothing (could be reasonable as this class is only used rarely and if it's used it's most likely uncommon to clone twice in a row on the same object - and if one does, one probably has a look at the class anyway to notice what to do). > - Add asserts to ensure `clone()` is only called once (adds more code but could be a low overhead option - however, we should think about whether we really want to save the user from itself). > - Return a copy of the hash table and clear it afterward (seems too much overhead for having no such use-case). > > I think options 1 and 2 are both fine. Could you add asserts that `_orig_to_new` is empty before we clone? That would be a check that nothing was cloned yet, and we do not accidentally mix up two clone operations. >> src/hotspot/share/opto/loopopts.cpp line 4519: >> >>> 4517: _orig_to_new.iterate_all([&](Node* node, Node* clone) { >>> 4518: for (uint i = 1; i < node->req(); i++) { >>> 4519: Node** cloned_input = _orig_to_new.get(node->in(i)); >> >> You don't need to check for `is_Phi` on `node->in(i)` anymore? > > Could have added a comment here about the `is_Phi()` drop. The `DataNodeGraph` class already takes a node collection to clone. We therefore do not need to additionally check for `is_Phi()` here. If an input is a phi, it would not have been cloned in the first place because the node collection does not contain phis (L239): > > https://github.com/openjdk/jdk/blob/c00cc8ffaee9bf9b3278d84afba0af2ac00134de/src/hotspot/share/opto/loopPredicate.cpp#L231-L245 Got it, great. I just looked for a matching `is_Phi` in your diff and did not find it. But it is already covered in existing code, great!
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510862784 PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1510847099 From epeter at openjdk.org Mon Mar 4 09:40:56 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 09:40:56 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> <2kIktGLLNDfXbXLEdk1nIKAhMK4_aoGTJAmjgoZj_2k=.ee1a00e8-95e2-4a42-b7d3-4bdb82d981a9@github.com> Message-ID: On Thu, 29 Feb 2024 16:00:15 GMT, Andrew Haley wrote: >> Ah, one more thing: what about a JMH benchmark where you can show off how much this optimization improves runtime? ;) > >> Ah, one more thing: what about a JMH benchmark where you can show off how much this optimization improves runtime? ;) > > We already have benchmarks, but the biggest win due to this change is the opportunity to reduce the load on the scoped value cache. > > At present, high performance depends on the per-thread cache, which is a 16-element OOP array. This is a fairly heavyweight structure for virtual threads, which otherwise have a very small heap footprint. With this optimization I think I can shrink the cache without significant loss of performance in most cases. I might also be able to move this cache to the carrier thread. > > So, this patch significantly moves the balance point in the space/speed tradeoff. @theRealAph that is exciting! It's a bit scary to have over 2000 lines for this optimization; it makes it quite hard to review. But let's keep working on it.
------------- PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-1976131672 From chagedorn at openjdk.org Mon Mar 4 10:07:24 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 10:07:24 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v4] In-Reply-To: References: Message-ID: > In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. > > #### Redo refactoring of `create_bool_from_template_assertion_predicate()` > On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). > > #### Share data graph cloning code - start from existing code > This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: > > 1. Collect data nodes to clone by using a node filter > 2. Clone the collected nodes (their data and control inputs still point to the old nodes) > 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. 
> > #### Shared data graph cloning class > Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: > > 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] > 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] > > `... Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: remove useless declaration, clone_nodes -> clone_data_nodes, add assertion to prevent double-usage of clone() ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18080/files - new: https://git.openjdk.org/jdk/pull/18080/files/79b8b270..14b46ba6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=02-03 Stats: 6 lines in 2 files changed: 1 ins; 2 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18080.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18080/head:pull/18080 PR: https://git.openjdk.org/jdk/pull/18080 From epeter at openjdk.org Mon Mar 4 10:10:46 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 10:10:46 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v4] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 
10:07:24 GMT, Christian Hagedorn wrote: >> In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. >> >> #### Redo refactoring of `create_bool_from_template_assertion_predicate()` >> On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). >> >> #### Share data graph cloning code - start from existing code >> This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: >> >> 1. Collect data nodes to clone by using a node filter >> 2. Clone the collected nodes (their data and control inputs still point to the old nodes) >> 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. >> >> #### Shared data graph cloning class >> Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. 
We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: >> >> 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] >> 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to ... > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > remove useless declaration, clone_nodes -> clone_data_nodes, add assertion to prevent double-usage of clone() Nice refactoring, looking forward to your next PR's on this ;) ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18080#pullrequestreview-1913911663 From chagedorn at openjdk.org Mon Mar 4 10:13:54 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 10:13:54 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v4] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 10:07:24 GMT, Christian Hagedorn wrote: >> In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. >> >> #### Redo refactoring of `create_bool_from_template_assertion_predicate()` >> On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. 
We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). >> >> #### Share data graph cloning code - start from existing code >> This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: >> >> 1. Collect data nodes to clone by using a node filter >> 2. Clone the collected nodes (their data and control inputs still point to the old nodes) >> 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. >> >> #### Shared data graph cloning class >> Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: >> >> 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] >> 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to ... 
> > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > remove useless declaration, clone_nodes -> clone_data_nodes, add assertion to prevent double-usage of clone() Thanks Emanuel for your review and comments! I'll send it out after this one goes in :-) ------------- PR Comment: https://git.openjdk.org/jdk/pull/18080#issuecomment-1976226338 From duke at openjdk.org Mon Mar 4 10:57:51 2024 From: duke at openjdk.org (kuaiwei) Date: Mon, 4 Mar 2024 10:57:51 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: On Fri, 1 Mar 2024 07:39:16 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > clean for other architecture > For aarch64 do we need also change _.m4 files? Anything in aarch64_vector_ files? > > I will run our testing with current patch. I'm not clear about the m4 files. How do we use them? The JDK build script combines all ad files into one single file, which adlc then compiles. So aarch64_vector.ad will be checked as well.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1976309600 From duke at openjdk.org Mon Mar 4 11:07:52 2024 From: duke at openjdk.org (kuaiwei) Date: Mon, 4 Mar 2024 11:07:52 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: On Fri, 1 Mar 2024 21:05:11 GMT, Vladimir Kozlov wrote: > My testing shows that when we do **cross compilation** on linux-x64 I got: > > ``` > Warning: unused operand (no_rax_RegP) > ``` > > Normal linux-x64 build passed. > > The operand is used only in one place in ZGC barriers code: `src/hotspot/cpu/x86/gc/z/z_x86_64.ad` Maybe it is not included during cross compilation. Does the cross compilation disable the zgc feature? In my test, it's used by zgc and there is no warning about it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1976329450 From duke at openjdk.org Mon Mar 4 11:12:04 2024 From: duke at openjdk.org (kuaiwei) Date: Mon, 4 Mar 2024 11:12:04 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: References: Message-ID: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> > Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. > I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it.
kuaiwei has updated the pull request incrementally with one additional commit since the last revision: move no_rax_RegP from x86_64.ad to z_x86_64.ad and comment out immLRot2 in arm_32.ad ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18075/files - new: https://git.openjdk.org/jdk/pull/18075/files/faa8f949..29514638 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18075&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18075&range=01-02 Stats: 33 lines in 3 files changed: 12 ins; 12 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/18075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18075/head:pull/18075 PR: https://git.openjdk.org/jdk/pull/18075 From duke at openjdk.org Mon Mar 4 11:12:05 2024 From: duke at openjdk.org (kuaiwei) Date: Mon, 4 Mar 2024 11:12:05 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: On Fri, 1 Mar 2024 23:29:53 GMT, Vladimir Kozlov wrote: > Testing build with moved `operand no_rax_RegP` passed. Please update changes with this: > > ``` > diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad > index d43929efd3e..aef3453b0b1 100644 > --- a/src/hotspot/cpu/x86/x86_64.ad > +++ b/src/hotspot/cpu/x86/x86_64.ad > @@ -2663,18 +2561,6 @@ operand rRegN() %{ > // the RBP is used as a proper frame pointer and is not included in ptr_reg. As a > // result, RBP is not included in the output of the instruction either. > > -operand no_rax_RegP() > -%{ > - constraint(ALLOC_IN_RC(ptr_no_rax_reg)); > - match(RegP); > - match(rbx_RegP); > - match(rsi_RegP); > - match(rdi_RegP); > - > - format %{ %} > - interface(REG_INTER); > -%} > - > // This operand is not allowed to use RBP even if > // RBP is not used to hold the frame pointer. 
> operand no_rbp_RegP() > diff --git a/src/hotspot/cpu/x86/gc/z/z_x86_64.ad b/src/hotspot/cpu/x86/gc/z/z_x86_64.ad > index d178805dfc7..0cc2ea03b35 100644 > --- a/src/hotspot/cpu/x86/gc/z/z_x86_64.ad > +++ b/src/hotspot/cpu/x86/gc/z/z_x86_64.ad > @@ -99,6 +99,18 @@ static void z_store_barrier(MacroAssembler& _masm, const MachNode* node, Address > > %} > > +operand no_rax_RegP() > +%{ > + constraint(ALLOC_IN_RC(ptr_no_rax_reg)); > + match(RegP); > + match(rbx_RegP); > + match(rsi_RegP); > + match(rdi_RegP); > + > + format %{ %} > + interface(REG_INTER); > +%} > + > // Load Pointer > instruct zLoadP(rRegP dst, memory mem, rFlagsReg cr) > %{ > ``` updated. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1976336819 From epeter at openjdk.org Mon Mar 4 11:53:07 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 11:53:07 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" Message-ID: Taking this over from @fg1417, she seems to be "on leave" according to her GitHub account. I'm taking her regression test (she improved the reproducer that I had originally reported the bug with). It is ok to just remove the assert, because the address is "sanitized" after the assert, i.e. we use a `lea` instruction to compute the address. But I'm simply removing the assert, as suggested in the comments of the previous [PR](https://github.com/openjdk/jdk/pull/16991#issuecomment-1962307596). 
------------- Commit messages: - remove the assert - 8319690 Changes: https://git.openjdk.org/jdk/pull/18103/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18103&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8319690 Stats: 176 lines in 2 files changed: 172 ins; 4 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18103.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18103/head:pull/18103 PR: https://git.openjdk.org/jdk/pull/18103 From epeter at openjdk.org Mon Mar 4 11:53:07 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 11:53:07 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:46:56 GMT, Emanuel Peter wrote: > Taking this over from @fg1417, she seems to be "on leave" according to her GitHub account. > > I'm taking her regression test (she improved the reproducer that I had originally reported the bug with). > > It is ok to just remove the assert, because the address is "sanitized" after the assert, i.e. we use a `lea` instruction to compute the address. > > But I'm simply removing the assert, as suggested in the comments of the previous [PR](https://github.com/openjdk/jdk/pull/16991#issuecomment-1962307596). @fg1417 wrote a first [PR](https://github.com/openjdk/jdk/pull/16991), but gave it up. I'm taking over the test, but not her fix. 
(I had found the original reproducer, but she improved the test further, so I want to give her credit for that) ------------- PR Comment: https://git.openjdk.org/jdk/pull/18103#issuecomment-1976168583 From galder at openjdk.org Mon Mar 4 12:12:43 2024 From: galder at openjdk.org (Galder Zamarreño) Date: Mon, 4 Mar 2024 12:12:43 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays In-Reply-To: References: Message-ID: On Thu, 8 Feb 2024 02:17:25 GMT, Dean Long wrote: >> Adding C1 intrinsic for primitive array clone invocations for aarch64 and x86 architectures. >> >> The intrinsic includes a change to avoid zeroing the newly allocated array because its contents are copied over within the same intrinsic with arraycopy. This means that the performance of primitive array clone exceeds that of primitive array copy. As an example, here are the microbenchmark results on darwin/aarch64: >> >> >> $ make test TEST="micro:java.lang.ArrayClone" MICRO="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 3.476 ± 0.018 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 3.740 ± 0.017 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 7.124 ± 0.010 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 39.301 ± 0.106 ns/op >> ArrayClone.byteClone 0 avgt 15 3.478 ± 0.008 ns/op >> ArrayClone.byteClone 10 avgt 15 3.562 ± 0.007 ns/op >> ArrayClone.byteClone 100 avgt 15 5.888 ± 0.206 ns/op >> ArrayClone.byteClone 1000 avgt 15 25.762 ± 0.203 ns/op >> ArrayClone.intArraycopy 0 avgt 15 3.199 ± 0.016 ns/op >> ArrayClone.intArraycopy 10 avgt 15 4.521 ± 0.008 ns/op >> ArrayClone.intArraycopy 100 avgt 15 17.429 ± 0.039 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 178.432 ± 0.777 ns/op >> ArrayClone.intClone 0 avgt 15 3.406 ± 0.016 ns/op >> ArrayClone.intClone 10 avgt 15 4.272 ± 0.006 ns/op >> ArrayClone.intClone 100 avgt 15 13.110 ±
0.122 ns/op >> ArrayClone.intClone 1000 avgt 15 113.196 ± 13.400 ns/op >> >> It also includes an optimization to avoid instantiating the array copy stub in scenarios like this. >> >> I ran the hotspot compiler tests successfully, limiting them to C1 compilation on darwin/aarch64, linux/x86_64 and linux/686. E.g. >> >> >> $ make test TEST="hotspot_compiler" JTREG="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> ... >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:hotspot_compiler 1234 1234 0 0 >> >> >> One question I had is what to do about non-primitive object arrays, see my [question](https://bugs.openjdk.org/browse/JDK-8302850?focusedId=14634879&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14634879) on the issue. @cl4es any thoughts? >>... > I think the right solution would be to add a line in `GraphBuilder::build_graph_for_intrinsic` for _clone, to append the IR for NewTypeArray and ArrayCopy as if we parsed newarray and arraycopy() from bytecodes. I'll see if I can get that working tomorrow. @dean-long any chance you could have another look at this? ------------- PR Comment: https://git.openjdk.org/jdk/pull/17667#issuecomment-1976440912 From roland at openjdk.org Mon Mar 4 12:35:56 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 4 Mar 2024 12:35:56 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> Message-ID: On Mon, 26 Feb 2024 14:04:06 GMT, Emanuel Peter wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> 32 bit build fix > > src/hotspot/share/opto/callGenerator.cpp line 854: > >> 852: >> 853: // Pattern matches: >> 854: // if ((objects = scopedValueCache()) != null) { > > Suggestion: > > // if (scopedValueCache() != null) { > > You don't use `objects` here, so it just confused me.
I use snippets from the java code for `ScopedValue.get()` in the comments so it's easier to see what's being pattern matched. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1511095016 From chagedorn at openjdk.org Mon Mar 4 12:48:55 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Mar 2024 12:48:55 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v3] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:31:28 GMT, Emanuel Peter wrote: >> Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove dead declaration > > src/hotspot/share/opto/replacednodes.cpp line 211: > >> 209: } >> 210: // Map from current node to cloned/replaced node >> 211: OrigToNewHashtable clones(hash_table_size, hash_table_size); > > Nice. > Not your problem here. But should there not be a ResourceMark before this hashtable? There is one at the beginning of the function, but we create many of these hashtables in a loop, without any ResourceMarks in between reclaiming the memory... > Ah, but then the hashmaps and stack/to_fix etc. would allocate from the ResourceArea, but start at different ResourceMarks... bad idea. Hmm. As discussed offline, we should probably go over all uses of resource allocated things like `Node_Lists`, `Node_Stack` etc. at some point and check if there are missing resource marks. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18080#discussion_r1511109952 From epeter at openjdk.org Mon Mar 4 13:32:16 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 13:32:16 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v13] In-Reply-To: References: Message-ID: > This is a feature requested by @RogerRiggs and @cl4es . > > **Idea** > > Merging multiple consecutive small stores (e.g.
8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. > > This patch here supports a few simple use-cases, like these: > > Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 > > Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 > > The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 > > **Details** > > This draft currently implements the optimization in an additional special IGVN phase: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 > > We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. 
We essentially try to establish a chain of mergable stores: > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 > > Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either both store constants, or adjacent segments of a larger value ... Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: - allow only array stores of same type as container - mismatched access test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16245/files - new: https://git.openjdk.org/jdk/pull/16245/files/8b3a2769..9e642aac Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=11-12 Stats: 77 lines in 3 files changed: 58 ins; 0 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/16245.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16245/head:pull/16245 PR: https://git.openjdk.org/jdk/pull/16245 From epeter at openjdk.org Mon Mar 4 13:45:49 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 13:45:49 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v4] In-Reply-To: References: Message-ID: On Mon, 29 Jan 2024 12:00:32 GMT, Tobias Hartmann wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Add diagnostic flag MergeStores > > Great work, Emanuel. > > I think this is a well encapsulated optimization for a supposedly common code pattern requested by core libraries folks. I agree with Vladimir, that it would be nice to support this as part of the autovectorizer but that is probably not going to happen anytime soon. 
Until then, going with this separate phase would allow us to add support (and tests) for additional code patterns if requests come in and potentially move this to the autovectorizer later. @TobiHartmann @vnkozlov I now check for `AryPtr`. And I think that just marking with "mismatched" must be sufficient. Because if you do an unsafe store with a different memory size, then it is just marked as "mismatched" too. So if I now trigger bugs with this patch, then the bugs were pre-existing and could have been created using unsafe. For example a `StoreB` on an int array: `UNSAFE.putByte(a, UNSAFE.ARRAY_INT_BASE_OFFSET + 3, (byte)0xf4);` `74 StoreB === 42 64 73 70 [[ 16 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=4; mismatched unsafe Memory: @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact[0] *, idx=4; !jvms: Test::test8 @ bci:73 (line 78)` There are no barriers around these stores. Of course that would be very different on fields. Fields end up on different slices, and hence you would have to be more careful there. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1976611894 From rcastanedalo at openjdk.org Mon Mar 4 15:21:02 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 4 Mar 2024 15:21:02 GMT Subject: RFR: 8327224: G1: comment in G1BarrierSetC2::post_barrier() refers to nonexistent new_deferred_store_barrier() Message-ID: This changeset updates a comment in `G1BarrierSetC2::post_barrier()` to point to the relevant code that must be kept in sync.
------------- Commit messages: - Update comment Changes: https://git.openjdk.org/jdk/pull/18108/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18108&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327224 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18108.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18108/head:pull/18108 PR: https://git.openjdk.org/jdk/pull/18108 From duke at openjdk.org Mon Mar 4 15:26:52 2024 From: duke at openjdk.org (ExE Boss) Date: Mon, 4 Mar 2024 15:26:52 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v13] In-Reply-To: References: Message-ID: <2bsO7BlvpkwQBZ8P19gqOVQQEXua1p7Glnl4WdUjn6g=.87e3fc82-6d37-45e2-ac0e-e69382732dc0@github.com> On Mon, 4 Mar 2024 13:32:16 GMT, Emanuel Peter wrote: >> This is a feature requested by @RogerRiggs and @cl4es . >> >> **Idea** >> >> Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. >> >> This patch here supports a few simple use-cases, like these: >> >> Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 >> >> Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e.
shifting and truncation), and directly store the variable: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 >> >> The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 >> >> **Details** >> >> This draft currently implements the optimization in an additional special IGVN phase: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 >> >> We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. We essentially try to establish a chain of mergable stores: >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 >> >> Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either bot... 
> > Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: > > - allow only array stores of same type as container > - mismatched access test Do we also have tests that a compiled method with merged stores like:

    static void storeLongLE(byte[] bytes, int offset, long value) {
        bytes[offset + 0] = (byte) (value >> 0);
        bytes[offset + 1] = (byte) (value >> 8);
        bytes[offset + 2] = (byte) (value >> 16);
        bytes[offset + 3] = (byte) (value >> 24);
        bytes[offset + 4] = (byte) (value >> 32);
        bytes[offset + 5] = (byte) (value >> 40);
        bytes[offset + 6] = (byte) (value >> 48);
        bytes[offset + 7] = (byte) (value >> 56);
    }

still produces the correct result even when only a part of the stores fits into the array, e.g.:

    var arr = new byte[4];
    try {
        // storeLongLE is already C2 compiled with merged stores:
        storeLongLE(arr, 0, -1L);
        throw new AssertionError("Expected ArrayIndexOutOfBoundsException");
    } catch (ArrayIndexOutOfBoundsException _) {
        // ignore
    }
    assertTrue(
        Byte.toUnsignedInt(arr[0]) == 0xFF &&
        Byte.toUnsignedInt(arr[1]) == 0xFF &&
        Byte.toUnsignedInt(arr[2]) == 0xFF &&
        Byte.toUnsignedInt(arr[3]) == 0xFF
    );

------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1976835427 From psandoz at openjdk.org Mon Mar 4 16:24:59 2024 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 4 Mar 2024 16:24:59 GMT Subject: RFR: 8318650: Optimized subword gather for x86 targets. [v17] In-Reply-To: References: Message-ID: On Sat, 2 Mar 2024 16:22:22 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features.
>> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-word gather using hybrid algorithm which initially partially unrolls scalar loop to accumulates values from gather indices into a quadword(64bit) slice followed by vector permutation to place the slice into appropriate vector lanes, it prevents code bloating and generates compact JIT sequence. This coupled with savings from expansive array allocation in existing java implementation translates into significant performance of 1.5-10x gains with included micro. >> >> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d) >> >> >> 2) Patch was also compared against modified java fallback implementation by replacing temporary array allocation with zero initialized vector and a scalar loops which inserts gathered values into vector. But, vector insert operation in higher vector lanes is a three step process which first extracts the upper vector 128 bit lane, updates it with gather subword value and then inserts the lane back to its original position. This makes inserts into higher order lanes costly w.r.t to proposed solution. In addition generated JIT code for modified fallback implementation was very bulky. This may impact in-lining decisions into caller contexts. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review resolutions. Marked as reviewed by psandoz (Reviewer). 
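For readers unfamiliar with the operation being intrinsified: a subword gather loads 8- or 16-bit elements at a set of indices, and the hybrid algorithm described above first accumulates gathered elements into a 64-bit (quadword) slice before permuting slices into vector lanes. The scalar Java sketch below illustrates only that accumulation step; the name `gatherToQuadword` and the little-endian lane order are illustrative assumptions, not code from the patch:

```java
public class SubwordGatherSketch {
    // Scalar sketch of a subword (16-bit) gather: pack four elements,
    // loaded at the given indices, into one 64-bit "quadword" slice.
    // Lane 0 lands in the lowest 16 bits (little-endian lane order).
    static long gatherToQuadword(short[] src, int[] indexMap, int base) {
        long slice = 0L;
        for (int lane = 0; lane < 4; lane++) {
            long element = src[indexMap[base + lane]] & 0xFFFFL;
            slice |= element << (16 * lane);
        }
        return slice;
    }

    public static void main(String[] args) {
        short[] src = {0x1111, 0x2222, 0x3333, 0x4444};
        int[] indexMap = {3, 2, 1, 0};
        // Lane 0 takes src[3], lane 1 takes src[2], and so on.
        System.out.println(Long.toHexString(gatherToQuadword(src, indexMap, 0))); // 1111222233334444
    }
}
```

A vectorized implementation would then fill a whole vector register from such slices with a single permute, instead of inserting each gathered element into its vector lane individually, which (as noted above) is a costly multi-step operation for the upper lanes.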
------------- PR Review: https://git.openjdk.org/jdk/pull/16354#pullrequestreview-1914752388 From duke at openjdk.org Mon Mar 4 16:47:48 2024 From: duke at openjdk.org (Yuri Gaevsky) Date: Mon, 4 Mar 2024 16:47:48 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v2] In-Reply-To: References: Message-ID: <2RMDuzc9kW5ibUPCew-gaU_70OH_9c8hQYrrldWYrhQ=.6390d207-e989-48a0-a309-ccf9e6898018@github.com> On Thu, 25 Jan 2024 14:47:47 GMT, Yuri Gaevsky wrote: >> The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. >> >> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. > > Yuri Gaevsky has updated the pull request incrementally with two additional commits since the last revision: > > - num_8b_elems_in_vec --> nof_vec_elems > - Removed checks for (MaxVectorSize >= 16) per @RealFYang suggestion. "Please keep me active" comment. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-1977022513 From aph at openjdk.org Mon Mar 4 16:55:52 2024 From: aph at openjdk.org (Andrew Haley) Date: Mon, 4 Mar 2024 16:55:52 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:46:56 GMT, Emanuel Peter wrote: > Taking this over from @fg1417, she seems to be "on leave" according to her GitHub account. > > I'm taking her regression test (she improved the reproducer that I had originally reported the bug with). > > It is ok to just remove the assert, because the address is "sanitized" after the assert, i.e. we use a `lea` instruction to compute the address. > > But I'm simply removing the assert, as suggested in the comments of the previous [PR](https://github.com/openjdk/jdk/pull/16991#issuecomment-1962307596). Yes, that's right. I wrote that assertion for my own information, this isn't really a bug. 
I might rework this whole area of the compiler in the future, but there's no urgency. Thanks. ------------- Marked as reviewed by aph (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18103#pullrequestreview-1914822754 From dchuyko at openjdk.org Mon Mar 4 17:36:19 2024 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Mon, 4 Mar 2024 17:36:19 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v28] In-Reply-To: References: Message-ID: > Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack. > > A matching directive will be applied at method compilation time when such compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug, issues such a directive but this does not affect the application behavior. In such case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. > > It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypass inlined methods). > > Natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. 
Prior to that, we can try to re-compile the method, letting the compile broker perform it while taking the new directives stack into account. Re-compilation helps to prevent hot methods from executing in the interpreter. > > A new flag `-r` has been introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and puts methods that have any active non-default matching compiler directives up for re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction as to which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. > > In addition, a new diagnostic command `Compiler.replace_directives` has been added for ...
and 36 more: https://git.openjdk.org/jdk/compare/59529a92...a4578277 ------------- Changes: https://git.openjdk.org/jdk/pull/14111/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14111&range=27 Stats: 381 lines in 15 files changed: 348 ins; 3 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/14111.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14111/head:pull/14111 PR: https://git.openjdk.org/jdk/pull/14111 From kvn at openjdk.org Mon Mar 4 17:44:43 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 4 Mar 2024 17:44:43 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> References: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> Message-ID: On Mon, 4 Mar 2024 11:12:04 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > move no_rax_RegP from x86_64.ad to z_x86_64.ad and comment out immLRot2 in arm_32.ad What are latest changes (commented `operand immLRot2()`) in `arm_32.ad` for? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1977126409 From kvn at openjdk.org Mon Mar 4 17:44:43 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 4 Mar 2024 17:44:43 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: <7QZA7VSO1iYpL2tFdMDWlz7qmehaYtNDdsWDoWWbbEo=.5f7e8dd4-0d18-4bf7-9244-650c970e0166@github.com> On Mon, 4 Mar 2024 11:05:27 GMT, kuaiwei wrote: >> The operand is used only in one place in ZGC barriers code: src/hotspot/cpu/x86/gc/z/z_x86_64.ad Maybe it is not included during cross compilation. > Does the cross compilation disable the zgc feature? In my test, it's used by zgc and there is no warning about it. I am not sure what happened there in our testing and did not have time to investigate. The patch worked and it was enough for me. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1977130438 From kvn at openjdk.org Mon Mar 4 17:58:53 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 4 Mar 2024 17:58:53 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> References: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> Message-ID: <-otSCbPTibs2l97RfYxBNnTGUjRnUft6UrIVk33woSQ=.c1604e01-0f8b-46ea-9cd8-809775f669b9@github.com> On Mon, 4 Mar 2024 11:12:04 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platforms. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it.
> > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > move no_rax_RegP from x86_64.ad to z_x86_64.ad and comment out immLRot2 in arm_32.ad @dholmes-ora asked to disable the warning by default and I agree. We can use the `AD._disable_warnings` flag to guard these warnings and add a corresponding `-w` flag to the `adlc` command in `GensrcAdlc.gmk` ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1977152941 From kvn at openjdk.org Mon Mar 4 18:20:53 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 4 Mar 2024 18:20:53 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: On Mon, 4 Mar 2024 10:55:22 GMT, kuaiwei wrote: > I'm not clear about the m4 file. How do we use it? > The build script of the jdk will combine all ad files into one single file and adlc will compile it. So aarch64_vector.ad will be checked as well. Aarch64 m4 files are used to manually update .ad files. My concern was that there could be overlapping code in the m4 files which might overwrite your changes in .ad when someone does such a manual update in the future. Fortunately `aarch64_ad.m4` does not have operand definitions so your changes are fine. But `aarch64_vector_ad.m4` has them, so if we need to change `aarch64_vector.ad` we need to modify the m4 file too. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1977189798 From kvn at openjdk.org Mon Mar 4 18:23:44 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 4 Mar 2024 18:23:44 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:46:56 GMT, Emanuel Peter wrote: > Taking this over from @fg1417, she seems to be "on leave" according to her GitHub account.
> > I'm taking her regression test (she improved the reproducer that I had originally reported the bug with). > > It is ok to just remove the assert, because the address is "sanitized" after the assert, i.e. we use a `lea` instruction to compute the address. > > But I'm simply removing the assert, as suggested in the comments of the previous [PR](https://github.com/openjdk/jdk/pull/16991#issuecomment-1962307596). Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18103#pullrequestreview-1914996677 From epeter at openjdk.org Mon Mar 4 18:25:56 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 18:25:56 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v13] In-Reply-To: <2bsO7BlvpkwQBZ8P19gqOVQQEXua1p7Glnl4WdUjn6g=.87e3fc82-6d37-45e2-ac0e-e69382732dc0@github.com> References: <2bsO7BlvpkwQBZ8P19gqOVQQEXua1p7Glnl4WdUjn6g=.87e3fc82-6d37-45e2-ac0e-e69382732dc0@github.com> Message-ID: On Mon, 4 Mar 2024 15:24:23 GMT, ExE Boss wrote: >> Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: >> >> - allow only array stores of same type as container >> - mismatched access test > > Do we also have tests that a compiled method with merged stores like: > > static void storeLongLE(byte[] bytes, int offset, long value) { > bytes[offset + 0] = (byte) (value >> 0); > bytes[offset + 1] = (byte) (value >> 8); > bytes[offset + 2] = (byte) (value >> 16); > bytes[offset + 3] = (byte) (value >> 24); > bytes[offset + 4] = (byte) (value >> 32); > bytes[offset + 5] = (byte) (value >> 40); > bytes[offset + 6] = (byte) (value >> 48); > bytes[offset + 7] = (byte) (value >> 56); > } > > > still produce the correct result even when only a part of the stores fit into the array, e.g.: > > var arr = new byte[4]; > try { > // storeLongLE is already C2 compiled with merged stores: > storeLongLE(arr, 0, -1L); > > throw
new AssertionError("Expected ArrayIndexOutOfBoundsException"); > } catch (ArrayIndexOutOfBoundsException _) { > // ignore > } > > assertTrue( > Byte.toUnsignedInt(arr[0]) == 0xFF > && Byte.toUnsignedInt(arr[1]) == 0xFF > && Byte.toUnsignedInt(arr[2]) == 0xFF > && Byte.toUnsignedInt(arr[3]) == 0xFF > ); @ExE-Boss I am working on such a test, thanks for the suggestion! ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1977197535 From epeter at openjdk.org Mon Mar 4 18:33:09 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 18:33:09 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v13] In-Reply-To: <2bsO7BlvpkwQBZ8P19gqOVQQEXua1p7Glnl4WdUjn6g=.87e3fc82-6d37-45e2-ac0e-e69382732dc0@github.com> References: <2bsO7BlvpkwQBZ8P19gqOVQQEXua1p7Glnl4WdUjn6g=.87e3fc82-6d37-45e2-ac0e-e69382732dc0@github.com> Message-ID: On Mon, 4 Mar 2024 15:24:23 GMT, ExE Boss wrote: >> Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: >> >> - allow only array stores of same type as container >> - mismatched access test > > Do we also have tests that a compiled method with merged stores like: > > static void storeLongLE(byte[] bytes, int offset, long value) { > bytes[offset + 0] = (byte) (value >> 0); > bytes[offset + 1] = (byte) (value >> 8); > bytes[offset + 2] = (byte) (value >> 16); > bytes[offset + 3] = (byte) (value >> 24); > bytes[offset + 4] = (byte) (value >> 32); > bytes[offset + 5] = (byte) (value >> 40); > bytes[offset + 6] = (byte) (value >> 48); > bytes[offset + 7] = (byte) (value >> 56); > } > > > still produce the correct result even when only a part of the stores fit into the array, e.g.: > > var arr = new byte[4]; > try { > // storeLongLE is already C2 compiled with merged stores: > storeLongLE(arr, 0, -1L); > > throw new AssertionError("Expected ArrayIndexOutOfBoundsException"); > } catch
(ArrayIndexOutOfBoundsException _) { > // ignore > } > > assertTrue( > Byte.toUnsignedInt(arr[0]) == 0xFF > && Byte.toUnsignedInt(arr[1]) == 0xFF > && Byte.toUnsignedInt(arr[2]) == 0xFF > && Byte.toUnsignedInt(arr[3]) == 0xFF > ); @ExE-Boss I have an example, but the IR rules are not yet passing. Need to investigate tomorrow. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1977206515 From epeter at openjdk.org Mon Mar 4 18:33:09 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Mar 2024 18:33:09 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v14] In-Reply-To: References: Message-ID: > This is a feature requested by @RogerRiggs and @cl4es . > > **Idea** > > Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. > > This patch here supports a few simple use-cases, like these: > > Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 > > Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e.
shifting and truncation), and directly store the variable: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 > > The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 > > **Details** > > This draft currently implements the optimization in an additional special IGVN phase: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 > > We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. We essentially try to establish a chain of mergable stores: > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 > > Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either both store constants, or adjacent segments of a larger value ... 
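The shift-split store shape described above can be sketched in plain Java. This is an illustrative sketch with hypothetical helper names (`putIntLE`, `getIntLE`), not code taken from the patch's test files; it shows the four adjacent byte stores that the optimization could merge into one 4-byte store, plus a round-trip check:

```java
public class MergeStoresSketch {
    // Four adjacent byte stores of one int split with shifts -- the shape
    // that could be merged into a single 4-byte little-endian store.
    static void putIntLE(byte[] a, int off, int v) {
        a[off + 0] = (byte) (v >>  0);
        a[off + 1] = (byte) (v >>  8);
        a[off + 2] = (byte) (v >> 16);
        a[off + 3] = (byte) (v >> 24);
    }

    // Reassemble the bytes to check that the split stores round-trip.
    static int getIntLE(byte[] a, int off) {
        return (a[off] & 0xFF)
                | (a[off + 1] & 0xFF) << 8
                | (a[off + 2] & 0xFF) << 16
                | (a[off + 3] & 0xFF) << 24;
    }

    public static void main(String[] args) {
        byte[] a = new byte[8];
        putIntLE(a, 2, 0xCAFEBABE);
        if (getIntLE(a, 2) != 0xCAFEBABE) {
            throw new AssertionError("round-trip failed");
        }
        System.out.println("ok");
    }
}
```

Whether merged or not, the bytes written must be identical, which is what a correctness test over such a helper would assert.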
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: WIP test with out of bounds exception ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16245/files - new: https://git.openjdk.org/jdk/pull/16245/files/9e642aac..638c80f4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=12-13 Stats: 227 lines in 2 files changed: 142 ins; 0 del; 85 mod Patch: https://git.openjdk.org/jdk/pull/16245.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16245/head:pull/16245 PR: https://git.openjdk.org/jdk/pull/16245 From kvn at openjdk.org Mon Mar 4 18:37:50 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 4 Mar 2024 18:37:50 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v14] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 18:33:09 GMT, Emanuel Peter wrote: >> This is a feature requested by @RogerRiggs and @cl4es . >> >> **Idea** >> >> Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. >> >> This patch here supports a few simple use-cases, like these: >> >> Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 >> >> Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e.
shifting and truncation), and directly store the variable: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 >> >> The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 >> >> **Details** >> >> This draft currently implements the optimization in an additional special IGVN phase: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 >> >> We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. We essentially try to establish a chain of mergable stores: >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 >> >> Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either bot... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > WIP test with out of bounds exception This looks good now. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/16245#pullrequestreview-1915022379 From sviswanathan at openjdk.org Mon Mar 4 20:24:56 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Mon, 4 Mar 2024 20:24:56 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> Message-ID: On Fri, 1 Mar 2024 06:09:30 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> Update description of Poly1305 algo > > src/hotspot/cpu/x86/assembler_x86.cpp line 9115: > >> 9113: >> 9114: void Assembler::vpunpcklqdq(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) { >> 9115: assert(UseAVX > 0, "requires some form of AVX"); > > Add appropriate AVX512VL assertion The VL assertion is already being done as part of vex_prefix_and_encode() and vex_prefix() so no need to add it here. That's why we don't have this assertion in any of the AVX instructions which are promotable to EVEX e.g. vpadd, vpsub, etc. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1511747533 From duke at openjdk.org Mon Mar 4 21:40:04 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 4 Mar 2024 21:40:04 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v11] In-Reply-To: References: Message-ID: > The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. > > This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) > > This PR shows up to 19x speedup on buffer sizes of 1MB.
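The IFMA instructions behind this patch (`vpmadd52luq`/`vpmadd52huq`) accumulate the low and high 52 bits of a 52x52-bit product per 64-bit lane. The following is my own illustrative scalar model of that per-lane semantics, cross-checked against BigInteger; it is not code from the patch:

```java
import java.math.BigInteger;

public class Ifma52Sketch {
    static final long MASK52 = (1L << 52) - 1;

    // Model of one 64-bit lane of vpmadd52luq: accumulate the low 52 bits
    // of the 104-bit product of two 52-bit inputs.
    static long madd52lo(long acc, long a, long b) {
        return acc + ((a * b) & MASK52);
    }

    // Model of one 64-bit lane of vpmadd52huq: accumulate the high 52 bits
    // of the same 104-bit product.
    static long madd52hi(long acc, long a, long b) {
        long lo = a * b;                   // low 64 bits of the product
        long hi = Math.multiplyHigh(a, b); // high 64 bits (a, b < 2^52, so signed is fine)
        return acc + (((hi << 12) | (lo >>> 52)) & MASK52);
    }

    public static void main(String[] args) {
        long a = 0xF_FFFF_FFFF_FFFFL; // 52-bit operands
        long b = 0xA_BCDE_F012_3456L;
        BigInteger p = BigInteger.valueOf(a).multiply(BigInteger.valueOf(b));
        long expLo = p.and(BigInteger.valueOf(MASK52)).longValueExact();
        long expHi = p.shiftRight(52).longValueExact();
        if (madd52lo(0, a, b) != expLo || madd52hi(0, a, b) != expHi) {
            throw new AssertionError("52-bit limb model mismatch");
        }
        System.out.println("ok");
    }
}
```

Splitting operands into 52-bit limbs like this is what lets the polynomial accumulation carry lazily, which is the core of the speedup discussed here.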
Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: update asserts for vpmadd52l/hq ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17881/files - new: https://git.openjdk.org/jdk/pull/17881/files/b869d874..4a74a773 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17881&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17881&range=09-10 Stats: 8 lines in 1 file changed: 4 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/17881.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17881/head:pull/17881 PR: https://git.openjdk.org/jdk/pull/17881 From duke at openjdk.org Mon Mar 4 21:40:05 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 4 Mar 2024 21:40:05 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> Message-ID: <_6dorzq67KAZsTBHBvbQRDi_xW70bFhJudnxbG88m6I=.33e06bd5-d5fc-4ba8-b740-437155d567cf@github.com> On Fri, 1 Mar 2024 17:02:35 GMT, Sandhya Viswanathan wrote: >> src/hotspot/cpu/x86/assembler_x86.cpp line 5148: >> >>> 5146: assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); >>> 5147: InstructionMark im(this); >>> 5148: InstructionAttr attributes(vector_len, /* rex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true); >> >> uses_vl should be false here. >> >> BTW, this assertion looks very fuzzy, you are checking for two target features in one instruction; apparently, the instruction is meant to use AVX512_IFMA only for 512-bit vector length, and for narrower vectors it needs AVX_IFMA.
>> >> Let's either keep this strictly for AVX_IFMA; for AVX512_IFMA we already have evpmadd52[l/h]uq. If you truly want to make this a generic one, then split the assertion >> >> `assert((avx_ifma && vector_len <= 256) || (avx512_ifma && (vector_len == 512 || VM_Version::support_vl())));` >> >> And then you may pass uses_vl as true. > > It would be good to make this instruction generic. Please see the updated assert as suggested for vpmadd52[l/h]uq in the latest commit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1511830438 From duke at openjdk.org Mon Mar 4 21:40:05 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 4 Mar 2024 21:40:05 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> Message-ID: On Fri, 1 Mar 2024 08:16:38 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> Update description of Poly1305 algo > > src/hotspot/cpu/x86/assembler_x86.cpp line 5157: > >> 5155: void Assembler::vpmadd52luq(XMMRegister dst, XMMRegister src1, XMMRegister src2, int vector_len) { >> 5156: assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); >> 5157: InstructionAttr attributes(vector_len, /* rex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true); > > uses_vl should be false. Please see the updated assert as suggested for vpmadd52[l/h]uq in the latest commit. > src/hotspot/cpu/x86/assembler_x86.cpp line 5183: > >> 5181: assert(vector_len == AVX_512bit ?
VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); >> 5182: InstructionMark im(this); >> 5183: InstructionAttr attributes(vector_len, /* rex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true); > > uses_vl should be false. Please see the updated assert as suggested for vpmadd52[l/h]uq in the latest commit. > src/hotspot/cpu/x86/assembler_x86.cpp line 5191: > >> 5189: >> 5190: void Assembler::vpmadd52huq(XMMRegister dst, XMMRegister src1, XMMRegister src2, int vector_len) { >> 5191: assert(vector_len == AVX_512bit ? VM_Version::supports_avx512ifma() : VM_Version::supports_avxifma(), ""); > > Same as above. Please see the updated assert as suggested for vpmadd52[l/h]uq in the latest commit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1511830567 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1511830720 PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1511830942 From kbarrett at openjdk.org Tue Mar 5 00:00:52 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 5 Mar 2024 00:00:52 GMT Subject: RFR: 8327224: G1: comment in G1BarrierSetC2::post_barrier() refers to nonexistent new_deferred_store_barrier() In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 15:04:57 GMT, Roberto Castañeda Lozano wrote: > This changeset updates a comment in `G1BarrierSetC2::post_barrier()` to point to the relevant code that must be kept in sync. Looks good, and trivial. ------------- Marked as reviewed by kbarrett (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/18108#pullrequestreview-1915593691 From duke at openjdk.org Tue Mar 5 00:08:05 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 5 Mar 2024 00:08:05 GMT Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs [v2] In-Reply-To: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: > The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. > > Below is the performance data on an Intel Tiger Lake machine. > > Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup > -- | -- | -- | -- > MathBench.ceilDouble | 547979 | 2170198 | 3.96 > MathBench.floorDouble | 547979 | 2167459 | 3.96 > MathBench.rintDouble | 547962 | 2130499 | 3.89 Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: unify the implementation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18089/files - new: https://git.openjdk.org/jdk/pull/18089/files/e8e3b9db..0401e18e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18089&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18089&range=00-01 Stats: 26 lines in 2 files changed: 0 ins; 25 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18089/head:pull/18089 PR: https://git.openjdk.org/jdk/pull/18089 From jkarthikeyan at openjdk.org Tue Mar 5 03:32:01
2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 5 Mar 2024 03:32:01 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v5] In-Reply-To: References: Message-ID: > Hi all, I've created this patch which aims to convert common integer minimum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`. While this transform doesn't generally improve code generation on its own, it simplifies control flow and creates new opportunities for vectorization. > > I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: > > Baseline Patch Improvement > Benchmark Mode Cnt Score Error Units Score Error Units > IfMinMax.testReductionInt avgt 15 500.307 ± 16.687 ns/op 509.383 ± 32.645 ns/op (no change)* > IfMinMax.testReductionLong avgt 15 493.184 ± 17.596 ns/op 513.587 ± 28.339 ns/op (no change)* > IfMinMax.testSingleInt avgt 15 3.588 ± 0.540 ns/op 2.965 ± 1.380 ns/op (no change) > IfMinMax.testSingleLong avgt 15 3.673 ± 0.128 ns/op 3.506 ± 0.590 ns/op (no change) > IfMinMax.testVectorInt avgt 15 340.425 ± 13.123 ns/op 59.689 ± 7.509 ns/op + 5.7x > IfMinMax.testVectorLong avgt 15 326.420 ± 15.554 ns/op 117.190 ± 5.622 ns/op + 2.8x > > > * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? > > The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated!
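The if-patterns described above and their Min/Max-node equivalent can be sketched in plain Java. The class and method names here are illustrative, not the patch's actual benchmark code; the point is that both shapes compute the same result, so converting one into the other is a pure control-flow simplification:

```java
public class IfMinMaxSketch {
    // The branchy shape the patch recognizes: a compare feeding an if/phi.
    static int maxWithIf(int a, int b) {
        return a > b ? a : b;
    }

    // The equivalent the compiler can convert it to: a single Max node,
    // here written with the Math.max intrinsic.
    static int maxIntrinsic(int a, int b) {
        return Math.max(a, b);
    }

    public static void main(String[] args) {
        int[] samples = {-7, 0, 3, Integer.MIN_VALUE, Integer.MAX_VALUE};
        // Exhaustively check the two shapes agree, including the extremes.
        for (int a : samples) {
            for (int b : samples) {
                if (maxWithIf(a, b) != maxIntrinsic(a, b)) {
                    throw new AssertionError(a + ", " + b);
                }
            }
        }
        System.out.println("ok");
    }
}
```

In a loop over an array, the branchy form blocks vectorization while the Min/Max form can be turned into vector min/max instructions, which is where the speedups in the table above come from.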
Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Improve single benchmark, increase benchmark loop size ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17574/files - new: https://git.openjdk.org/jdk/pull/17574/files/b368c54d..76424e28 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17574&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17574&range=03-04 Stats: 17 lines in 1 file changed: 7 ins; 0 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/17574.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17574/head:pull/17574 PR: https://git.openjdk.org/jdk/pull/17574 From jkarthikeyan at openjdk.org Tue Mar 5 04:10:47 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 5 Mar 2024 04:10:47 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: <_yIQLmJFXOolbLAS8Wcxgl1juRlQwB0OWkKd8ZMcfmg=.9ed4a52d-9ffb-45eb-a0dc-7b3201974882@github.com> References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> <_yIQLmJFXOolbLAS8Wcxgl1juRlQwB0OWkKd8ZMcfmg=.9ed4a52d-9ffb-45eb-a0dc-7b3201974882@github.com> Message-ID: On Mon, 4 Mar 2024 08:59:46 GMT, Emanuel Peter wrote: > You mean you would be matching for a `Cmp -> CMove` node pattern that is equivalent for `Min/Max`, rather than matching a `Cmp -> If -> Phi` pattern? Yeah, I was thinking it might be better to let the CMove transform happen first, since the conditions guarding both transforms are aiming to do the same thing in essence. My thought was that if the regression in your `testCostDifference` was fixed, it would be better to not have to do that fix in two different locations, since it impacts `is_minmax` as well. > BTW, I watched a fascinating talk about branch-predictors / branchless code yesterday Thank you for linking this talk, it was really insightful! 
I also wonder if it would be possible to capture branch execution patterns somehow, to drive branch flattening optimizations. I figure it could be possible to keep track of the sequence of a branch's history of execution, and then compute some "entropy" value from that sequence to determine if there's a pattern, or if it's random and likely to be mispredicted. However, implementing that in practice sounds pretty difficult. @eme64 I've pushed a commit that fixes the benchmarks and sets the loop iteration count to 10_000. Could you check if this lets it vectorize on your machine? Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1977932187 From dholmes at openjdk.org Tue Mar 5 04:32:50 2024 From: dholmes at openjdk.org (David Holmes) Date: Tue, 5 Mar 2024 04:32:50 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: <-otSCbPTibs2l97RfYxBNnTGUjRnUft6UrIVk33woSQ=.c1604e01-0f8b-46ea-9cd8-809775f669b9@github.com> References: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> <-otSCbPTibs2l97RfYxBNnTGUjRnUft6UrIVk33woSQ=.c1604e01-0f8b-46ea-9cd8-809775f669b9@github.com> Message-ID: On Mon, 4 Mar 2024 17:56:09 GMT, Vladimir Kozlov wrote: > @dholmes-ora asked to disable the warning by default and I agree. What I said was that until all the known issues are resolved then the warning should be disabled. If this PR fixes every warning that has been spotted then that is fine - the warning can remain on to detect new problems creeping in. Otherwise issues should be filed to fix all remaining warnings and the warning disabled until they are all addressed. We have been lucky that these unexpected warnings have only caused minimal disruption to our builds.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1977947572 From duke at openjdk.org Tue Mar 5 04:32:52 2024 From: duke at openjdk.org (kuaiwei) Date: Tue, 5 Mar 2024 04:32:52 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: References: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> Message-ID: On Mon, 4 Mar 2024 17:40:13 GMT, Vladimir Kozlov wrote: > > What are the latest changes (commented `operand immLRot2()`) in `arm_32.ad` for? I want to comment out immLRot2. It's mentioned in many todo comments. So I want to keep it without warning. I used the wrong comment syntax. It's fixed in the next patch. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1977948363 From xliu at openjdk.org Tue Mar 5 05:14:45 2024 From: xliu at openjdk.org (Xin Liu) Date: Tue, 5 Mar 2024 05:14:45 GMT Subject: RFR: 8325681: C2 inliner rejects to inline a deeper callee because the methoddata of caller is immature. [v2] In-Reply-To: References: <_2EG8caEVf3BpvBIu40dwmD01ylzLNgazCIm90i_1Cc=.668bed4c-2814-4d59-9ced-3b441fe7a57a@github.com> <7v7ujChQujNMlcA4ZJweEzW0JCXx2Y_rloxPebHfvss=.f59e12af-f766-45a0-a7d0-0fd01748aec3@github.com> Message-ID: On Tue, 27 Feb 2024 19:42:25 GMT, Vladimir Ivanov wrote: >>> I don't think '_count field of ciCallProfile = -1' is correct for an immature method. It forces c2 to outline a call. C2 should make judgement based on the information HotSpot has collected. The real frequency is > than MinInlineFrequencyRatio. >> >> Indeed, that does look excessive. As another idea: utilize frequencies, but not type profiles at call sites with immature profiles. But then the next question is how representative the profile data at callee side then... > > Actually, there's a similar problematic scenario, now at allocation site: a local allocation followed by a long running loop. > > A a = factoryA::make(...); // new A(...)
> for (int i = 0; i < large_count; i++) { > // ... a is eligible for scalarization ... > } > > Inlining the `make` method is a prerequisite to scalarize `a`, but profiling data is so scarce and hard-to-gather (a sample per long-running loop), so it's impractical to wait until profiling is over. It's straightforward to prove that `make` frequency is 100% of total executions (since it dominates the loop), but absolute counts don't make it evident. I have 2 thoughts on this problem. 1) ArgEscape won't be a problem if we have stack-allocation. Even under profiling, it won't hurt. 2) for your case and mine, we can leverage iterative EA, BCEscapeAnalyzer and late-inlining. After EA analysis, C2 can have a map. An ArgEscape object maps to a list of function calls. As long as the compiler still has budget, C2 can do late-inlining for the cheapest obj. It will convert an ArgEscape to NonEscape. I feel only 1 is a general solution. For 2), it's hard to have a cost model. In your example, we probably need to inline 100 bytecodes (factoryA::make) to make 'a' NonEscape. Bigger code may lose to one fast allocation. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17957#discussion_r1512142246 From duke at openjdk.org Tue Mar 5 06:40:00 2024 From: duke at openjdk.org (kuaiwei) Date: Tue, 5 Mar 2024 06:40:00 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v4] In-Reply-To: References: Message-ID: <8heX7a8FEHXGS5WM2ryS63uCJmQcsATp-hULdo87IDw=.f8d56906-dd4a-40ac-9538-6affe9b5353c@github.com> > Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. > I tried to clean unused operands for all platforms. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it.
kuaiwei has updated the pull request incrementally with one additional commit since the last revision: 1 check _disable_warnings in adlc 2 Fix error in arm_32.ad ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18075/files - new: https://git.openjdk.org/jdk/pull/18075/files/29514638..5028086a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18075&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18075&range=02-03 Stats: 13 lines in 2 files changed: 4 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/18075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18075/head:pull/18075 PR: https://git.openjdk.org/jdk/pull/18075 From duke at openjdk.org Tue Mar 5 06:47:48 2024 From: duke at openjdk.org (kuaiwei) Date: Tue, 5 Mar 2024 06:47:48 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: On Mon, 4 Mar 2024 18:18:29 GMT, Vladimir Kozlov wrote: > > I'm not clear about m4 file. How do we use it? > > The build script of jdk will combine all ad file into one single file and adlc will compile it. So aarch64_vector.ad will be checked as well. > > Aarch64 m4 files are used to manually update .ad files. My concern was that could be overlapped code in m4 files which may overwrite your changes in .ad when someone do such manual update in a future. Fortunately `aarch64_ad.m4` does not have operand definitions so your changes are fine. But `aarch64_vector_ad.m4` has them so if we need to change `aarch64_vector.ad` we need to modify m4 file too. I checked all 44 removed operands in aarch64 and found none of them in the m4 files. I just grepped for them, and only "indOffI" and "indOffL" appeared, because of "vmemA_indOffI4" and "vmemA_indOffL4". So we do not need to change the m4 files.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1978064822 From duke at openjdk.org Tue Mar 5 06:47:48 2024 From: duke at openjdk.org (kuaiwei) Date: Tue, 5 Mar 2024 06:47:48 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: References: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> <-otSCbPTibs2l97RfYxBNnTGUjRnUft6UrIVk33woSQ=.c1604e01-0f8b-46ea-9cd8-809775f669b9@github.com> Message-ID: <1xaaqBkBXk0Cor7yFYD4yMht4UOHuR-u5W5lilxr2R0=.9e9fe3c0-03c2-4f83-a0f3-9a06cdd8dacb@github.com> On Tue, 5 Mar 2024 04:29:08 GMT, David Holmes wrote: >> @dholmes-ora asked to disable warning by default and I agree. >> >> We can use `AD._disable_warnings` flag to guard these warnings and add corresponding `-w` flag to `adlc` command in `GensrcAdlc.gmk` > >> @dholmes-ora asked to disable warning by default and I agree. > > What I said was that until all the known issues are resolved then the warning should be disabled. If this PR fixes every warning that has been spotted then that is fine - the warning can remain on to detect new problems creeping in. Otherwise issues should be filed to fix all remaining warnings and the warning disabled until they are all addressed. We have been lucky that these unexpected warnings have only caused minimal disruption to our builds. > @dholmes-ora asked to disable warning by default and I agree. > > We can use `AD._disable_warnings` flag to guard these warnings and add corresponding `-w` flag to `adlc` command in `GensrcAdlc.gmk` I added a check of _disable_warnings in adlc, but did not enable it in the build script.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1978066503 From rcastanedalo at openjdk.org Tue Mar 5 06:59:48 2024 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 5 Mar 2024 06:59:48 GMT Subject: RFR: 8327224: G1: comment in G1BarrierSetC2::post_barrier() refers to nonexistent new_deferred_store_barrier() In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 23:58:34 GMT, Kim Barrett wrote: > Looks good, and trivial. Thanks, Kim! ------------- PR Comment: https://git.openjdk.org/jdk/pull/18108#issuecomment-1978077841 From rcastanedalo at openjdk.org Tue Mar 5 06:59:49 2024 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 5 Mar 2024 06:59:49 GMT Subject: Integrated: 8327224: G1: comment in G1BarrierSetC2::post_barrier() refers to nonexistent new_deferred_store_barrier() In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 15:04:57 GMT, Roberto Castañeda Lozano wrote: > This changeset updates a comment in `G1BarrierSetC2::post_barrier()` to point to the relevant code that must be kept in sync. This pull request has now been integrated. Changeset: 0b959098 Author: Roberto Castañeda Lozano URL: https://git.openjdk.org/jdk/commit/0b959098be452aa2c9b461c921e11b19678138c7 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8327224: G1: comment in G1BarrierSetC2::post_barrier() refers to nonexistent new_deferred_store_barrier() Reviewed-by: kbarrett ------------- PR: https://git.openjdk.org/jdk/pull/18108 From gcao at openjdk.org Tue Mar 5 07:57:58 2024 From: gcao at openjdk.org (Gui Cao) Date: Tue, 5 Mar 2024 07:57:58 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 Message-ID: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Hi, please review this patch that fixes the minimal build failure for riscv.
Error log for minimal build: Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 | ^~~~~~~~~~~~~ | MaxNewSize gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 gmake[3]: *** Waiting for unfinished jobs.... gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 gmake[2]: *** Waiting for unfinished jobs.... ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) === Output from failing command(s) repeated here === * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 | ^~~~~~~~~~~~~ | MaxNewSize * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. === End of repeated output === No indication of failed target found. HELP: Try searching the build log for '] Error'. HELP: Run 'make doctor' to diagnose build problems.
make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 The root cause is that MaxVectorSize is only defined under COMPILER2, so we should use VM_Version::_initial_vector_length instead of MaxVectorSize. Testing: - [x] linux-riscv minimal fastdebug native build ------------- Commit messages: - 8327283: RISC-V: Minimal build failed after JDK-8319716 Changes: https://git.openjdk.org/jdk/pull/18114/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18114&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327283 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18114.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18114/head:pull/18114 PR: https://git.openjdk.org/jdk/pull/18114 From bkilambi at openjdk.org Tue Mar 5 08:16:11 2024 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 5 Mar 2024 08:16:11 GMT Subject: RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2] In-Reply-To: References: Message-ID: > Floating-point addition is non-associative, that is, adding floating-point elements in arbitrary order may yield different values. Specifically, the Vector API does not define the order of reduction intentionally, which allows platforms to generate more efficient code [1]. So we need a node to represent non strictly-ordered add-reduction for floating-point types in C2.
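The non-associativity is easy to observe directly; a minimal, self-contained Java sketch (illustrative only, not taken from the patch):

```java
public class FloatOrder {
    public static void main(String[] args) {
        float a = 1e8f, b = -1e8f, c = 1e-3f;
        // Same three elements, two reduction orders:
        float leftToRight = (a + b) + c; // (1e8 + -1e8) + 1e-3 == 1e-3
        float regrouped   = a + (b + c); // -1e8 + 1e-3 rounds back to -1e8, so the sum is 0
        System.out.println(leftToRight); // 0.001
        System.out.println(regrouped);   // 0.0
    }
}
```

This is why auto-vectorization has to keep the source order (strict), while a Vector API reduction, whose order is left unspecified, may regroup the elements freely.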
> > With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. > > [AArch64] > On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. > > This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. > > No effects on other platforms. > > [Performance] > FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit). > > ADDLanes > > Benchmark Before After Unit > FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms > > > Final code is as below: > > Before: > ` fadda z17.s, p7/m, z17.s, z16.s > ` > After: > > faddp v17.4s, v21.4s, v21.4s > faddp s18, v17.2s > fadd s18, s18, s19 > > > > > [Test] > Full jtreg passed on AArch64 and x86. 
> > [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 > [2] https://bugs.openjdk.org/browse/JDK-8275275 > [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Addressed review comments for changes in backend rules and code style ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18034/files - new: https://git.openjdk.org/jdk/pull/18034/files/f8492ece..f8f79ac2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18034&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18034&range=00-01 Stats: 21 lines in 3 files changed: 10 ins; 3 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/18034.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18034/head:pull/18034 PR: https://git.openjdk.org/jdk/pull/18034 From bkilambi at openjdk.org Tue Mar 5 08:23:50 2024 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 5 Mar 2024 08:23:50 GMT Subject: RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2] In-Reply-To: References: Message-ID: On Wed, 28 Feb 2024 08:32:52 GMT, Guoxiong Li wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments for changes in backend rules and code style > > src/hotspot/cpu/aarch64/aarch64_vector.ad line 2891: > >> 2889: predicate((!VM_Version::use_neon_for_vector(Matcher::vector_length_in_bytes(n->in(2))) && >> 2890: !n->as_Reduction()->requires_strict_order()) || >> 2891: n->as_Reduction()->requires_strict_order()); > > This predication looks strange and complex. 
Can it be simplified to `!VM_Version::use_neon_for_vector(Matcher::vector_length_in_bytes(n->in(2))) || n->as_Reduction()->requires_strict_order()`? Hi, thanks for the suggestion. I agree it is a bit cumbersome but I felt it's easier to understand the various conditions on which these SVE instructions can be generated. Nevertheless, the suggested changes feel more compact. I made the changes in the new PS. > src/hotspot/share/opto/vectorIntrinsics.cpp line 1740: > >> 1738: Node* value = nullptr; >> 1739: if (mask == nullptr) { >> 1740: assert(!is_masked_op, "Masked op needs the mask value never null"); > > This assert may be missed after your refactor. But it seems not really matter. Yes, the conditions of `mask != nullptr` should take care of that. > src/hotspot/share/opto/vectornode.hpp line 242: > >> 240: virtual bool requires_strict_order() const { >> 241: return false; >> 242: }; > > The last semicolon is redundant. Done > src/hotspot/share/opto/vectornode.hpp line 265: > >> 263: class AddReductionVFNode : public ReductionNode { >> 264: private: >> 265: bool _requires_strict_order; // false in Vector API. > > The comment `false in Vector API` seems not so clean. We need to state the meaning of the field instead of one of its usages? 
Done > src/hotspot/share/opto/vectornode.hpp line 276: > >> 274: > >> 275: virtual bool cmp(const Node& n) const { >> 276: return Node::cmp(n) && _requires_strict_order== ((ReductionNode&)n).requires_strict_order(); > > Need a space before `==` Done > src/hotspot/share/opto/vectornode.hpp line 297: > >> 295: > >> 296: virtual bool cmp(const Node& n) const { >> 297: return Node::cmp(n) && _requires_strict_order== ((ReductionNode&)n).requires_strict_order(); > > Need a space before `==` Done ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1512315765 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1512318336 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1512319129 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1512319002 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1512318763 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1512318536 From epeter at openjdk.org Tue Mar 5 08:43:04 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 5 Mar 2024 08:43:04 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v15] In-Reply-To: References: Message-ID: > This is a feature requested by @RogerRiggs and @cl4es . > > **Idea** > > Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. > > This patch here supports a few simple use-cases, like these: > > Merge consecutive array stores, with constants.
We can combine the separate constants into a larger constant: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 > > Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 > > The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 > > **Details** > > This draft currently implements the optimization in an additional special IGVN phase: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 > > We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. We essentially try to establish a chain of mergable stores: > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 > > Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either both store constants, or adjacent segments of a larger value ... 
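For a concrete picture of the shape being merged, here is a hand-rolled little-endian int store written as four byte stores — the "variable that was split (using shifts)" case described above. This is an illustrative sketch (the helper name is made up; the authoritative examples are the TestMergeStores.java cases linked above):

```java
public class MergeStoresSketch {
    // Four adjacent byte stores of shifted segments of the same value.
    // With the optimization, C2 can collapse these four StoreB nodes into one StoreI.
    static void putIntLE(byte[] buf, int off, int v) {
        buf[off]     = (byte)  v;
        buf[off + 1] = (byte) (v >> 8);
        buf[off + 2] = (byte) (v >> 16);
        buf[off + 3] = (byte) (v >> 24);
    }

    public static void main(String[] args) {
        byte[] buf = new byte[4];
        putIntLE(buf, 0, 0x12345678);
        System.out.printf("%02x %02x %02x %02x%n", buf[0], buf[1], buf[2], buf[3]); // 78 56 34 12
    }
}
```

The point of the optimization is that plain Java like this can reach the speed of the hand-tuned variants, without the caller reaching for Unsafe or ByteArrayLittleEndian.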
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: fix test for trapping examples ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16245/files - new: https://git.openjdk.org/jdk/pull/16245/files/638c80f4..4a3ee855 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=13-14 Stats: 27 lines in 1 file changed: 1 ins; 13 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/16245.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16245/head:pull/16245 PR: https://git.openjdk.org/jdk/pull/16245 From gli at openjdk.org Tue Mar 5 09:55:47 2024 From: gli at openjdk.org (Guoxiong Li) Date: Tue, 5 Mar 2024 09:55:47 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:46:56 GMT, Emanuel Peter wrote: > Taking this over from @fg1417, she seems to be "on leave" according to her GitHub account. > > I'm taking her regression test (she improved the reproducer that I had originally reported the bug with). > > It is ok to just remove the assert, because the address is "sanitized" after the assert, i.e. we use a `lea` instruction to compute the address. > > But I'm simply removing the assert, as suggested in the comments of the previous [PR](https://github.com/openjdk/jdk/pull/16991#issuecomment-1962307596). Looks good. ------------- Marked as reviewed by gli (Committer). PR Review: https://git.openjdk.org/jdk/pull/18103#pullrequestreview-1916417828 From roberto.castaneda.lozano at oracle.com Tue Mar 5 10:19:40 2024 From: roberto.castaneda.lozano at oracle.com (Roberto Castaneda Lozano) Date: Tue, 5 Mar 2024 10:19:40 +0000 Subject: A case where G1/Shenandoah satb barrier is not optimized? 
In-Reply-To: <4d7f6d11-824b-47d0-8419-06694f695745.yude.lyd@alibaba-inc.com> References: <4d7f6d11-824b-47d0-8419-06694f695745.yude.lyd@alibaba-inc.com> Message-ID: Hi Yude Yin (including hotspot-compiler-dev mailing list), From what I read in the original JBS issue [1], the g1_can_remove_pre_barrier/g1_can_remove_post_barrier optimization targets writes within simple constructors (such as that of Node within java.util.HashMap [2]), and seems to assume that the situation you describe (several writes to the same field) is either uncommon within this scope or can be reduced by the compiler into a form that is optimizable. In your example, one would hope that the compiler proves that 'ref = a' is redundant and optimizes it away (which would lead to removing all barriers), but this optimization is inhibited by the barrier operations inserted by the compiler in its intermediate representation. These limitations will become easier to overcome with the "Late G1 Barrier Expansion" JEP (in draft status), which proposes hiding barrier code from the compiler's transformations and optimizations [3]. In fact, our current "Late G1 Barrier Expansion" prototype does optimize 'ref = a' away, and removes all barriers in your example. Cheers, Roberto [1] https://bugs.openjdk.org/browse/JDK-8057737 [2] https://github.com/openjdk/jdk/blob/e9adcebaf242843fe2004b01747b5a930b62b291/src/java.base/share/classes/java/util/HashMap.java#L287-L292 [3] https://bugs.openjdk.org/browse/JDK-8322295 ________________________________________ From: hotspot-gc-dev on behalf of Yude Lin Sent: Monday, March 4, 2024 11:32 AM To: hotspot-gc-dev Subject: A case where G1/Shenandoah satb barrier is not optimized? Hi Dear GC devs, I found a case where GC barriers cannot be optimized out.
I wonder if anyone could enlighten me on this code: > G1BarrierSetC2::g1_can_remove_pre_barrier (or ShenandoahBarrierSetC2::satb_can_remove_pre_barrier) where there is a condition: > (captured_store == nullptr || captured_store == st_init->zero_memory()) on the store that can be optimized out. The comment says: > The compiler needs to determine that the object in which a field is about > to be written is newly allocated, and that no prior store to the same field > has happened since the allocation. But my understanding is that satb barriers of any number of stores immediately (i.e., no in-between safepoints) after an allocation can be optimized out, same field or not. The "no prior store" condition confuses me. What's more, failing to optimize one satb barrier will prevent further barrier optimization that otherwise would be done (maybe due to control flow complexity from the satb barrier). An example would be:

    public static class TwoFieldObject {
        public Object ref;
        public Object ref2;

        public TwoFieldObject(Object a) {
            ref = a;
        }
    }

    public static Object testWrite(Object a, Object b, Object c) {
        TwoFieldObject tfo = new TwoFieldObject(a);
        tfo.ref = b;  // satb barrier of this store cannot be optimized out, and because of its existence, post barrier will also not be optimized out
        tfo.ref2 = c; // because of the previous store's barriers, pre/post barriers of this store will not be optimized out
        return tfo;
    }

From rkennke at openjdk.org Tue Mar 5 11:12:57 2024 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 5 Mar 2024 11:12:57 GMT Subject: RFR: 8327361: Update some comments after JDK-8139457 Message-ID: A follow-up to [8139457: Relax alignment of array elements](https://bugs.openjdk.org/browse/JDK-8139457) ([PR](https://git.openjdk.org/jdk/pull/11044)), some comments need an update.
------------- Commit messages: - 8327361: Update some comments after JDK-8139457 Changes: https://git.openjdk.org/jdk/pull/18120/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18120&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327361 Stats: 12 lines in 2 files changed: 0 ins; 0 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/18120.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18120/head:pull/18120 PR: https://git.openjdk.org/jdk/pull/18120 From epeter at openjdk.org Tue Mar 5 11:17:47 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 5 Mar 2024 11:17:47 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> <_yIQLmJFXOolbLAS8Wcxgl1juRlQwB0OWkKd8ZMcfmg=.9ed4a52d-9ffb-45eb-a0dc-7b3201974882@github.com> Message-ID: <-Cwct-5ZBYHEG-67r6xe4by7s0rI7w27ogfdJcIEBrw=.e4c95ed8-6353-46b9-a946-3bf2b2c47765@github.com> On Tue, 5 Mar 2024 04:08:19 GMT, Jasmine Karthikeyan wrote: >> @jaskarth >>> With a bit of further reflection on all this, I think it might be best if this patch was changed so that it acts on CMove directly >> >> You mean you would be matching for a `Cmp -> CMove` node pattern that is equivalent for `Min/Max`, rather than matching a `Cmp -> If -> Phi` pattern? I guess that would allow you to get better types, without having to deal with all the CMove-vs-branch-prediction heuristics. >> >> BTW, I watched a fascinating talk about branch-predictors / branchless code yesterday: >> `Branchless Programming in C++ - Fedor Pikus - CppCon 2021` >> https://www.youtube.com/watch?v=g-WPhYREFjk >> >> My conclusion from that: it is really hard to say ahead of time if the branch-predictor is successful. It depends on how predictable a condition is. The branch-predictor can see patterns (like alternating true-false). 
So even if a probability is 50% on a branch, it may be fully predictable, and branching code is much more efficient than branchless code. But in totally random cases, branchless code may be faster because you will have a large percentage of mispredictions, and mispredictions are expensive. But in both cases you would see `iff->_prob = 0.5`. >> Really what we would need is profiling that checks how much a branch was `mispredicted`, and not how much it was `taken`. But not sure if we can even get that profiling data. > >> You mean you would be matching for a `Cmp -> CMove` node pattern that is equivalent for `Min/Max`, rather than matching a `Cmp -> If -> Phi` pattern? > > Yeah, I was thinking it might be better to let the CMove transform happen first, since the conditions guarding both transforms are aiming to do the same thing in essence. My thought was that if the regression in your `testCostDifference` was fixed, it would be better to not have to do that fix in two different locations, since it impacts `is_minmax` as well. > >> BTW, I watched a fascinating talk about branch-predictors / branchless code yesterday > > Thank you for linking this talk, it was really insightful! I also wonder if it would be possible to capture branch execution patterns somehow, to drive branch flattening optimizations. I figure it could be possible to keep track of the sequence of a branch's history of execution, and then compute some "entropy" value from that sequence to determine if there's a pattern, or if it's random and likely to be mispredicted. However, implementing that in practice sounds pretty difficult. > > @eme64 I've pushed a commit that fixes the benchmarks and sets the loop iteration count to 10_000. Could you check if this lets it vectorize on your machine? Thanks! @jaskarth Why don't you first make the code change with starting from a `Cmp -> CMove` pattern rather than the `Cmp -> If -> Phi` pattern. 
Then I can look at both things together ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1978523865 From duke at openjdk.org Tue Mar 5 11:32:47 2024 From: duke at openjdk.org (Swati Sharma) Date: Tue, 5 Mar 2024 11:32:47 GMT Subject: RFR: 8326421: Add jtreg test for large arrayCopy disjoint case. In-Reply-To: References: Message-ID: On Fri, 23 Feb 2024 18:56:48 GMT, Swati Sharma wrote: > There is already a large suite of arraycopy tests here: https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/compiler/arraycopy/stress > > Any reason for not extending that one instead? Hi @shipilev , I tried to extend it in the stress framework; below are a few points which I observed: - As all the tests with different primitive types initialize the orig and test arrays with the MAX_SIZE parameter, there is no other way to increase the size of the array. I tried increasing MAX_SIZE from 128K to 3MB to cover the test points, and that increased the test execution time from 2 minutes to 4 minutes. - The testWith method uses the MAX_SIZE parameter to define the array, so defining a new array of 4MB size requires adding a new test method for all types, which I think would duplicate the code. - The current test takes a few seconds to execute for large sizes and has very pointed length test cases, instead of random lengths with both aligned and unaligned cases for the byte type. Swati ------------- PR Comment: https://git.openjdk.org/jdk/pull/17962#issuecomment-1978548219 From galder at openjdk.org Tue Mar 5 11:36:46 2024 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 5 Mar 2024 11:36:46 GMT Subject: RFR: 8327361: Update some comments after JDK-8139457 In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 11:06:48 GMT, Roman Kennke wrote: > A follow-up to [8139457: Relax alignment of array elements](https://bugs.openjdk.org/browse/JDK-8139457) ([PR](https://git.openjdk.org/jdk/pull/11044)), some comments need an update.
I think the changes look fine, but looking closer to the original PR, src/hotspot/cpu/riscv/c1_MacroAssembler_riscv.hpp might also need adjusting. s390 and ppc are probably just fine. ------------- Changes requested by galder (Author). PR Review: https://git.openjdk.org/jdk/pull/18120#pullrequestreview-1916637704 From rkennke at openjdk.org Tue Mar 5 11:41:20 2024 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 5 Mar 2024 11:41:20 GMT Subject: RFR: 8327361: Update some comments after JDK-8139457 [v2] In-Reply-To: References: Message-ID: > A follow-up to [8139457: Relax alignment of array elements](https://bugs.openjdk.org/browse/JDK-8139457) ([PR](https://git.openjdk.org/jdk/pull/11044)), some comments need an update. Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: RISCV changes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18120/files - new: https://git.openjdk.org/jdk/pull/18120/files/a14c0c9c..2da3ee69 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18120&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18120&range=00-01 Stats: 6 lines in 1 file changed: 0 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/18120.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18120/head:pull/18120 PR: https://git.openjdk.org/jdk/pull/18120 From gli at openjdk.org Tue Mar 5 12:16:49 2024 From: gli at openjdk.org (Guoxiong Li) Date: Tue, 5 Mar 2024 12:16:49 GMT Subject: RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 08:16:11 GMT, Bhavana Kilambi wrote: >> Floating-point addition is non-associative, that is adding floating-point elements in arbitrary order may get different value. Specially, Vector API does not define the order of reduction intentionally, which allows platforms to generate more efficient codes [1]. 
So that needs a node to represent non strictly-ordered add-reduction for floating-point type in C2. >> >> To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value. >> >> With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. >> >> [AArch64] >> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. >> >> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. >> >> No effects on other platforms. >> >> [Performance] >> FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit). >> >> ADDLanes >> >> Benchmark Before After Unit >> FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms >> >> >> Final code is as below: >> >> Before: >> ` fadda z17.s, p7/m, z17.s, z16.s >> ` >> After: >> >> faddp v17.4s, v21.4s, v21.4s >> faddp s18, v17.2s >> fadd s18, s18, s19 >> >> >> >> >> [Test] >> Full jtreg passed on AArch64 and x86. 
>> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 >> [2] https://bugs.openjdk.org/browse/JDK-8275275 >> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments for changes in backend rules and code style Looks good except the comment. src/hotspot/share/opto/vectornode.hpp line 268: > 266: // The value is true when add reduction for floats is auto-vectorized as auto-vectorization > 267: // mandates strict ordering but the value is false when this node is generated through VectorAPI > 268: // as VectorAPI does not impose any such rules on ordering. The comment can be more better. But I leave it to a reviewer who is proficient in english to help you improve it. ------------- Marked as reviewed by gli (Committer). PR Review: https://git.openjdk.org/jdk/pull/18034#pullrequestreview-1916715388 PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1512719793 From gcao at openjdk.org Tue Mar 5 12:41:45 2024 From: gcao at openjdk.org (Gui Cao) Date: Tue, 5 Mar 2024 12:41:45 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 In-Reply-To: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Tue, 5 Mar 2024 07:41:05 GMT, Gui Cao wrote: > Hi, please review this patch that fix the minimal build failed for riscv. 
> > Error log for minimal build: > > Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) > ^@/home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 > gmake[3]: *** Waiting for unfinished jobs.... > ^@gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 > gmake[2]: *** Waiting for unfinished jobs.... > ^@ > ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) > > === Output from failing command(s) repeated here === > * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > > * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. > === End of repeated output === > > No indication of failed target found. > HELP: Try searching the build log for '] Error'. > HELP: Run 'make doctor' to diagnose build problems. 
> > make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 > make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 > > > The root cause is that MaxVectorSize is only defined under COMPILER2. We should use VM_Version::_initial_vector_length instead of MaxVectorSize. > > Testing: > > - [x] linux-riscv minimal fastdebug native build @robehn Could you please take a look? ------------- PR Comment: https://git.openjdk.org/jdk/pull/18114#issuecomment-1978686934 From rehn at openjdk.org Tue Mar 5 12:56:45 2024 From: rehn at openjdk.org (Robbin Ehn) Date: Tue, 5 Mar 2024 12:56:45 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 In-Reply-To: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Tue, 5 Mar 2024 07:41:05 GMT, Gui Cao wrote: > Hi, please review this patch, which fixes the minimal build failure for riscv. > > Error log for minimal build: > > Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) > ^@/home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 > gmake[3]: *** Waiting for unfinished jobs.... > ^@gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 > gmake[2]: *** Waiting for unfinished jobs.... 
> ^@ > ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) > > === Output from failing command(s) repeated here === > * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > > * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. > === End of repeated output === > > No indication of failed target found. > HELP: Try searching the build log for '] Error'. > HELP: Run 'make doctor' to diagnose build problems. > > make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 > make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 > > > The root cause is that MaxVectorSize is only defined under COMPILER2. We should use VM_Version::_initial_vector_length instead of MaxVectorSize. > > Testing: > > - [x] linux-riscv minimal fastdebug native build The SHA intrinsics are only used in "LibraryCallKit::inline_digestBase_implCompress" and JVMCI. So I think these (plus md5 and chacha) should be put into an ifdef COMPILER2_OR_JVMCI block. 
(I was going to do that but it slipped my mind) The MaxVectorSize is defined if JVMCI and/or C2 is defined: `NOT_COMPILER2(product(intx, MaxVectorSize, 64,` ------------- PR Comment: https://git.openjdk.org/jdk/pull/18114#issuecomment-1978713558 From gcao at openjdk.org Tue Mar 5 13:15:46 2024 From: gcao at openjdk.org (Gui Cao) Date: Tue, 5 Mar 2024 13:15:46 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 In-Reply-To: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Tue, 5 Mar 2024 07:41:05 GMT, Gui Cao wrote: > Hi, please review this patch, which fixes the minimal build failure for riscv. > > Error log for minimal build: > > Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) > ^@/home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 > gmake[3]: *** Waiting for unfinished jobs.... > ^@gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 > gmake[2]: *** Waiting for unfinished jobs.... 
> ^@ > ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) > > === Output from failing command(s) repeated here === > * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > > * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. > === End of repeated output === > > No indication of failed target found. > HELP: Try searching the build log for '] Error'. > HELP: Run 'make doctor' to diagnose build problems. > > make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 > make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 > > > The root cause is that MaxVectorSize is only defined under COMPILER2. We should use VM_Version::_initial_vector_length instead of MaxVectorSize. > > Testing: > > - [x] linux-riscv minimal fastdebug native build > The SHA intrinsics are only used in "LibraryCallKit::inline_digestBase_implCompress" and JVMCI. So I think these (plus md5 and chacha) should be put into an ifdef COMPILER2_OR_JVMCI block. 
(I was going to do that but it slipped my mind) > > The MaxVectorSize is defined if JVMCI and/or C2 is defined: `NOT_COMPILER2(product(intx, MaxVectorSize, 64,` Yes, you are right. I considered putting the function definition under an ifdef COMPILER2_OR_JVMCI block that way. But I find that no other CPU port does this, and I am not sure if there is any other reason for that. But I can do it if we all think it's better. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18114#issuecomment-1978751082 From jbhateja at openjdk.org Tue Mar 5 13:15:47 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 5 Mar 2024 13:15:47 GMT Subject: RFR: 8327147: optimized implementation of round operation for x86_64 CPUs [v2] In-Reply-To: References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: On Tue, 5 Mar 2024 00:08:05 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. >> >> Below is the performance data on an Intel Tiger Lake machine. >> >> Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup >> -- | -- | -- | -- >> MathBench.ceilDouble | 547979 | 2170198 | 3.96 >> MathBench.floorDouble | 547979 | 2167459 | 3.96 >> MathBench.rintDouble | 547962 | 2130499 | 3.89 > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > unify the implementation Marked as reviewed by jbhateja (Reviewer). 
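For readers following the quoted PR: the semantics any faster implementation must preserve are fixed by the `Math` API. A few edge cases (standard, specified Java behavior, shown here for reference only):

```java
// Edge cases of the rounding operations accelerated by the quoted PR:
// rint rounds half-way cases to the even neighbor, while ceil/floor
// round toward positive/negative infinity.
public class RoundSemantics {
    public static void main(String[] args) {
        System.out.println(Math.rint(2.5));   // 2.0: ties round to even
        System.out.println(Math.rint(3.5));   // 4.0
        System.out.println(Math.ceil(1.1));   // 2.0: smallest integer >= argument
        System.out.println(Math.floor(-1.1)); // -2.0: largest integer <= argument
    }
}
```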
------------- PR Review: https://git.openjdk.org/jdk/pull/18089#pullrequestreview-1916872568 From jbhateja at openjdk.org Tue Mar 5 13:16:48 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 5 Mar 2024 13:16:48 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v11] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 21:40:04 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. >> >> This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) >> >> This PR shows upto 19x speedup on buffer sizes of 1MB. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > update asserts for vpmadd52l/hq Marked as reviewed by jbhateja (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/17881#pullrequestreview-1916876320 From epeter at openjdk.org Tue Mar 5 13:39:51 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 5 Mar 2024 13:39:51 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 18:20:41 GMT, Vladimir Kozlov wrote: >> Taking this over from @fg1417, she seems to be "on leave" according to her GitHub account. >> >> I'm taking her regression test (she improved the reproducer that I had originally reported the bug with). >> >> It is ok to just remove the assert, because the address is "sanitized" after the assert, i.e. we use a `lea` instruction to compute the address. >> >> But I'm simply removing the assert, as suggested in the comments of the previous [PR](https://github.com/openjdk/jdk/pull/16991#issuecomment-1962307596). > > Good. 
Thanks @vnkozlov @theRealAph @lgxbslgx for the reviews! Thanks @fg1417 for the original PR and patching up the test. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18103#issuecomment-1978790138 From epeter at openjdk.org Tue Mar 5 13:39:53 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 5 Mar 2024 13:39:53 GMT Subject: Integrated: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:46:56 GMT, Emanuel Peter wrote: > Taking this over from @fg1417, she seems to be "on leave" according to her GitHub account. > > I'm taking her regression test (she improved the reproducer that I had originally reported the bug with). > > It is ok to just remove the assert, because the address is "sanitized" after the assert, i.e. we use a `lea` instruction to compute the address. > > But I'm simply removing the assert, as suggested in the comments of the previous [PR](https://github.com/openjdk/jdk/pull/16991#issuecomment-1962307596). This pull request has now been integrated. 
Changeset: 98f0b866 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/98f0b86641d84048949ed3da1cb14f3820b01c12 Stats: 176 lines in 2 files changed: 172 ins; 4 del; 0 mod 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" Co-authored-by: Fei Gao Reviewed-by: aph, kvn, gli ------------- PR: https://git.openjdk.org/jdk/pull/18103 From jbhateja at openjdk.org Tue Mar 5 14:13:48 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 5 Mar 2024 14:13:48 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: <_6dorzq67KAZsTBHBvbQRDi_xW70bFhJudnxbG88m6I=.33e06bd5-d5fc-4ba8-b740-437155d567cf@github.com> References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> <_6dorzq67KAZsTBHBvbQRDi_xW70bFhJudnxbG88m6I=.33e06bd5-d5fc-4ba8-b740-437155d567cf@github.com> Message-ID: <_5z5emOe-VqjE7REHmk72wtJ-X_MUggxilrkXFUjdPo=.e30bafc3-0fc4-4872-a99c-f22e383301e3@github.com> On Mon, 4 Mar 2024 21:36:36 GMT, Srinivas Vamsi Parasa wrote: >> It would be good to make this instruction generic. > > Please see the updated assert as suggested for vpmadd52[l/h]uq in the latest commit. [poly1305_spr_validation.patch](https://github.com/openjdk/jdk/files/14496404/poly1305_spr_validation.patch) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1512889086 From epeter at openjdk.org Tue Mar 5 14:32:02 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 5 Mar 2024 14:32:02 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v16] In-Reply-To: References: Message-ID: > This is a feature requested by @RogerRiggs and @cl4es . > > **Idea** > > Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. 
Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. > > This patch here supports a few simple use-cases, like these: > > Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 > > Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 > > The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 > > **Details** > > This draft currently implements the optimization in an additional special IGVN phase: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 > > We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. 
We essentially try to establish a chain of mergable stores: > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 > > Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either both store constants, or adjacent segments of a larger value ... Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 46 commits: - Merge branch 'master' into JDK-8318446 - fix test for trapping examples - WIP test with out of bounds exception - allow only array stores of same type as container - mismatched access test - add test300 - make it happen in post_loop_opts - fix invalid case - cosmetic fixes - New version with ArrayPointer - ... and 36 more: https://git.openjdk.org/jdk/compare/98f0b866...07c233fb ------------- Changes: https://git.openjdk.org/jdk/pull/16245/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=15 Stats: 2391 lines in 13 files changed: 2387 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/16245.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16245/head:pull/16245 PR: https://git.openjdk.org/jdk/pull/16245 From epeter at openjdk.org Tue Mar 5 15:29:01 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 5 Mar 2024 15:29:01 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v16] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 14:32:02 GMT, Emanuel Peter wrote: >> This is a feature requested by @RogerRiggs and @cl4es . >> >> **Idea** >> >> Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. 
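An editorial illustration of the store pattern the quoted description targets (a hypothetical sketch; the class and method names are not from the patch):

```java
// Hypothetical example of a mergable store chain: the four byte stores
// below write adjacent array elements with adjacent segments of the same
// int value (split by shifts and truncation), which is the shape that can
// be merged into a single 4-byte store.
public class MergeStoresSketch {

    public static void putIntLE(byte[] a, int off, int v) {
        a[off    ] = (byte) (v      );
        a[off + 1] = (byte) (v >>  8);
        a[off + 2] = (byte) (v >> 16);
        a[off + 3] = (byte) (v >> 24);
    }

    public static void main(String[] args) {
        byte[] a = new byte[4];
        putIntLE(a, 0, 0x04030201);
        // Little-endian layout: the lowest byte of the value lands at the
        // lowest array index.
        System.out.println(a[0] + " " + a[1] + " " + a[2] + " " + a[3]); // 1 2 3 4
    }
}
```

This mirrors the shift-and-truncate cases the quoted TestMergeStores.java links exercise, without depending on Unsafe or ByteArrayLittleEndian.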
Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. >> >> This patch here supports a few simple use-cases, like these: >> >> Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 >> >> Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 >> >> The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 >> >> **Details** >> >> This draft currently implements the optimization in an additional special IGVN phase: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 >> >> We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. 
We essentially try to establish a chain of mergable stores: >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 >> >> Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either bot... > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 46 commits: > > - Merge branch 'master' into JDK-8318446 > - fix test for trapping examples > - WIP test with out of bounds exception > - allow only array stores of same type as container > - mismatched access test > - add test300 > - make it happen in post_loop_opts > - fix invalid case > - cosmetic fixes > - New version with ArrayPointer > - ... and 36 more: https://git.openjdk.org/jdk/compare/98f0b866...07c233fb A blocking issue is now integrated and merged: https://github.com/openjdk/jdk/pull/18103 ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1979030281 From epeter at openjdk.org Tue Mar 5 15:55:12 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 5 Mar 2024 15:55:12 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: Message-ID: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> > This is a feature requiested by @RogerRiggs and @cl4es . > > **Idea** > > Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). 
They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. > > This patch here supports a few simple use-cases, like these: > > Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 > > Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 > > The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 > > **Details** > > This draft currently implements the optimization in an additional special IGVN phase: > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 > > We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. We essentially try to establish a chain of mergable stores: > > https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 > > Mergable stores must have the same Opcode (implies they have the same element type and hence size). 
Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either both store constants, or adjacent segments of a larger value ... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: a little bit of casting for debug printing code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16245/files - new: https://git.openjdk.org/jdk/pull/16245/files/07c233fb..796d9508 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=16 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=15-16 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/16245.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16245/head:pull/16245 PR: https://git.openjdk.org/jdk/pull/16245 From ddong at openjdk.org Tue Mar 5 15:58:54 2024 From: ddong at openjdk.org (Denghui Dong) Date: Tue, 5 Mar 2024 15:58:54 GMT Subject: RFR: 8327379: Make TimeLinearScan a develop flag Message-ID: <0qF035kq8V0ehMhAdf1_ptnYwXHdtifMIazkoSn5laI=.54fb7a02-ca09-4f25-a920-1b13ff49ff78@github.com> Hi, Please help review this change that makes TimeLinearScan a develop flag. Currently, TimeLinearScan is only used in code guarded by '#ifndef PRODUCT'. 
------------- Commit messages: - 8327379: Make TimeLinearScan a develop flag Changes: https://git.openjdk.org/jdk/pull/18125/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18125&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327379 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18125.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18125/head:pull/18125 PR: https://git.openjdk.org/jdk/pull/18125 From ddong at openjdk.org Tue Mar 5 16:13:09 2024 From: ddong at openjdk.org (Denghui Dong) Date: Tue, 5 Mar 2024 16:13:09 GMT Subject: RFR: 8327379: Make TimeLinearScan a develop flag [v2] In-Reply-To: <0qF035kq8V0ehMhAdf1_ptnYwXHdtifMIazkoSn5laI=.54fb7a02-ca09-4f25-a920-1b13ff49ff78@github.com> References: <0qF035kq8V0ehMhAdf1_ptnYwXHdtifMIazkoSn5laI=.54fb7a02-ca09-4f25-a920-1b13ff49ff78@github.com> Message-ID: > Hi, > > Please help review this change that makes TimeLinearScan a develop flag. > > Currently, TimeLinearScan is only used in code guarded by '#ifndef PRODUCT'. We should move it to develop or maybe notproduct. 
Denghui Dong has updated the pull request incrementally with one additional commit since the last revision: update header ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18125/files - new: https://git.openjdk.org/jdk/pull/18125/files/6706a1e9..a242dc19 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18125&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18125&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18125.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18125/head:pull/18125 PR: https://git.openjdk.org/jdk/pull/18125 From epeter at openjdk.org Tue Mar 5 16:48:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 5 Mar 2024 16:48:58 GMT Subject: RFR: 8327172: C2 SuperWord: data node in loop has no input in loop: replace assert with bailout Message-ID: <4CbSaPG5DEt47GreYegAKBIbf8TJyTio7-y2gBg0WN8=.419560f1-cbda-41da-b68f-76d72ee9c489@github.com> This is a regression fix from https://github.com/openjdk/jdk/pull/17657. I had never encountered an example where a data node in the loop body did not have any input node in the loop. My assumption was that this should never happen, such a node should move out of the loop itself. I now encountered such an example. But I think it shows that there are cases where we compute the ctrl wrong. https://github.com/openjdk/jdk/blob/8835f786b8dc7db1ebff07bbb3dbb61a6c42f6c8/test/hotspot/jtreg/compiler/loopopts/superword/TestNoInputInLoop.java#L65-L73 I now had a few options: 1. Revert to the code before https://github.com/openjdk/jdk/pull/17657: handle such cases with the extra `data_entry` logic. But this would just be extra complexity for patterns that should not exist in the first place. 2. Fix the computation of ctrl. But we know that there are many edge cases that are currently wrong, and I am working on verification and fixing these issues in https://github.com/openjdk/jdk/pull/16558. 
So I would rather fix those pre-existing issues separately. 3. Just create a silent bailout from vectorization, with `VStatus::make_failure`. I chose option 3, since it allows simple logic, and only prevents vectorization in cases that are already otherwise broken. ------------- Commit messages: - the fix - 8327172 Changes: https://git.openjdk.org/jdk/pull/18123/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18123&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327172 Stats: 103 lines in 3 files changed: 100 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18123.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18123/head:pull/18123 PR: https://git.openjdk.org/jdk/pull/18123 From chagedorn at openjdk.org Tue Mar 5 17:01:45 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 5 Mar 2024 17:01:45 GMT Subject: RFR: 8327172: C2 SuperWord: data node in loop has no input in loop: replace assert with bailout In-Reply-To: <4CbSaPG5DEt47GreYegAKBIbf8TJyTio7-y2gBg0WN8=.419560f1-cbda-41da-b68f-76d72ee9c489@github.com> References: <4CbSaPG5DEt47GreYegAKBIbf8TJyTio7-y2gBg0WN8=.419560f1-cbda-41da-b68f-76d72ee9c489@github.com> Message-ID: On Tue, 5 Mar 2024 14:53:33 GMT, Emanuel Peter wrote: > This is a regression fix from https://github.com/openjdk/jdk/pull/17657. > > I had never encountered an example where a data node in the loop body did not have any input node in the loop. > My assumption was that this should never happen, such a node should move out of the loop itself. > > I now encountered such an example. But I think it shows that there are cases where we compute the ctrl wrong. > > https://github.com/openjdk/jdk/blob/8835f786b8dc7db1ebff07bbb3dbb61a6c42f6c8/test/hotspot/jtreg/compiler/loopopts/superword/TestNoInputInLoop.java#L65-L73 > > I now had a few options: > 1. Revert to the code before https://github.com/openjdk/jdk/pull/17657: handle such cases with the extra `data_entry` logic. 
But this would just be extra complexity for patterns that should not exist in the first place. > 2. Fix the computation of ctrl. But we know that there are many edge cases that are currently wrong, and I am working on verification and fixing these issues in https://github.com/openjdk/jdk/pull/16558. So I would rather fix those pre-existing issues separately. > 3. Just create a silent bailout from vectorization, with `VStatus::make_failure`. > > I chose option 3, since it allows simple logic, and only prevents vectorization in cases that are already otherwise broken. That looks reasonable. I agree to fix the ctrl issues separately and go with a bailout solution for now. Maybe you want to add a note at [JDK-8307982](https://bugs.openjdk.org/browse/JDK-8307982) to not forget about this case here. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18123#pullrequestreview-1917612473 From kvn at openjdk.org Tue Mar 5 17:02:47 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 5 Mar 2024 17:02:47 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v4] In-Reply-To: <8heX7a8FEHXGS5WM2ryS63uCJmQcsATp-hULdo87IDw=.f8d56906-dd4a-40ac-9538-6affe9b5353c@github.com> References: <8heX7a8FEHXGS5WM2ryS63uCJmQcsATp-hULdo87IDw=.f8d56906-dd4a-40ac-9538-6affe9b5353c@github.com> Message-ID: On Tue, 5 Mar 2024 06:40:00 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platforms. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > 1 check _disable_warnings in adlc 2 Fix error in arm_32.ad This looks good now. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/18075#pullrequestreview-1917613973 From sviswanathan at openjdk.org Tue Mar 5 18:51:45 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 5 Mar 2024 18:51:45 GMT Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v2] In-Reply-To: References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: On Tue, 5 Mar 2024 00:08:05 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. >> >> Below is the performance data on an Intel Tiger Lake machine. >> >> Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
>> -- | -- | -- | --
>> MathBench.ceilDouble | 547979 | 2170198 | 3.96
>> MathBench.floorDouble | 547979 | 2167459 | 3.96
>> MathBench.rintDouble | 547962 | 2130499 | 3.89
> > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > unify the implementation Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/18089#pullrequestreview-1917873194 From kvn at openjdk.org Tue Mar 5 18:55:46 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 5 Mar 2024 18:55:46 GMT Subject: RFR: 8327172: C2 SuperWord: data node in loop has no input in loop: replace assert with bailout In-Reply-To: <4CbSaPG5DEt47GreYegAKBIbf8TJyTio7-y2gBg0WN8=.419560f1-cbda-41da-b68f-76d72ee9c489@github.com> References: <4CbSaPG5DEt47GreYegAKBIbf8TJyTio7-y2gBg0WN8=.419560f1-cbda-41da-b68f-76d72ee9c489@github.com> Message-ID: On Tue, 5 Mar 2024 14:53:33 GMT, Emanuel Peter wrote: > This is a regression fix from https://github.com/openjdk/jdk/pull/17657. > > I had never encountered an example where a data node in the loop body did not have any input node in the loop. > My assumption was that this should never happen, such a node should move out of the loop itself. > > I now encountered such an example. But I think it shows that there are cases where we compute the ctrl wrong. > > https://github.com/openjdk/jdk/blob/8835f786b8dc7db1ebff07bbb3dbb61a6c42f6c8/test/hotspot/jtreg/compiler/loopopts/superword/TestNoInputInLoop.java#L65-L73 > > I now had a few options: > 1. Revert to the code before https://github.com/openjdk/jdk/pull/17657: handle such cases with the extra `data_entry` logic. But this would just be extra complexity for patterns that should not exist in the first place. > 2. Fix the computation of ctrl. But we know that there are many edge cases that are currently wrong, and I am working on verification and fixing these issues in https://github.com/openjdk/jdk/pull/16558. So I would rather fix those pre-existing issues separately. > 3. Just create a silent bailout from vectorization, with `VStatus::make_failure`. > > I chose option 3, since it allows simple logic, and only prevents vectorization in cases that are already otherwise broken. Looks good. ------------- Marked as reviewed by kvn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/18123#pullrequestreview-1917880586 From dlong at openjdk.org Tue Mar 5 22:40:47 2024 From: dlong at openjdk.org (Dean Long) Date: Tue, 5 Mar 2024 22:40:47 GMT Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v2] In-Reply-To: References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: On Tue, 5 Mar 2024 00:08:05 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. >> >> Below is the performance data on an Intel Tiger Lake machine. >> >> Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
>> -- | -- | -- | --
>> MathBench.ceilDouble | 547979 | 2170198 | 3.96
>> MathBench.floorDouble | 547979 | 2167459 | 3.96
>> MathBench.rintDouble | 547962 | 2130499 | 3.89
> > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > unify the implementation So if we can still generate the non-AVX encoding of `roundsd dst, src, mode` isn't there still a false dependency problem with `dst`? ------------- PR Comment: https://git.openjdk.org/jdk/pull/18089#issuecomment-1979755885 From ksakata at openjdk.org Wed Mar 6 00:33:45 2024 From: ksakata at openjdk.org (Koichi Sakata) Date: Wed, 6 Mar 2024 00:33:45 GMT Subject: RFR: 8323242: Remove vestigial DONT_USE_REGISTER_DEFINES Message-ID: This pull request removes an unnecessary directive.
There is no definition of DONT_USE_REGISTER_DEFINES in HotSpot or the build system, so this `#ifndef` conditional directive is always true. We can remove it. I built OpenJDK with Zero VM as a test. It was successful. $ ./configure --with-jvm-variants=zero --enable-debug $ make images $ ./build/macosx-aarch64-zero-fastdebug/jdk/bin/java -version openjdk version "23-internal" 2024-09-17 OpenJDK Runtime Environment (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk) OpenJDK 64-Bit Zero VM (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk, interpreted mode) It may be possible to remove the `#define noreg` as well because the CONSTANT_REGISTER_DECLARATION macro creates a variable named noreg, but I can't be sure. When I tried removing the noreg definition and building the OpenJDK, the build was successful. ------------- Commit messages: - Remove DONT_USE_REGISTER_DEFINES Changes: https://git.openjdk.org/jdk/pull/18115/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18115&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8323242 Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18115.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18115/head:pull/18115 PR: https://git.openjdk.org/jdk/pull/18115 From gli at openjdk.org Wed Mar 6 00:33:45 2024 From: gli at openjdk.org (Guoxiong Li) Date: Wed, 6 Mar 2024 00:33:45 GMT Subject: RFR: 8323242: Remove vestigial DONT_USE_REGISTER_DEFINES In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 08:07:19 GMT, Koichi Sakata wrote: > This pull request removes an unnecessary directive. > > There is no definition of DONT_USE_REGISTER_DEFINES in HotSpot or the build system, so this `#ifndef` conditional directive is always true. We can remove it. > > I built OpenJDK with Zero VM as a test. It was successful.
> > > $ ./configure --with-jvm-variants=zero --enable-debug > $ make images > $ ./build/macosx-aarch64-zero-fastdebug/jdk/bin/java -version > openjdk version "23-internal" 2024-09-17 > OpenJDK Runtime Environment (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk) > OpenJDK 64-Bit Zero VM (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk, interpreted mode) > > It may be possible to remove the `#define noreg` as well because the CONSTANT_REGISTER_DECLARATION macro creates a variable named noreg, but I can't be sure. When I tried removing the noreg definition and building the OpenJDK, the build was successful. Looks good. Some related issues: [JDK-8269122](https://bugs.openjdk.org/browse/JDK-8269122) [JDK-8282085](https://bugs.openjdk.org/browse/JDK-8282085) [JDK-8200168](https://bugs.openjdk.org/browse/JDK-8200168) [JDK-8297445](https://bugs.openjdk.org/browse/JDK-8297445) Please fix the title of the issue or this PR. ------------- Marked as reviewed by gli (Committer). PR Review: https://git.openjdk.org/jdk/pull/18115#pullrequestreview-1917283298 PR Comment: https://git.openjdk.org/jdk/pull/18115#issuecomment-1978967341 From duke at openjdk.org Wed Mar 6 02:25:51 2024 From: duke at openjdk.org (Joshua Cao) Date: Wed, 6 Mar 2024 02:25:51 GMT Subject: RFR: 8327201: C2: Uninitialized VLoop::_pre_loop_end after JDK-8324890 Message-ID: As Aleksey pointed out, the issue seems innocuous. It seems that all code that uses `pre_loop_end` is called from the main loop, and the field is always initialized for main loops. But we should still avoid uninitialized fields. Passing hotspot tier1 locally on my Linux machine.
------------- Commit messages: - 8327201: C2: Uninitialized VLoop::_pre_loop_end after JDK-8324890 Changes: https://git.openjdk.org/jdk/pull/18130/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18130&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327201 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18130.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18130/head:pull/18130 PR: https://git.openjdk.org/jdk/pull/18130 From amitkumar at openjdk.org Wed Mar 6 02:42:46 2024 From: amitkumar at openjdk.org (Amit Kumar) Date: Wed, 6 Mar 2024 02:42:46 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: <1xaaqBkBXk0Cor7yFYD4yMht4UOHuR-u5W5lilxr2R0=.9e9fe3c0-03c2-4f83-a0f3-9a06cdd8dacb@github.com> References: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> <-otSCbPTibs2l97RfYxBNnTGUjRnUft6UrIVk33woSQ=.c1604e01-0f8b-46ea-9cd8-809775f669b9@github.com> <1xaaqBkBXk0Cor7yFYD4yMht4UOHuR-u5W5lilxr2R0=.9e9fe3c0-03c2-4f83-a0f3-9a06cdd8dacb@github.com> Message-ID: On Tue, 5 Mar 2024 06:44:44 GMT, kuaiwei wrote: >>> @dholmes-ora asked to disable warning by default and I agree. >> >> What I said was that until all the known issues are resolved then the warning should be disabled. If this PR fixes every warning that has been spotted then that is fine - the warning can remain on to detect new problems creeping in. Otherwise issues should be filed to fix all remaining warnings and the warning disabled until they are all addressed. We have been lucky that these unexpected warnings have only caused minimal disruption to our builds. > >> @dholmes-ora asked to disable warning by default and I agree. >> >> We can use `AD._disable_warnings` flag to guard these warnings and add corresponding `-w` flag to `adlc` command in `GensrcAdlc.gmk` > > I added check of _disable_warnings in adlc. But not enable it in build script. 
@kuaiwei you need one more approval from **R**eviewer, before integrating hotspot change. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1979975733 From duke at openjdk.org Wed Mar 6 04:03:45 2024 From: duke at openjdk.org (kuaiwei) Date: Wed, 6 Mar 2024 04:03:45 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: <1xaaqBkBXk0Cor7yFYD4yMht4UOHuR-u5W5lilxr2R0=.9e9fe3c0-03c2-4f83-a0f3-9a06cdd8dacb@github.com> References: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> <-otSCbPTibs2l97RfYxBNnTGUjRnUft6UrIVk33woSQ=.c1604e01-0f8b-46ea-9cd8-809775f669b9@github.com> <1xaaqBkBXk0Cor7yFYD4yMht4UOHuR-u5W5lilxr2R0=.9e9fe3c0-03c2-4f83-a0f3-9a06cdd8dacb@github.com> Message-ID: On Tue, 5 Mar 2024 06:44:44 GMT, kuaiwei wrote: >>> @dholmes-ora asked to disable warning by default and I agree. >> >> What I said was that until all the known issues are resolved then the warning should be disabled. If this PR fixes every warning that has been spotted then that is fine - the warning can remain on to detect new problems creeping in. Otherwise issues should be filed to fix all remaining warnings and the warning disabled until they are all addressed. We have been lucky that these unexpected warnings have only caused minimal disruption to our builds. > >> @dholmes-ora asked to disable warning by default and I agree. >> >> We can use `AD._disable_warnings` flag to guard these warnings and add corresponding `-w` flag to `adlc` command in `GensrcAdlc.gmk` > > I added check of _disable_warnings in adlc. But not enable it in build script. > @kuaiwei you need one more approval from **R**eviewer, before integrating hotspot change. Ok , I will wait for another review. May I rollback the integrate request? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1980048043 From gli at openjdk.org Wed Mar 6 04:08:46 2024 From: gli at openjdk.org (Guoxiong Li) Date: Wed, 6 Mar 2024 04:08:46 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v3] In-Reply-To: References: <4t13JRtX8t_f_tLPm11vACsLYYV_n8_H_13DYrPV-ZE=.998a0628-2da0-4094-8f7e-9e992dbe27a1@github.com> <-otSCbPTibs2l97RfYxBNnTGUjRnUft6UrIVk33woSQ=.c1604e01-0f8b-46ea-9cd8-809775f669b9@github.com> <1xaaqBkBXk0Cor7yFYD4yMht4UOHuR-u5W5lilxr2R0=.9e9fe3c0-03c2-4f83-a0f3-9a06cdd8dacb@github.com> Message-ID: On Wed, 6 Mar 2024 04:01:05 GMT, kuaiwei wrote: > Ok , I will wait for another review. May I rollback the integrate request? You can use the command `/reviewers 2 reviewer` to impose restriction. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1980052009 From vlivanov at openjdk.org Wed Mar 6 04:31:45 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 6 Mar 2024 04:31:45 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v4] In-Reply-To: <8heX7a8FEHXGS5WM2ryS63uCJmQcsATp-hULdo87IDw=.f8d56906-dd4a-40ac-9538-6affe9b5353c@github.com> References: <8heX7a8FEHXGS5WM2ryS63uCJmQcsATp-hULdo87IDw=.f8d56906-dd4a-40ac-9538-6affe9b5353c@github.com> Message-ID: <3v2QNFwDKc2i5JDyiacYG6z4uLu4kUdvdjZGdzmryo4=.e84c917a-4379-4102-856c-b929a0ea384b@github.com> On Tue, 5 Mar 2024 06:40:00 GMT, kuaiwei wrote: >> Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. >> I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > 1 check _disable_warnings in adlc 2 Fix error in arm_32.ad Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/18075#pullrequestreview-1918722448 From vlivanov at openjdk.org Wed Mar 6 04:35:46 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 6 Mar 2024 04:35:46 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v2] In-Reply-To: References: <6NCooweiav2igPTPCfS7qaOvEaBwcbmS8fF5QlVNRaw=.cdfd4e99-8e97-425f-96de-cb75f7f11eea@github.com> Message-ID: <_kIYOZBn9wfHo5YoIxkOul4P0sZJXEZ4fAbsczPxy_Q=.4f13a415-fc2e-4f1f-86c3-e648537c2abf@github.com> On Mon, 4 Mar 2024 18:18:29 GMT, Vladimir Kozlov wrote: >>> For aarch64 do we need also change _.m4 files? Anything in aarch64_vector_ files? >>> >>> I will run our testing with current patch. >> >> I'm not clear about m4 file. How do we use it? >> The build script of jdk will combine all ad file into one single file and adlc will compile it. So aarch64_vector.ad will be checked as well. > >> I'm not clear about m4 file. How do we use it? >> The build script of jdk will combine all ad file into one single file and adlc will compile it. So aarch64_vector.ad will be checked as well. > > Aarch64 m4 files are used to manually update .ad files. My concern was that could be overlapped code in m4 files which may overwrite your changes in .ad when someone do such manual update in a future. > Fortunately `aarch64_ad.m4` does not have operand definitions so your changes are fine. > But `aarch64_vector_ad.m4` has them so if we need to change `aarch64_vector.ad` we need to modify m4 file too. No need to retract integration request. As the bot reported earlier, you need a Commiter to sponsor the PR. But, please, wait until @vnkozlov confirms that testing results are good. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1980071400 From gcao at openjdk.org Wed Mar 6 05:15:05 2024 From: gcao at openjdk.org (Gui Cao) Date: Wed, 6 Mar 2024 05:15:05 GMT Subject: RFR: 8327426: RISC-V: Move alignment shim into initialize_header() in C1_MacroAssembler::allocate_array Message-ID: Hi, I noticed that RISC-V missed this change from https://github.com/openjdk/jdk/pull/11044, comments as follows [1]: `I know @albertnetymk already touched on this but some thoughts on the unclear boundaries between the header and the data. My feeling is that the most pragmatic solution would be to have the header initialization always initialize up to the word aligned (up) header_size_in_bytes. (Similarly to how it is done for the instanceOop where the klass gap gets initialized with the header, even if it may be data.) And have the body initialization do the rest (word aligned to word aligned clear).` `This seems preferable than adding these extra alignment shims in-between the header and body/payload/data initialization. (I also tried moving the alignment fix into the body initialization, but it seems a little bit messier in the implementation.)` After this patch, it will be more consistent with other CPU platforms like X86 and ARM64.
[1] https://github.com/openjdk/jdk/pull/11044#pullrequestreview-1894323275 ### Tests - [x] Run tier1-3 tests on SiFive unmatched (release) ------------- Commit messages: - 8327426: RISC-V: Move alignment shim into initialize_header() in C1_MacroAssembler::allocate_array Changes: https://git.openjdk.org/jdk/pull/18131/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18131&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327426 Stats: 16 lines in 1 file changed: 6 ins; 7 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18131.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18131/head:pull/18131 PR: https://git.openjdk.org/jdk/pull/18131 From jkarthikeyan at openjdk.org Wed Mar 6 06:13:02 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 6 Mar 2024 06:13:02 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v6] In-Reply-To: References: Message-ID: > Hi all, I've created this patch which aims to convert common integer minimum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`. While this transform doesn't generally improve code generation on its own, it simplifies control flow and creates new opportunities for vectorization. > > I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: > > Baseline Patch Improvement
> Benchmark Mode Cnt Score Error Units Score Error Units
> IfMinMax.testReductionInt avgt 15 500.307 ± 16.687 ns/op 509.383 ± 32.645 ns/op (no change)*
> IfMinMax.testReductionLong avgt 15 493.184 ± 17.596 ns/op 513.587 ± 28.339 ns/op (no change)*
> IfMinMax.testSingleInt avgt 15 3.588 ± 0.540 ns/op 2.965 ± 1.380 ns/op (no change)
> IfMinMax.testSingleLong avgt 15 3.673 ± 0.128 ns/op 3.506 ± 0.590 ns/op (no change)
> IfMinMax.testVectorInt avgt 15 340.425 ± 13.123 ns/op 59.689 ±
7.509 ns/op + 5.7x > IfMinMax.testVectorLong avgt 15 326.420 ± 15.554 ns/op 117.190 ± 5.622 ns/op + 2.8x > > * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? > > The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated! Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Change transform to work on CMoves ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17574/files - new: https://git.openjdk.org/jdk/pull/17574/files/76424e28..2adebb73 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17574&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17574&range=04-05 Stats: 155 lines in 3 files changed: 78 ins; 69 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/17574.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17574/head:pull/17574 PR: https://git.openjdk.org/jdk/pull/17574 From jkarthikeyan at openjdk.org Wed Mar 6 06:13:02 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 6 Mar 2024 06:13:02 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v3] In-Reply-To: <-Cwct-5ZBYHEG-67r6xe4by7s0rI7w27ogfdJcIEBrw=.e4c95ed8-6353-46b9-a946-3bf2b2c47765@github.com> References: <2sBnEs205q7O3Fn1_ZRh_KfM3Yb3q6p8f7tLSwDvcOc=.ea552196-61b9-4fa6-980f-54f280ea5662@github.com> <_yIQLmJFXOolbLAS8Wcxgl1juRlQwB0OWkKd8ZMcfmg=.9ed4a52d-9ffb-45eb-a0dc-7b3201974882@github.com> <-Cwct-5ZBYHEG-67r6xe4by7s0rI7w27ogfdJcIEBrw=.e4c95ed8-6353-46b9-a946-3bf2b2c47765@github.com> Message-ID: On Tue, 5 Mar 2024 11:14:51 GMT, Emanuel Peter wrote: >>> You mean you would be matching for a `Cmp -> CMove` node pattern that is equivalent for `Min/Max`, rather than matching a `Cmp -> If -> Phi` pattern?
>> >> Yeah, I was thinking it might be better to let the CMove transform happen first, since the conditions guarding both transforms are aiming to do the same thing in essence. My thought was that if the regression in your `testCostDifference` was fixed, it would be better to not have to do that fix in two different locations, since it impacts `is_minmax` as well. >> >>> BTW, I watched a fascinating talk about branch-predictors / branchless code yesterday >> >> Thank you for linking this talk, it was really insightful! I also wonder if it would be possible to capture branch execution patterns somehow, to drive branch flattening optimizations. I figure it could be possible to keep track of the sequence of a branch's history of execution, and then compute some "entropy" value from that sequence to determine if there's a pattern, or if it's random and likely to be mispredicted. However, implementing that in practice sounds pretty difficult. >> >> @eme64 I've pushed a commit that fixes the benchmarks and sets the loop iteration count to 10_000. Could you check if this lets it vectorize on your machine? Thanks! > > @jaskarth Why don't you first make the code change with starting from a `Cmp -> CMove` pattern rather than the `Cmp -> If -> Phi` pattern. Then I can look at both things together ;) @eme64 Sure, I've updated the patch accordingly :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1980154585 From gli at openjdk.org Wed Mar 6 06:19:49 2024 From: gli at openjdk.org (Guoxiong Li) Date: Wed, 6 Mar 2024 06:19:49 GMT Subject: RFR: 8327379: Make TimeLinearScan a develop flag [v2] In-Reply-To: References: <0qF035kq8V0ehMhAdf1_ptnYwXHdtifMIazkoSn5laI=.54fb7a02-ca09-4f25-a920-1b13ff49ff78@github.com> Message-ID: On Tue, 5 Mar 2024 16:13:09 GMT, Denghui Dong wrote: >> Hi, >> >> Please help review this change that makes TimeLinearScan a develop flag. >> >> Currently, TimeLinearScan is only used in code guarded by '#ifndef PRODUCT'. 
We should move it to develop or maybe notproduct. > Denghui Dong has updated the pull request incrementally with one additional commit since the last revision: > > update header The patch looks good. But I don't really know whether it is worth doing. ------------- Marked as reviewed by gli (Committer). PR Review: https://git.openjdk.org/jdk/pull/18125#pullrequestreview-1918830275 From chagedorn at openjdk.org Wed Mar 6 07:07:44 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 6 Mar 2024 07:07:44 GMT Subject: RFR: 8327201: C2: Uninitialized VLoop::_pre_loop_end after JDK-8324890 In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 02:21:10 GMT, Joshua Cao wrote: > As Aleksey pointed out, the issue seems innocuous. It seems that all code that uses `pre_loop_end` is called from the main loop, and the field is always initialized for main loops. But we should still avoid uninitialized fields. > > Passing hotspot tier1 locally on my Linux machine. Looks good and trivial. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18130#pullrequestreview-1918916720 From roberto.castaneda.lozano at oracle.com Wed Mar 6 08:58:36 2024 From: roberto.castaneda.lozano at oracle.com (Roberto Castaneda Lozano) Date: Wed, 6 Mar 2024 08:58:36 +0000 Subject: [External] : Re: A case where G1/Shenandoah satb barrier is not optimized? In-Reply-To: References: <4d7f6d11-824b-47d0-8419-06694f695745.yude.lyd@alibaba-inc.com>, Message-ID: The JEP is still in draft mode, so no targeted JDK release yet. My hope is that it will be accepted as a Candidate JEP in the upcoming weeks. Cheers, Roberto ________________________________________ From: Yude Lin Sent: Wednesday, March 6, 2024 9:22 AM To: Roberto Castaneda Lozano; hotspot-gc-dev; hotspot-compiler-dev at openjdk.org Subject: [External] : Re: A case where G1/Shenandoah satb barrier is not optimized? Thanks Roberto and Thomas.
By the way, is late barrier expansion aiming at a certain JDK release? Cheers, Yude ------------------------------------------------------------------ From: Roberto Castaneda Lozano Send Time: Tuesday, March 5, 2024 18:19 To: hotspot-gc-dev; Yude Lin; hotspot-compiler-dev at openjdk.org Subject: Re: A case where G1/Shenandoah satb barrier is not optimized? Hi Yude Lin (including hotspot-compiler-dev mailing list), From what I read in the original JBS issue [1], the g1_can_remove_pre_barrier/g1_can_remove_post_barrier optimization targets writes within simple constructors (such as that of Node within java.util.HashMap [2]), and seems to assume that the situation you describe (several writes to the same field) is either uncommon within this scope or can be reduced by the compiler into a form that is optimizable. In your example, one would hope that the compiler proves that 'ref = a' is redundant and optimizes it away (which would lead to removing all barriers), but this optimization is inhibited by the barrier operations inserted by the compiler in its intermediate representation. These limitations will become easier to overcome with the "Late G1 Barrier Expansion" JEP (in draft status), which proposes hiding barrier code from the compiler's transformations and optimizations [3]. In fact, our current "Late G1 Barrier Expansion" prototype does optimize 'ref = a' away, and removes all barriers in your example. Cheers, Roberto [1] https://bugs.openjdk.org/browse/JDK-8057737 [2] https://github.com/openjdk/jdk/blob/e9adcebaf242843fe2004b01747b5a930b62b291/src/java.base/share/classes/java/util/HashMap.java#L287-L292 [3] https://bugs.openjdk.org/browse/JDK-8322295 ________________________________________ From: hotspot-gc-dev on behalf of Yude Lin Sent: Monday, March 4, 2024 11:32 AM To: hotspot-gc-dev Subject: A case where G1/Shenandoah satb barrier is not optimized? Hi Dear GC devs, I found a case where GC barriers cannot be optimized out.
I wonder if anyone could enlighten me on this code: > G1BarrierSetC2::g1_can_remove_pre_barrier (or ShenandoahBarrierSetC2::satb_can_remove_pre_barrier) where there is a condition: > (captured_store == nullptr || captured_store == st_init->zero_memory()) on the store that can be optimized out. The comment says: > The compiler needs to determine that the object in which a field is about > to be written is newly allocated, and that no prior store to the same field > has happened since the allocation. But my understanding is satb barriers of any number of stores immediately (i.e., no in-between safepoints) after an allocation can be optimized out, same field or not. The "no prior store" condition confuses me. What's more, failing to optimize one satb barrier will prevent further barrier optimization that otherwise would be done (maybe due to control flow complexity from the satb barrier). An example would be:

    public static class TwoFieldObject {
        public Object ref;
        public Object ref2;

        public TwoFieldObject(Object a) {
            ref = a;
        }
    }

    public static Object testWrite(Object a, Object b, Object c) {
        TwoFieldObject tfo = new TwoFieldObject(a);
        tfo.ref = b;  // satb barrier of this store cannot be optimized out, and because of its existence, post barrier will also not be optimized out
        tfo.ref2 = c; // because of the previous store's barriers, pre/post barriers of this store will not be optimized out
        return tfo;
    }

From roland at openjdk.org Wed Mar 6 09:00:57 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 6 Mar 2024 09:00:57 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: On Tue, 5 Mar 2024 15:55:12 GMT, Emanuel Peter wrote: >> This is a feature requested by @RogerRiggs and
@cl4es . >> >> **Idea** >> >> Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. >> >> This patch here supports a few simple use-cases, like these: >> >> Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 >> >> Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 >> >> The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 >> >> **Details** >> >> This draft currently implements the optimization in an additional special IGVN phase: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 >> >> We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). 
During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. We essentially try to establish a chain of mergable stores: >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 >> >> Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either bot... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > a little bit of casting for debug printing code Do you intend to add an IR test case? src/hotspot/share/opto/compile.cpp line 2927: > 2925: } > 2926: > 2927: void Compile::gather_nodes_for_merge_stores(PhaseIterGVN &igvn) { This is going away, right? src/hotspot/share/opto/memnode.cpp line 2802: > 2800: StoreNode* use = can_merge_primitive_array_store_with_use(phase, true); > 2801: if (use != nullptr) { > 2802: return nullptr; Do you want to assert that the use is in the igvn worklist? src/hotspot/share/opto/memnode.cpp line 2971: > 2969: // The goal is to check if two such ArrayPointers are adjacent for a load or store. > 2970: // > 2971: // Note: we accumulate all constant offsets into constant_offset, even the int constant behind Is this really needed? For the patterns of interest, aren't the constant pushed down the chain of `AddP` nodes so the address is `(AddP base (AddP ...) constant)`? src/hotspot/share/opto/memnode.cpp line 3146: > 3144: Node* ctrl_s1 = s1->in(MemNode::Control); > 3145: Node* ctrl_s2 = s2->in(MemNode::Control); > 3146: if (ctrl_s1 != ctrl_s2) { Do you need to check that `ctrl_s1` and `ctrl_s2` are not null? I suppose this could be called on a dying part of the graph during igvn. 
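As an aside for readers skimming the thread, the Java-level shape that this store-merging targets can be sketched in plain Java. This is an illustrative sketch only (the class and method names are made up, and whether C2 actually merges the stores depends on the conditions discussed in this review): it writes an int into a byte array with four adjacent byte stores that split the value with shifts, the kind of chain the PR description says the optimization recognizes and collapses into one wider store.

```java
public class MergeStoresSketch {
    // Four adjacent byte stores; the shifts "split" v into its bytes,
    // and the optimization can undo the split and emit one int-sized store.
    public static void putIntLE(byte[] a, int offset, int v) {
        a[offset]     = (byte)  v;
        a[offset + 1] = (byte) (v >>  8);
        a[offset + 2] = (byte) (v >> 16);
        a[offset + 3] = (byte) (v >> 24);
    }

    public static void main(String[] args) {
        byte[] a = new byte[4];
        putIntLE(a, 0, 0x0A0B0C0D);
        // Bytes land in little-endian order: 0x0D, 0x0C, 0x0B, 0x0A.
        System.out.println(a[0] == 0x0D && a[1] == 0x0C && a[2] == 0x0B && a[3] == 0x0A);
    }
}
```

Previously, getting a single wide store for such a pattern required `Unsafe.putLongUnaligned` or `ByteArrayLittleEndian.setLong`; the point of the PR is that plain array stores like the above can be merged by the compiler itself.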
src/hotspot/share/opto/memnode.cpp line 3154: > 3152: } > 3153: ProjNode* other_proj = ctrl_s1->as_IfProj()->other_if_proj(); > 3154: if (other_proj->is_uncommon_trap_proj(Deoptimization::Reason_range_check) == nullptr || This could be a range check for an unrelated array I suppose. Does it matter? src/hotspot/share/opto/memnode.hpp line 578: > 576: > 577: Node* Ideal_merge_primitive_array_stores(PhaseGVN* phase); > 578: StoreNode* can_merge_primitive_array_store_with_use(PhaseGVN* phase, bool check_def); If I understand correctly you need the `check_def ` parameter to avoid having `can_merge_primitive_array_store_with_use` and `can_merge_primitive_array_store_with_def` call each other indefinitely. But if I was to write new code that takes advantage of one of the two methods, I think I would be puzzled that there's a `check_def` parameter. Passing `false` would be wrong then but maybe not immediately obvious. Maybe it would be better to have `can_merge_primitive_array_store_with_def` with no `check_def` parameter and have all the work done in a utility method that takes a `check_def` parameter (always `true` when called from `can_merge_primitive_array_store_with_def`) ------------- PR Review: https://git.openjdk.org/jdk/pull/16245#pullrequestreview-1919069208 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1514032470 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1514033069 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1514071393 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1514057889 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1514062586 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1514051769 From kvn at openjdk.org Wed Mar 6 09:12:46 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 6 Mar 2024 09:12:46 GMT Subject: RFR: 8327379: Make TimeLinearScan a develop flag [v2] In-Reply-To: References: 
<0qF035kq8V0ehMhAdf1_ptnYwXHdtifMIazkoSn5laI=.54fb7a02-ca09-4f25-a920-1b13ff49ff78@github.com> Message-ID: On Tue, 5 Mar 2024 16:13:09 GMT, Denghui Dong wrote: >> Hi, >> >> Please help review this change that makes TimeLinearScan a develop flag. >> >> Currently, TimeLinearScan is only used in code guarded by '#ifndef PRODUCT'. We should move it to develop or maybe notproduct. > > Denghui Dong has updated the pull request incrementally with one additional commit since the last revision: > > update header Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18125#pullrequestreview-1919164251 From shade at openjdk.org Wed Mar 6 09:21:51 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 6 Mar 2024 09:21:51 GMT Subject: RFR: 8327201: C2: Uninitialized VLoop::_pre_loop_end after JDK-8324890 In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 02:21:10 GMT, Joshua Cao wrote: > As Aleksey pointed out, the issue seems innocuous. It seems that all code that uses `pre_loop_end` are called from the main loop, and the field is always initialized for main loops. But we should still avoid uninitialized fields. > > Passing hotspot tier1 locally on my Linux machine. Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/18130#pullrequestreview-1919183962 From duke at openjdk.org Wed Mar 6 09:21:51 2024 From: duke at openjdk.org (Joshua Cao) Date: Wed, 6 Mar 2024 09:21:51 GMT Subject: Integrated: 8327201: C2: Uninitialized VLoop::_pre_loop_end after JDK-8324890 In-Reply-To: References: Message-ID: <92O2-fcIQrWJVcI8wTrbrVe35_gvMchEpoU6QCSAO-A=.10ccbbf1-d387-443b-bc10-672d649473e5@github.com> On Wed, 6 Mar 2024 02:21:10 GMT, Joshua Cao wrote: > As Aleksey pointed out, the issue seems innocuous. It seems that all code that uses `pre_loop_end` are called from the main loop, and the field is always initialized for main loops. But we should still avoid uninitialized fields. 
> > Passing hotspot tier1 locally on my Linux machine. This pull request has now been integrated. Changeset: fbb422ec Author: Joshua Cao Committer: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/fbb422ece7ff61bc10ebafe48ecb7f17ea315682 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod 8327201: C2: Uninitialized VLoop::_pre_loop_end after JDK-8324890 Reviewed-by: chagedorn, shade ------------- PR: https://git.openjdk.org/jdk/pull/18130 From epeter at openjdk.org Wed Mar 6 10:16:51 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Mar 2024 10:16:51 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: On Wed, 6 Mar 2024 08:36:28 GMT, Roland Westrelin wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> a little bit of casting for debug printing code > > src/hotspot/share/opto/compile.cpp line 2927: > >> 2925: } >> 2926: >> 2927: void Compile::gather_nodes_for_merge_stores(PhaseIterGVN &igvn) { > > This is going away, right? Good catch, it is now dead code! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1514201265 From duke at openjdk.org Wed Mar 6 10:52:59 2024 From: duke at openjdk.org (Oussama Louati) Date: Wed, 6 Mar 2024 10:52:59 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v2] In-Reply-To: References: Message-ID: > Completion of the first version of the migration for several tests. > > These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: > > - Generate constant pool entries filled with method handles and method types. 
> - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. > - Produce many invokedynamic instructions with a specific constant pool entry. Oussama Louati has updated the pull request incrementally with 12 additional commits since the last revision: - Refactor generateCPEntryData method signature - Delete HandleType.java as it's not used anymore change method signature - Refactor createThrowRuntimeExceptionCodeHelper method signature - Optimize imports and fix bugs - Refactor GenFullCP.java: Import cleanup and bug fixes - Refactor code to use ClassFile.of().parse() method in GenFullCP.java - Refactor generateCPEntryData method to use ClassModel and ClassFile APIs - refactor to remove unnecessary whitespaces - Refactor createThrowRuntimeExceptionCodeHelper method to use classfile API - Fix indentation in GenManyIndyCorrectBootstrap.java - ... and 2 more: https://git.openjdk.org/jdk/compare/47f24fb6...03a5e325 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17834/files - new: https://git.openjdk.org/jdk/pull/17834/files/47f24fb6..03a5e325 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=00-01 Stats: 494 lines in 7 files changed: 106 ins; 104 del; 284 mod Patch: https://git.openjdk.org/jdk/pull/17834.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834 PR: https://git.openjdk.org/jdk/pull/17834 From fyang at openjdk.org Wed Mar 6 11:37:45 2024 From: fyang at openjdk.org (Fei Yang) Date: Wed, 6 Mar 2024 11:37:45 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 In-Reply-To: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Tue, 5 Mar 2024 07:41:05 GMT, Gui 
Cao wrote: > Hi, please review this patch that fixes the minimal build failure for riscv. > > Error log for minimal build: > > Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 > gmake[3]: *** Waiting for unfinished jobs.... > gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 > gmake[2]: *** Waiting for unfinished jobs.... > ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) > > === Output from failing command(s) repeated here === > * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > > * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. > === End of repeated output === > > No indication of failed target found. > HELP: Try searching the build log for '] Error'. > HELP: Run 'make doctor' to diagnose build problems.
> > make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 > make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 > > > The root cause is that MaxVectorSize is only defined under COMPILER2, so we should use VM_Version::_initial_vector_length instead of MaxVectorSize. > > Testing: > > - [x] linux-riscv minimal fastdebug native build I agree with @robehn! We can put those function definitions into an ifdef COMPILER2_OR_JVMCI block to avoid such a problem. I don't see other uses of them for now. ------------- PR Review: https://git.openjdk.org/jdk/pull/18114#pullrequestreview-1919503958 From rehn at openjdk.org Wed Mar 6 12:23:45 2024 From: rehn at openjdk.org (Robbin Ehn) Date: Wed, 6 Mar 2024 12:23:45 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 In-Reply-To: References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Tue, 5 Mar 2024 12:52:29 GMT, Robbin Ehn wrote: >> Hi, please review this patch that fixes the minimal build failure for riscv. >> >> Error log for minimal build: >> >> Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 >> gmake[3]: *** Waiting for unfinished jobs....
>> ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) >> >> === Output from failing command(s) repeated here === >> * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> >> * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. >> === End of repeated output === >> >> No indication of failed target found. >> HELP: Try searching the build log for '] Error'. >> HELP: Run 'make doctor' to diagnose build problems. >> >> The root cause is that MaxVectorSize is only defined under COMPILER2, so we should use VM_Version::_initial_vector_length instead of MaxVectorSize. >> >> Testing: >> >> - [... > > The SHA intrinsics are only used in "LibraryCallKit::inline_digestBase_implCompress" and JVMCI. > So I think these (plus md5 and chacha) should be put into an ifdef COMPILER2_OR_JVMCI block. (I was going to do that but it slipped my mind) > > The MaxVectorSize is defined if JVMCI and/or C2 is defined: > `NOT_COMPILER2(product(intx, MaxVectorSize, 64,` > I agree with @robehn! We can put those function definitions into an ifdef COMPILER2_OR_JVMCI block to avoid such a problem. I don't see other uses of them for now.
This also makes it clear that C1/interpreter don't use them, hence if someone needs a speed up there they could try to make use of them. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18114#issuecomment-1980750493 From roland at openjdk.org Wed Mar 6 13:31:52 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 6 Mar 2024 13:31:52 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> Message-ID: On Mon, 26 Feb 2024 15:22:18 GMT, Emanuel Peter wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> 32 bit build fix > > src/hotspot/share/opto/compile.cpp line 2352: > >> 2350: if (failing()) return; >> 2351: >> 2352: inline_scoped_value_get_calls(igvn); > > Suggestion: > > inline_scoped_value_get_calls(igvn); > > Indentation was wrong. Was this a result of an IDE correcting the missing braces around the if above? The indentation in that part of `Compile::Optimize()` is wrong (only a single extra character indentation after opening brace at line 2327) and the indentation for that line is correct... but not in line with the code around it. I changed it but `Compile::Optimize()` is the one that would need to be fixed. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1514485609 From roland at openjdk.org Wed Mar 6 13:38:50 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 6 Mar 2024 13:38:50 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> Message-ID: On Mon, 26 Feb 2024 15:57:10 GMT, Emanuel Peter wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> 32 bit build fix > > src/hotspot/share/opto/loopPredicate.cpp line 1662: > >> 1660: T_ADDRESS, MemNode::unordered); >> 1661: _igvn.register_new_node_with_optimizer(handle_load); >> 1662: set_subtree_ctrl(handle_load, true); > > How impossible is it to share code with the similar code in `GraphKit`? We would need something like what is done for `Phase::gen_subtype_check()`, that is, move the code out of `GraphKit`, and we can't access the `GraphKit` helper methods (`basic_plus_adr()`, `make_load()` etc.), so the result would be less readable than the code in `GraphKit`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1514497015 From roland at openjdk.org Wed Mar 6 13:51:16 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 6 Mar 2024 13:51:16 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v10] In-Reply-To: References: Message-ID: <_AgilPfFs90WJmhoV-wkNSd8Rq6ojwfDNC2SYHgpbWQ=.728d3cb4-bca5-4831-8959-44198905e34f@github.com> > This change implements C2 optimizations for calls to > ScopedValue.get(). Indeed, in: > > > v1 = scopedValue.get(); > ... > v2 = scopedValue.get(); > > > `v2` can be replaced by `v1` and the second call to `get()` can be > optimized out.
That's true whatever is between the 2 calls unless a > new mapping for `scopedValue` is created in between (when that happens > no optimizations is performed for the method being compiled). Hoisting > a `get()` call out of loop for a loop invariant `scopedValue` should > also be legal in most cases. > > `ScopedValue.get()` is implemented in java code as a 2 step process. A > cache is attached to the current thread object. If the `ScopedValue` > object is in the cache then the result from `get()` is read from > there. Otherwise a slow call is performed that also inserts the > mapping in the cache. The cache itself is lazily allocated. One > `ScopedValue` can be hashed to 2 different indexes in the cache. On a > cache probe, both indexes are checked. As a consequence, the process > of probing the cache is a multi step process (check if the cache is > present, check first index, check second index if first index > failed). If the cache is populated early on, then when the method that > calls `ScopedValue.get()` is compiled, profile reports the slow path > as never taken and only the read from the cache is compiled. > > To perform the optimizations, I added 3 new node types to C2: > > - the pair > ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for > the cache probe > > - a cfg node ScopedValueGetResultNode to help locate the result of the > `get()` call in the IR graph. > > In pseudo code, once the nodes are inserted, the code of a `get()` is: > > > hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue) > if (hits_in_the_cache) { > res = ScopedValueGetLoadFromCache(hits_in_the_cache); > } else { > res = ..; //slow call possibly inlined. Subgraph can be arbitray complex > } > res = ScopedValueGetResult(res) > > > In the snippet: > > > v1 = scopedValue.get(); > ... 
> v2 = scopedValue.get(); > > > Replacing `v2` by `v1` is then done by starting from the > `ScopedValueGetResult` node for the second `get()` and looking for a > dominating `ScopedValueGetResult` for the same `ScopedValue` > object. When one is found, it is used as a replacement. Eliminating > the second `get()` call is achieved by making > `ScopedValueGetHitsInCache` always successful if there's a dominating > `ScopedValueGetResult` and replacing its companion > `ScopedValueGetLoadFromCache` by the dominating > `ScopedValueGetResult`. > > Hoisting a `g... Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits: - review - Merge branch 'master' into JDK-8320649 - review - 32 bit build fix - fix & test - Merge branch 'master' into JDK-8320649 - review - review comment - Merge branch 'master' into JDK-8320649 - Update src/hotspot/share/opto/callGenerator.cpp Co-authored-by: Emanuel Peter - ... and 6 more: https://git.openjdk.org/jdk/compare/0583f735...57592601 ------------- Changes: https://git.openjdk.org/jdk/pull/16966/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16966&range=09 Stats: 2656 lines in 39 files changed: 2587 ins; 29 del; 40 mod Patch: https://git.openjdk.org/jdk/pull/16966.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16966/head:pull/16966 PR: https://git.openjdk.org/jdk/pull/16966 From roland at openjdk.org Wed Mar 6 13:51:16 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 6 Mar 2024 13:51:16 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> Message-ID: On Fri, 16 Feb 2024 09:40:17 GMT, Roland Westrelin wrote: >> This change implements C2 optimizations for calls to >> ScopedValue.get(). 
Indeed, in: >> >> >> v1 = scopedValue.get(); >> ... >> v2 = scopedValue.get(); >> >> >> `v2` can be replaced by `v1` and the second call to `get()` can be >> optimized out. That's true whatever is between the 2 calls unless a >> new mapping for `scopedValue` is created in between (when that happens >> no optimizations is performed for the method being compiled). Hoisting >> a `get()` call out of loop for a loop invariant `scopedValue` should >> also be legal in most cases. >> >> `ScopedValue.get()` is implemented in java code as a 2 step process. A >> cache is attached to the current thread object. If the `ScopedValue` >> object is in the cache then the result from `get()` is read from >> there. Otherwise a slow call is performed that also inserts the >> mapping in the cache. The cache itself is lazily allocated. One >> `ScopedValue` can be hashed to 2 different indexes in the cache. On a >> cache probe, both indexes are checked. As a consequence, the process >> of probing the cache is a multi step process (check if the cache is >> present, check first index, check second index if first index >> failed). If the cache is populated early on, then when the method that >> calls `ScopedValue.get()` is compiled, profile reports the slow path >> as never taken and only the read from the cache is compiled. >> >> To perform the optimizations, I added 3 new node types to C2: >> >> - the pair >> ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for >> the cache probe >> >> - a cfg node ScopedValueGetResultNode to help locate the result of the >> `get()` call in the IR graph. >> >> In pseudo code, once the nodes are inserted, the code of a `get()` is: >> >> >> hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue) >> if (hits_in_the_cache) { >> res = ScopedValueGetLoadFromCache(hits_in_the_cache); >> } else { >> res = ..; //slow call possibly inlined. 
Subgraph can be arbitray complex >> } >> res = ScopedValueGetResult(res) >> >> >> In the snippet: >> >> >> v1 = scopedValue.get(); >> ... >> v2 = scopedValue.get(); >> >> >> Replacing `v2` by `v1` is then done by starting from the >> `ScopedValueGetResult` node for the second `get()` and looking for a >> dominating `ScopedValueGetResult` for the same `ScopedValue` >> object. When one is found, it is used as a replacement. Eliminating >> the second `get()` call is achieved by making >> `ScopedValueGetHitsInCache` always successful if there's a dominating >> `Scoped... > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > 32 bit build fix I pushed a new set of changes that: 1) address most of your comments 2) fix the merge conflict. I didn't make the change you suggested to the comments because, for pattern matching, I use the actual java code from `ScopedValue.get()`. I think it's easier that way to see what's being pattern matched. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-1980916522 From roland at openjdk.org Wed Mar 6 13:51:16 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 6 Mar 2024 13:51:16 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> Message-ID: On Mon, 26 Feb 2024 16:09:09 GMT, Emanuel Peter wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> 32 bit build fix > > src/hotspot/share/opto/loopTransform.cpp line 3790: > >> 3788: phase->do_peeling(this, old_new); >> 3789: return false; >> 3790: } > > Just because I'm curious: why do the other places not already peel these loops? I.e. why do we need this here? 
Peeling looks for a loop invariant condition with one branch that exits the loop because then peeling makes the test in the loop body redundant with the one in the peeled iteration. Here, if there's a `ScopedValue.get()` on a loop invariant `ScopedValue` object, peeling one iteration will make `ScopedValue.get()` in the loop body redundant with the one in the peeled iteration. So it's not quite the same, at least, because for `ScopedValue.get()` the optimization applies whether `ScopedValue.get()` causes an exit of the loop or not. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1514512542 From roland at openjdk.org Wed Mar 6 14:00:45 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 6 Mar 2024 14:00:45 GMT Subject: RFR: 8323972: C2 compilation fails with assert(!x->as_Loop()->is_loop_nest_inner_loop()) failed: loop was transformed In-Reply-To: References: Message-ID: On Wed, 28 Feb 2024 08:58:23 GMT, Christian Hagedorn wrote: >> Long counted loop are transformed into a loop nest of 2 "regular" >> loops and in a subsequent loop opts round, the inner loop is >> transformed into a counted loop. The limit for the inner loop is set, >> when the loop nest is created, so it's expected there's no need for a >> loop limit check when the counted loop is created. The assert fires >> because, when the counted loop is created, it is found that it needs a >> loop limit check. The reason for that is that the limit is >> transformed, between nest creation and counted loop creation, in a way >> that the range of values of the inner loop's limit becomes >> unknown. 
The limit when the nest is created is: >> >> >> 111 ConL === 0 [[ 112 ]] #long:-9223372034707292158 >> 106 Phi === 105 20 94 [[ 112 ]] #long:9223372034707292160..9223372034707292164:www !orig=72 !jvms: TestInaccurateInnerLoopLimit::test @ bci:12 (line 40) >> 112 AddL === _ 106 111 [[ 122 ]] !orig=[110] >> 122 ConvL2I === _ 112 [[ ]] #int >> >> >> The type of 122 is `2..6` but it is then transformed to: >> >> >> 106 Phi === 105 20 154 [[ 191 130 137 ]] #long:9223372034707292160..9223372034707292164:www !orig=[72] !jvms: TestInaccurateInnerLoopLimit::test @ bci:12 (line 40) >> 191 ConvL2I === _ 106 [[ 196 ]] #int >> 195 ConI === 0 [[ 196 ]] #int:max-1 >> 196 SubI === _ 195 191 [[ 201 127 ]] !orig=[123] >> >> >> That is the `(ConvL2I (AddL ...))` is transformed into a `(SubI >> (ConvL2I ))`. `ConvL2I` for an input that's out of the int range of >> values returns TypeInt::INT and the bounds of the limit are lost. I >> propose adding a `CastII` after the `ConvL2I` so the range of values >> of the limit doesn't get lost. > > src/hotspot/share/opto/loopnode.cpp line 955: > >> 953: // opts pass, an accurate range of values for the limits is found. >> 954: const TypeInt* inner_iters_actual_int_range = TypeInt::make(0, iters_limit, Type::WidenMin); >> 955: inner_iters_actual_int = new CastIINode(outer_head, inner_iters_actual_int, inner_iters_actual_int_range, ConstraintCastNode::UnconditionalDependency); > > The fix idea looks reasonable to me. I have two questions: > - Do we really need to pin the `CastII` here? We have not pinned the `ConvL2I` before. And here I think we just want to ensure that the type is not lost. > - Related to the first question, could we just use a normal dependency instead? > > I was also wondering if we should try to improve the type of `ConvL2I` and of `Add/Sub` (and possibly also `Mul`) nodes in general? For `ConvL2I`, we could set a better type if we know that `(int)lo <= (int)hi` and `abs(hi - lo) <= 2^32`. 
We still have a problem to set a better type if we have a narrow range of inputs that includes `min` and `max` (e.g. `min+1, min, max, max-1`). In this case, `ConvL2I` just uses `int` as type. Then we could go a step further and do the same type optimization for `Add/Sub` nodes by directly looking through a convert/cast node at the input type. The resulting `Add/Sub` range could maybe be represented by something better than `int`: > > Example: > input type to `ConvL2I`: `[2147483647L, 2147483648L]` -> type of `ConvL2I` is `int` since we cannot represent "`[max_int, min_int]`" with two intervals otherwise. > `AddI` = `ConvL2I` + 2 -> type could be improved to `[min_int+1,min_int+2]`. > > > But that might exceed the scope of this fix. Going with `CastII` for now seems to be the least risk. Thanks for reviewing this. > The fix idea looks reasonable to me. I have two questions: > > * Do we really need to pin the `CastII` here? We have not pinned the `ConvL2I` before. And here I think we just want to ensure that the type is not lost. I think it's good practice to set the control of a cast node. It probably doesn't make much of a difference here but we had so many issues with cast nodes that not setting control on a cast makes me nervous now. > * Related to the first question, could we just use a normal dependency instead? The problem with a normal dependency is that initially the cast and its non-transformed input have the same types. So, there is a chance the cast is processed by igvn before its input changes and if that happens, the cast would then be removed. > I was also wondering if we should try to improve the type of `ConvL2I` and of `Add/Sub` (and possibly also `Mul`) nodes in general? For `ConvL2I`, we could set a better type if we know that `(int)lo <= (int)hi` and `abs(hi - lo) <= 2^32`. We still have a problem to set a better type if we have a narrow range of inputs that includes `min` and `max` (e.g. `min+1, min, max, max-1`).
In this case, `ConvL2I` just uses `int` as type. Then we could go a step further and do the same type optimization for `Add/Sub` nodes by directly looking through a convert/cast node at the input type. The resulting `Add/Sub` range could maybe be represented by something better than `int`: > > Example: input type to `ConvL2I`: `[2147483647L, 2147483648L]` -> type of `ConvL2I` is `int` since we cannot represent "`[max_int, min_int]`" with two intervals otherwise. `AddI` = `ConvL2I` + 2 -> type could be improved to `[min_int+1,min_int+2]`. > > But that might exceed the scope of this fix. Going with `CastII` for now seems to be the least risk. I thought about that too (I didn't go as far as you did though) and my conclusion is that the change I propose should be more robust (what if the improved type computation still misses some cases that we later find are required) and less risky.
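The range loss being discussed can be observed with plain Java casts (the constants mirror the node types quoted earlier in this thread; the demo class itself is hypothetical): casting the `AddL` result keeps a small range, while casting the `Phi` value first wraps to the far end of the int range, which is why `ConvL2I` can only answer `TypeInt::INT` there.

```java
public class ConvL2ISketch {
    public static void main(String[] args) {
        long phi = 9223372034707292160L;  // low end of the Phi's long range above
        long con = -9223372034707292158L; // the ConL constant

        // (ConvL2I (AddL phi con)): the long sum is in 2..6, so the cast keeps it.
        System.out.println((int) (phi + con)); // 2

        // ConvL2I(phi) alone: phi is far outside the int range, so the cast wraps.
        System.out.println((int) phi);         // -2147483648 (Integer.MIN_VALUE)
    }
}
```

Once the graph is rewritten to the `(SubI (ConvL2I ...))` shape, the narrow `2..6` bound is gone unless something like the proposed `CastII` pins it.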
We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). > > #### Share data graph cloning code - start from existing code > This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: > > 1. Collect data nodes to clone by using a node filter > 2. Clone the collected nodes (their data and control inputs still point to the old nodes) > 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. > > #### Shared data graph cloning class > Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: > > 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] > 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] > > `... 
Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: format ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18080/files - new: https://git.openjdk.org/jdk/pull/18080/files/14b46ba6..9a3d97e3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18080&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18080.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18080/head:pull/18080 PR: https://git.openjdk.org/jdk/pull/18080 From gcao at openjdk.org Wed Mar 6 14:29:59 2024 From: gcao at openjdk.org (Gui Cao) Date: Wed, 6 Mar 2024 14:29:59 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 [v2] In-Reply-To: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: > Hi, please review this patch that fixes the minimal build failure for riscv. > > Error log for minimal build: > > Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) > ^@/home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 > gmake[3]: *** Waiting for unfinished jobs....
> ^@gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 > gmake[2]: *** Waiting for unfinished jobs.... > ^@ > ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) > > === Output from failing command(s) repeated here === > * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > > * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. > === End of repeated output === > > No indication of failed target found. > HELP: Try searching the build log for '] Error'. > HELP: Run 'make doctor' to diagnose build problems. > > make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 > make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 > > > The root cause is that MaxVectorSize is only defined under COMPILER2, we should use VM_Version::_initial_vector_length instead of MaxVectorSize.
> > Testing: > > - [x] linux-riscv minimal fastdebug native build Gui Cao has updated the pull request incrementally with one additional commit since the last revision: Fix for robehn comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18114/files - new: https://git.openjdk.org/jdk/pull/18114/files/893da741..e4bc8405 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18114&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18114&range=00-01 Stats: 18 lines in 1 file changed: 17 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18114.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18114/head:pull/18114 PR: https://git.openjdk.org/jdk/pull/18114 From gcao at openjdk.org Wed Mar 6 14:37:01 2024 From: gcao at openjdk.org (Gui Cao) Date: Wed, 6 Mar 2024 14:37:01 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 [v3] In-Reply-To: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: > Hi, please review this patch that fixes the minimal build failure for riscv. > > Error log for minimal build: > > Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) > ^@/home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 > gmake[3]: *** Waiting for unfinished jobs....
> ^@gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 > gmake[2]: *** Waiting for unfinished jobs.... > ^@ > ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) > > === Output from failing command(s) repeated here === > * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > > * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. > === End of repeated output === > > No indication of failed target found. > HELP: Try searching the build log for '] Error'. > HELP: Run 'make doctor' to diagnose build problems. > > make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 > make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 > > > The root cause is that MaxVectorSize is only defined under COMPILER2, we should use VM_Version::_initial_vector_length instead of MaxVectorSize. > > Testing: > > - [x] linux-riscv minimal fastdebug native build Gui Cao has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR.
The pull request contains one new commit since the last revision: Fix for robehn comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18114/files - new: https://git.openjdk.org/jdk/pull/18114/files/e4bc8405..0c7a6780 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18114&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18114&range=01-02 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18114.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18114/head:pull/18114 PR: https://git.openjdk.org/jdk/pull/18114 From gcao at openjdk.org Wed Mar 6 14:50:47 2024 From: gcao at openjdk.org (Gui Cao) Date: Wed, 6 Mar 2024 14:50:47 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 In-Reply-To: References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Wed, 6 Mar 2024 12:20:43 GMT, Robbin Ehn wrote: >> The SHA intrinsic are only used in "LibraryCallKit::inline_digestBase_implCompress" and JVMCI. >> So I think these (plus md5 and chacha) should be put into a ifdef COMPILER2_OR_JVMCI block. (I was going todo that but it slipped my mind) >> >> The MaxVectorSize is defined if JVMCI and/or C2 is defined: >> `NOT_COMPILER2(product(intx, MaxVectorSize, 64,` > >> I agree with @robehn ! We can put those functions definitions into a ifdef COMPILER2_OR_JVMCI block to avoid such a problem. I don't see other uses of them for now. > > This also makes it clear that C1/interpreter don't use them, hence if someone needs a speed up there they could try to make use of them. @robehn @RealFYang Thanks for your review, I've put those functions definitions into a #if COMPILER2_OR_JVMCI block to avoid such a problem. The reason why #ifdef COMPILER2_OR_JVMCI is not used is because it might be defined as `#define COMPILER2_OR_JVMCI 0` Could you please look at it again? 
// COMPILER2 or JVMCI #if defined(COMPILER2) || INCLUDE_JVMCI #define COMPILER2_OR_JVMCI 1 #define COMPILER2_OR_JVMCI_PRESENT(code) code #define NOT_COMPILER2_OR_JVMCI(code) #define NOT_COMPILER2_OR_JVMCI_RETURN /* next token must be ; */ #define NOT_COMPILER2_OR_JVMCI_RETURN_(code) /* next token must be ; */ #else #define COMPILER2_OR_JVMCI 0 #define COMPILER2_OR_JVMCI_PRESENT(code) #define NOT_COMPILER2_OR_JVMCI(code) code #define NOT_COMPILER2_OR_JVMCI_RETURN {} #define NOT_COMPILER2_OR_JVMCI_RETURN_(code) { return code; } #endif ------------- PR Comment: https://git.openjdk.org/jdk/pull/18114#issuecomment-1981034579 From chagedorn at openjdk.org Wed Mar 6 14:52:46 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 6 Mar 2024 14:52:46 GMT Subject: RFR: 8323972: C2 compilation fails with assert(!x->as_Loop()->is_loop_nest_inner_loop()) failed: loop was transformed In-Reply-To: References: Message-ID: On Thu, 22 Feb 2024 14:36:52 GMT, Roland Westrelin wrote: > Long counted loops are transformed into a loop nest of 2 "regular" > loops and in a subsequent loop opts round, the inner loop is > transformed into a counted loop. The limit for the inner loop is set, > when the loop nest is created, so it's expected there's no need for a > loop limit check when the counted loop is created. The assert fires > because, when the counted loop is created, it is found that it needs a > loop limit check. The reason for that is that the limit is > transformed, between nest creation and counted loop creation, in a way > that the range of values of the inner loop's limit becomes > unknown.
The limit when the nest is created is: > > > 111 ConL === 0 [[ 112 ]] #long:-9223372034707292158 > 106 Phi === 105 20 94 [[ 112 ]] #long:9223372034707292160..9223372034707292164:www !orig=72 !jvms: TestInaccurateInnerLoopLimit::test @ bci:12 (line 40) > 112 AddL === _ 106 111 [[ 122 ]] !orig=[110] > 122 ConvL2I === _ 112 [[ ]] #int > > > The type of 122 is `2..6` but it is then transformed to: > > > 106 Phi === 105 20 154 [[ 191 130 137 ]] #long:9223372034707292160..9223372034707292164:www !orig=[72] !jvms: TestInaccurateInnerLoopLimit::test @ bci:12 (line 40) > 191 ConvL2I === _ 106 [[ 196 ]] #int > 195 ConI === 0 [[ 196 ]] #int:max-1 > 196 SubI === _ 195 191 [[ 201 127 ]] !orig=[123] > > > That is the `(ConvL2I (AddL ...))` is transformed into a `(SubI > (ConvL2I ))`. `ConvL2I` for an input that's out of the int range of > values returns TypeInt::INT and the bounds of the limit are lost. I > propose adding a `CastII` after the `ConvL2I` so the range of values > of the limit doesn't get lost. Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/17965#pullrequestreview-1919967526 From chagedorn at openjdk.org Wed Mar 6 14:52:47 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 6 Mar 2024 14:52:47 GMT Subject: RFR: 8323972: C2 compilation fails with assert(!x->as_Loop()->is_loop_nest_inner_loop()) failed: loop was transformed In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 13:57:53 GMT, Roland Westrelin wrote: > I think it's good practice to set the control of a cast node. It probably doesn't make much of a difference here but we had so many issues with cast nodes that not setting control on cast makes me nervous now. That is indeed a general problem. The situation certainly got better by removing the code that optimized cast nodes that were pinned at If Projections (https://github.com/openjdk/jdk/commit/7766785098816cfcdae3479540cdc866c1ed18ad). 
By pinning the casts now, you probably want to prevent the cast nodes from being pushed through other nodes such that they float "too high", causing unforeseeable data graph folding while control is not folded? > The problem with a normal dependency is that initially the cast and its non-transformed input have the same types. So, there is a chance the cast is processed by igvn before its input changes and if that happens, the cast would then be removed. I see, thanks for the explanation. Then it makes sense to keep the cast node no matter what. > I thought about that too (I didn't go as far as you did though) and my conclusion is that the change I propose should be more robust (what if the improved type computation still misses some cases that we later find are required) and less risky. I agree, this fix should use casts. Would be interesting to follow this idea in a separate RFE. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17965#discussion_r1514615913 From kvn at openjdk.org Wed Mar 6 17:04:51 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 6 Mar 2024 17:04:51 GMT Subject: RFR: 8326983: Unused operands reported after JDK-8326135 [v4] In-Reply-To: <8heX7a8FEHXGS5WM2ryS63uCJmQcsATp-hULdo87IDw=.f8d56906-dd4a-40ac-9538-6affe9b5353c@github.com> References: <8heX7a8FEHXGS5WM2ryS63uCJmQcsATp-hULdo87IDw=.f8d56906-dd4a-40ac-9538-6affe9b5353c@github.com> Message-ID:
------------- PR Comment: https://git.openjdk.org/jdk/pull/18075#issuecomment-1981360465 From duke at openjdk.org Wed Mar 6 17:04:52 2024 From: duke at openjdk.org (kuaiwei) Date: Wed, 6 Mar 2024 17:04:52 GMT Subject: Integrated: 8326983: Unused operands reported after JDK-8326135 In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 05:46:57 GMT, kuaiwei wrote: > Remove all unused operands reported by adlc. I'm testing x86_64 and aarch64. So far no failure found in tier1. > I tried to clean unused operands for all platform. A special case is immLRot2 in arm, it's not used, but appeared in many todo comments. So I keep it. This pull request has now been integrated. Changeset: e92ecd97 Author: Kuai Wei Committer: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/e92ecd9703e0a4f71d52a159516785a3eab5195a Stats: 993 lines in 10 files changed: 17 ins; 966 del; 10 mod 8326983: Unused operands reported after JDK-8326135 Reviewed-by: kvn, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/18075 From sviswanathan at openjdk.org Wed Mar 6 22:04:56 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 6 Mar 2024 22:04:56 GMT Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v2] In-Reply-To: References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: On Tue, 5 Mar 2024 22:37:49 GMT, Dean Long wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> unify the implementation > > So if we can still generate the non-AVX encoding of > > `roundsd dst, src, mode` > > isn't there still a false dependency problem with `dst`? @dean-long You bring up a very good point. The SSE instruction (roundsd dst, src, mode) also has a false dependency problem. 
This can be demonstrated by adding the following benchmark to MathBench.java: diff --git a/test/micro/org/openjdk/bench/java/lang/MathBench.java b/test/micro/org/openjdk/bench/java/lang/MathBench.java index c7dde019154..feb472bba3d 100644 --- a/test/micro/org/openjdk/bench/java/lang/MathBench.java +++ b/test/micro/org/openjdk/bench/java/lang/MathBench.java @@ -141,6 +141,11 @@ public double ceilDouble() { return Math.ceil(double4Dot1); } + @Benchmark + public double useAfterCeilDouble() { + return Math.ceil(double4Dot1) + Math.floor(double4Dot1); + } + @Benchmark public double copySignDouble() { return Math.copySign(double81, doubleNegative12); The fix would be to do a pxor on dst before the SSE roundsd instruction, something like below: diff --git a/src/hotspot/cpu/x86/x86.ad b/src/hotspot/cpu/x86/x86.ad index cf4aef83df2..eb6701f82a7 100644 --- a/src/hotspot/cpu/x86/x86.ad +++ b/src/hotspot/cpu/x86/x86.ad @@ -3874,6 +3874,9 @@ instruct roundD_reg(legRegD dst, legRegD src, immU8 rmode) %{ ins_cost(150); ins_encode %{ assert(UseSSE >= 4, "required"); + if ((UseAVX == 0) && ($dst$$XMMRegister != $src$$XMMRegister)) { + __ pxor($dst$$XMMRegister, $dst$$XMMRegister); + } __ roundsd($dst$$XMMRegister, $src$$XMMRegister, $rmode$$constant); %} ins_pipe(pipe_slow); ------------- PR Comment: https://git.openjdk.org/jdk/pull/18089#issuecomment-1981879809 From fyang at openjdk.org Thu Mar 7 02:32:53 2024 From: fyang at openjdk.org (Fei Yang) Date: Thu, 7 Mar 2024 02:32:53 GMT Subject: RFR: 8327426: RISC-V: Move alignment shim into initialize_header() in C1_MacroAssembler::allocate_array In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 04:06:47 GMT, Gui Cao wrote: > Hi, I noticed that RISC-V missed this change from #11044 [1]: > > `I know @albertnetymk already touched on this but some thoughts on the unclear boundaries between the header and the data. 
My feeling is that the most pragmatic solution would be to have the header initialization always initialize up to the word aligned (up) header_size_in_bytes. (Similarly to how it is done for the instanceOop where the klass gap gets initialized with the header, even if it may be data.) And have the body initialization do the rest (word aligned to word aligned clear).` > > `This seems preferable than adding these extra alignment shims in-between the header and body/payload/data initialization. (I also tried moving the alignment fix into the body initialization, but it seems a little bit messier in the implementation.)` > > > After this patch, it will be more consistent with other CPU platforms like X86 and ARM64. > > [1] https://github.com/openjdk/jdk/pull/11044#pullrequestreview-1894323275 > > ### Tests > > - [x] Run tier1-3 tests on SiFive unmatched (release) Thanks! ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18131#pullrequestreview-1921298462 From ddong at openjdk.org Thu Mar 7 03:00:00 2024 From: ddong at openjdk.org (Denghui Dong) Date: Thu, 7 Mar 2024 03:00:00 GMT Subject: RFR: 8327379: Make TimeLinearScan a develop flag [v2] In-Reply-To: References: <0qF035kq8V0ehMhAdf1_ptnYwXHdtifMIazkoSn5laI=.54fb7a02-ca09-4f25-a920-1b13ff49ff78@github.com> Message-ID: <3t8sJ2UoFZRS_8XbDrlNRawPW7ZgpPNwqgbyVUcJNiI=.d1f693bf-e8d6-4ce5-8596-019cae1918a1@github.com> On Wed, 6 Mar 2024 06:16:54 GMT, Guoxiong Li wrote: >> Denghui Dong has updated the pull request incrementally with one additional commit since the last revision: >> >> update header > > The patch looks good. But I don't really know whether it deserves to do that. @lgxbslgx @vnkozlov Thanks for the review. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18125#issuecomment-1982245847 From ddong at openjdk.org Thu Mar 7 03:00:00 2024 From: ddong at openjdk.org (Denghui Dong) Date: Thu, 7 Mar 2024 03:00:00 GMT Subject: Integrated: 8327379: Make TimeLinearScan a develop flag In-Reply-To: <0qF035kq8V0ehMhAdf1_ptnYwXHdtifMIazkoSn5laI=.54fb7a02-ca09-4f25-a920-1b13ff49ff78@github.com> References: <0qF035kq8V0ehMhAdf1_ptnYwXHdtifMIazkoSn5laI=.54fb7a02-ca09-4f25-a920-1b13ff49ff78@github.com> Message-ID: <9qY_rjrToWVdaI3scSYxILx8nCnPsxvPTPWQFgjGFTI=.6d23a92d-a168-4267-9406-2612907f54f1@github.com> On Tue, 5 Mar 2024 15:54:34 GMT, Denghui Dong wrote: > Hi, > > Please help review this change that makes TimeLinearScan a develop flag. > > Currently, TimeLinearScan is only used in code guarded by '#ifndef PRODUCT'. We should move it to develop or maybe notproduct. This pull request has now been integrated. Changeset: 40183412 Author: Denghui Dong URL: https://git.openjdk.org/jdk/commit/401834122dc3afb3feb9f7b31fc785de82ba2e58 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8327379: Make TimeLinearScan a develop flag Reviewed-by: gli, kvn ------------- PR: https://git.openjdk.org/jdk/pull/18125 From dlong at openjdk.org Thu Mar 7 03:15:55 2024 From: dlong at openjdk.org (Dean Long) Date: Thu, 7 Mar 2024 03:15:55 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v6] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:12:12 GMT, Galder Zamarreño wrote: >> Adding C1 intrinsic for primitive array clone invocations for aarch64 and x86 architectures. >> >> The intrinsic includes a change to avoid zeroing the newly allocated array because its contents are copied over within the same intrinsic with arraycopy. This means that the performance of primitive array clone exceeds that of primitive array copy.
As an example, here are the microbenchmark results on darwin/aarch64: >> >> >> $ make test TEST="micro:java.lang.ArrayClone" MICRO="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 3.476 ± 0.018 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 3.740 ± 0.017 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 7.124 ± 0.010 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 39.301 ± 0.106 ns/op >> ArrayClone.byteClone 0 avgt 15 3.478 ± 0.008 ns/op >> ArrayClone.byteClone 10 avgt 15 3.562 ± 0.007 ns/op >> ArrayClone.byteClone 100 avgt 15 5.888 ± 0.206 ns/op >> ArrayClone.byteClone 1000 avgt 15 25.762 ± 0.203 ns/op >> ArrayClone.intArraycopy 0 avgt 15 3.199 ± 0.016 ns/op >> ArrayClone.intArraycopy 10 avgt 15 4.521 ± 0.008 ns/op >> ArrayClone.intArraycopy 100 avgt 15 17.429 ± 0.039 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 178.432 ± 0.777 ns/op >> ArrayClone.intClone 0 avgt 15 3.406 ± 0.016 ns/op >> ArrayClone.intClone 10 avgt 15 4.272 ± 0.006 ns/op >> ArrayClone.intClone 100 avgt 15 13.110 ± 0.122 ns/op >> ArrayClone.intClone 1000 avgt 15 113.196 ± 13.400 ns/op >> >> >> It also includes an optimization to avoid instantiating the array copy stub in scenarios like this. >> >> I ran hotspot compiler tests successfully, limiting them to C1 compilation, on darwin/aarch64, linux/x86_64 and linux/686. E.g. >> >> >> $ make test TEST="hotspot_compiler" JTREG="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> ... >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:hotspot_compiler 1234 1234 0 0 >> >> >> One question I had is what to do about non-primitive object arrays, see my [question](https://bugs.openjdk.org/browse/JDK-8302850?focusedId=14634879&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14634879) on the issue. @cl4es any thoughts? >> >>... > > Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains ten commits: > > - Merge branch 'master' into topic.0131.c1-array-clone > - Reserve necessary frame map space for clone use cases > - 8302850: C1 primitive array clone intrinsic in graph > > * Combine array length, new type array and arraycopy for clone in c1 graph. > * Add OmitCheckFlags to skip arraycopy checks. > * Instantiate ArrayCopyStub only if necessary. > * Avoid zeroing newly created arrays for clone. > * Add array null after c1 clone compilation test. > * Pass force reexecute to intrinsic via value stack. > This is needed to be able to deoptimize correctly this intrinsic. > * When new type array or array copy are used for the clone intrinsic, > their state needs to be based on the state before for deoptimization > to work as expected. > - Revert "8302850: Primitive array copy C1 intrinsic for aarch64 and x86" > > This reverts commit fe5d916724614391a685bbef58ea939c84197d07. > - 8302850: Link code emit infos for null check and alloc array > - 8302850: Null check array before getting its length > > * Added a jtreg test to verify the null check works. > Without the fix this test fails with a SEGV crash. > - 8302850: Force reexecuting clone in case of a deoptimization > > * Copy state including locals for clone > so that reexecution works as expected. > - 8302850: Avoid instantiating array copy stub for clone use cases > - 8302850: Primitive array copy C1 intrinsic for aarch64 and x86 > > * Clone calls that involve Phi nodes are not supported. > * Add unimplemented stubs for other platforms. I'm looking at it again, and I'm trying to figure how we can minimize platform-specific changes. I'm hoping we can move some of the set_force_reexecute boiler-plate code into shared code. We probably don't need _force_reexecute in CodeEmitInfo anymore. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/17667#issuecomment-1982266410 From fyang at openjdk.org Thu Mar 7 04:24:56 2024 From: fyang at openjdk.org (Fei Yang) Date: Thu, 7 Mar 2024 04:24:56 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 [v3] In-Reply-To: References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: <-t4-oX3IpX98aNrYJrhKGFypWKcRZ6hLFmacEircyM4=.41bb78d2-80e2-40d4-b489-3ca99f3b297e@github.com> On Wed, 6 Mar 2024 14:37:01 GMT, Gui Cao wrote: >> Hi, please review this patch that fixes the minimal build failure for riscv. >> >> Error log for minimal build: >> >> Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) >> ^@/home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 >> gmake[3]: *** Waiting for unfinished jobs.... >> ^@gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 >> gmake[2]: *** Waiting for unfinished jobs....
>> ^@ >> ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) >> >> === Output from failing command(s) repeated here === >> * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> >> * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. >> === End of repeated output === >> >> No indication of failed target found. >> HELP: Try searching the build log for '] Error'. >> HELP: Run 'make doctor' to diagnose build problems. >> >> make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 >> make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 >> >> The root cause is that MaxVectorSize is only defined under COMPILER2, we should use VM_Version::_initial_vector_length instead of MaxVectorSize. >> >> Testing: >> >> - [... > Gui Cao has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > Fix for robehn comment Hi, I think we can simply move the sha256/512 part together with code for md5, chacha20 and sha1 and put them into a single #if COMPILER2_OR_JVMCI block. Thanks.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18114#issuecomment-1982321606 From gcao at openjdk.org Thu Mar 7 06:19:03 2024 From: gcao at openjdk.org (Gui Cao) Date: Thu, 7 Mar 2024 06:19:03 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 [v4] In-Reply-To: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: > Hi, please review this patch that fixes the minimal build failure for riscv. > > Error log for minimal build: > > Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) > ^@/home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 > gmake[3]: *** Waiting for unfinished jobs.... > ^@gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 > gmake[2]: *** Waiting for unfinished jobs....
> ^@ > ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) > > === Output from failing command(s) repeated here === > * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': > /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? > 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 > | ^~~~~~~~~~~~~ > | MaxNewSize > > * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. > === End of repeated output === > > No indication of failed target found. > HELP: Try searching the build log for '] Error'. > HELP: Run 'make doctor' to diagnose build problems. > > make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 > make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 > > > The root cause is that MaxVectorSize is only defined under COMPILER2, we should use VM_Version::_initial_vector_length instead of MaxVectorSize.
> > Testing: > > - [x] linux-riscv minimal fastdebug native build Gui Cao has updated the pull request incrementally with two additional commits since the last revision: - Move the sha256/512 part together with code for md5, chacha20 and sha1 and add put them into a single #if COMPILER2_OR_JVMCI block - Revert use VM_Version::_initial_vector_length instead of MaxVectorSize ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18114/files - new: https://git.openjdk.org/jdk/pull/18114/files/0c7a6780..a2199a46 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18114&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18114&range=02-03 Stats: 685 lines in 1 file changed: 317 ins; 334 del; 34 mod Patch: https://git.openjdk.org/jdk/pull/18114.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18114/head:pull/18114 PR: https://git.openjdk.org/jdk/pull/18114 From epeter at openjdk.org Thu Mar 7 06:55:59 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 06:55:59 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: <4xUS7qBreZ6-cAbHSsVRB0u8Nr_MQa3SdrGiG33Nkw4=.6dad9868-ca3d-4617-bfc1-911df4ed7c2d@github.com> On Wed, 6 Mar 2024 08:36:57 GMT, Roland Westrelin wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> a little bit of casting for debug printing code > > src/hotspot/share/opto/memnode.cpp line 2802: > >> 2800: StoreNode* use = can_merge_primitive_array_store_with_use(phase, true); >> 2801: if (use != nullptr) { >> 2802: return nullptr; > > Do you want to assert that the use is in the igvn worklist? Hmm. I think that would not be a good assert. Let's assume we have 4 stores that would merge. 
Then the last one of them does the merging, and replaces itself with the merged store. If `this` is the second store, then it could merge with its def (the first store). But since it has a use that could also be merged with, we delegate the merging down. But it is not done by the 3rd store, rather the 4th. So we cannot assert that the 3rd would be in the worklist. The 3rd may have been processed before, and determined that it does not want to idealize itself, and be removed from the worklist. Maybe I can improve the comment: // Merging is done by the last store in a chain. We have a use that could be merged with, so we // are not the last store, and hence must wait for some (recursive) use to do the merge. > src/hotspot/share/opto/memnode.cpp line 3146: > >> 3144: Node* ctrl_s1 = s1->in(MemNode::Control); >> 3145: Node* ctrl_s2 = s2->in(MemNode::Control); >> 3146: if (ctrl_s1 != ctrl_s2) { > > Do you need to check that `ctrl_s1` and `ctrl_s2` are not null? I suppose this could be called on a dying part of the graph during igvn. @rwestrel but then would they not be `TOP` rather than `nullptr`? > src/hotspot/share/opto/memnode.hpp line 578: > >> 576: >> 577: Node* Ideal_merge_primitive_array_stores(PhaseGVN* phase); >> 578: StoreNode* can_merge_primitive_array_store_with_use(PhaseGVN* phase, bool check_def); > > If I understand correctly you need the `check_def ` parameter to avoid having `can_merge_primitive_array_store_with_use` and `can_merge_primitive_array_store_with_def` call each other indefinitely. But if I was to write new code that takes advantage of one of the two methods, I think I would be puzzled that there's a `check_def` parameter. Passing `false` would be wrong then but maybe not immediately obvious. 
Maybe it would be better to have `can_merge_primitive_array_store_with_def` with no `check_def` parameter and have all the work done in a utility method that takes a `check_def` parameter (always `true` when called from `can_merge_primitive_array_store_with_def`) You are right, this is not the best code pattern. I'll refactor it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1515627795 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1515629804 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1515628308 From epeter at openjdk.org Thu Mar 7 06:58:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 06:58:58 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: On Wed, 6 Mar 2024 08:52:16 GMT, Roland Westrelin wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> a little bit of casting for debug printing code > > src/hotspot/share/opto/memnode.cpp line 3154: > >> 3152: } >> 3153: ProjNode* other_proj = ctrl_s1->as_IfProj()->other_if_proj(); >> 3154: if (other_proj->is_uncommon_trap_proj(Deoptimization::Reason_range_check) == nullptr || > > This could be a range check for an unrelated array I suppose. Does it matter? I don't think it matters, no. Do you see a scenario where it would matter? My argument: It is safe to do the stores after the RC rather than before it. And if the RC trap relies on the memory state of the stores that were before the RC, then those stores simply don't lose all their uses, and stay in the graph. After all, we only remove the "last" store by replacing it with the merged store, so the other stores only disappear if they have no other use. 
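For readers following the thread, the kind of adjacent-store chain this merge-stores optimization targets can be sketched in plain Java (illustrative class and method names, not taken from the patch; the patch's actual correctness tests live in test/hotspot/jtreg/compiler/c2/TestMergeStores.java):

```java
public class MergeStoresSketch {
    // Four adjacent byte stores with constant offsets from the same base.
    // This is the shape that C2's merge-stores optimization can rewrite into a
    // single 4-byte store (guarded by UseUnalignedAccesses, as discussed above).
    static void putIntLE(byte[] a, int i, int v) {
        a[i]     = (byte) v;
        a[i + 1] = (byte) (v >> 8);
        a[i + 2] = (byte) (v >> 16);
        a[i + 3] = (byte) (v >> 24);
    }

    // Read the value back, little-endian, to check the stores above.
    static int getIntLE(byte[] a, int i) {
        return (a[i] & 0xFF)
             | (a[i + 1] & 0xFF) << 8
             | (a[i + 2] & 0xFF) << 16
             | (a[i + 3] & 0xFF) << 24;
    }
}
```

Whether the four stores are actually merged is a JIT-level decision and not observable from Java; the snippet only illustrates the pattern being discussed.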
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1515632976 From gcao at openjdk.org Thu Mar 7 07:11:55 2024 From: gcao at openjdk.org (Gui Cao) Date: Thu, 7 Mar 2024 07:11:55 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 [v4] In-Reply-To: References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Thu, 7 Mar 2024 06:19:03 GMT, Gui Cao wrote: >> Hi, please review this patch that fixes the minimal build failure for riscv. >> >> Error log for minimal build: >> >> Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 >> gmake[3]: *** Waiting for unfinished jobs.... >> gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 >> gmake[2]: *** Waiting for unfinished jobs.... >> >> ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) >> >> === Output from failing command(s) repeated here === >> * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize'
was not declared in this scope; did you mean 'MaxNewSize'? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> >> * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. >> === End of repeated output === >> >> No indication of failed target found. >> HELP: Try searching the build log for '] Error'. >> HELP: Run 'make doctor' to diagnose build problems. >> >> make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 >> make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 >> >> >> The root cause is that MaxVectorSize is only defined under COMPILER2, so we should use VM_Version::_initial_vector_length instead of MaxVectorSize. >> >> Testing: >> >> - [... > > Gui Cao has updated the pull request incrementally with two additional commits since the last revision: > > - Move the sha256/512 part together with code for md5, chacha20 and sha1 and put them into a single #if COMPILER2_OR_JVMCI block > - Revert use VM_Version::_initial_vector_length instead of MaxVectorSize Hi, I've made the changes for the review, and the minimal/server build was successful. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18114#issuecomment-1982666432 From fyang at openjdk.org Thu Mar 7 07:14:54 2024 From: fyang at openjdk.org (Fei Yang) Date: Thu, 7 Mar 2024 07:14:54 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 [v4] In-Reply-To: References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Thu, 7 Mar 2024 06:19:03 GMT, Gui Cao wrote: >> Hi, please review this patch that fixes the minimal build failure for riscv.
>> >> Error log for minimal build: >> >> Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 >> gmake[3]: *** Waiting for unfinished jobs.... >> gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 >> gmake[2]: *** Waiting for unfinished jobs.... >> >> ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) >> >> === Output from failing command(s) repeated here === >> * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> >> * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. >> === End of repeated output === >> >> No indication of failed target found. >> HELP: Try searching the build log for '] Error'. >> HELP: Run 'make doctor' to diagnose build problems.
>> >> make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 >> make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 >> >> >> The root cause is that MaxVectorSize is only defined under COMPILER2, so we should use VM_Version::_initial_vector_length instead of MaxVectorSize. >> >> Testing: >> >> - [... > Gui Cao has updated the pull request incrementally with two additional commits since the last revision: > - Move the sha256/512 part together with code for md5, chacha20 and sha1 and put them into a single #if COMPILER2_OR_JVMCI block > - Revert use VM_Version::_initial_vector_length instead of MaxVectorSize Marked as reviewed by fyang (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/18114#pullrequestreview-1921613004 From epeter at openjdk.org Thu Mar 7 07:47:57 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 07:47:57 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v6] In-Reply-To: References: Message-ID: On Tue, 30 Jan 2024 14:43:41 GMT, Roland Westrelin wrote: >> @shipilev >> You are right, I need to guard the optimization with `UseUnalignedAccesses`. Just added it. Thank you! >> Probably my tests would have run into the `SIGBUS` you mentioned. >> >> About `InitializeNode::coalesce_subword_stores`: >> It only works on raw-stores, which write fields before the initialization of an object. It only works with constants. >> Hence, the pattern is quite different. >> Merging the two would be a lot of work. Too much for me for now. >> But maybe one day we can cover all these cases in a single optimization, that merges/coalesces all sorts of loads and stores, and essentially vectorizes any straight-line code, at least for loads and stores. >> For now, I just wanted to add the feature that @cl4es and @RogerRiggs were specifically asking for, which is merging array stores for constants and variables (using shift to split).
>> >> @rwestrel >> Ok. Well in that case I might have to make a more intelligent pointer-analysis, and parse past `ConvI2L` and `CastII` nodes. > >> Ok. Well in that case I might have to make a more intelligent pointer-analysis, and parse past ConvI2L and CastII nodes. > > Do you still need a traversal of the graph to find the Stores or can you enqueue them for post loop opts then? @rwestrel > Do you intend to add an IR test case? I already have IR tests that also do result verification: `test/hotspot/jtreg/compiler/c2/TestMergeStores.java` ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1982785244 From epeter at openjdk.org Thu Mar 7 07:47:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 07:47:58 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: On Wed, 6 Mar 2024 08:58:01 GMT, Roland Westrelin wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> a little bit of casting for debug printing code > > src/hotspot/share/opto/memnode.cpp line 2971: > >> 2969: // The goal is to check if two such ArrayPointers are adjacent for a load or store. >> 2970: // >> 2971: // Note: we accumulate all constant offsets into constant_offset, even the int constant behind > > Is this really needed? For the patterns of interest, aren't the constant pushed down the chain of `AddP` nodes so the address is `(AddP base (AddP ...) constant)`? No, they are not pushed down. 
Consider the access on an int array: `a[invar + 1]` -> `adr = base + ARRAY_INT_BASE_OFFSET + 4 * ConvI2L(invar + 1)` We cannot just push the constant `1` out of the `ConvI2L`, after all `invar + 1` could overflow in the int domain ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1515686654 From epeter at openjdk.org Thu Mar 7 07:51:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 07:51:58 GMT Subject: RFR: 8327172: C2 SuperWord: data node in loop has no input in loop: replace assert with bailout In-Reply-To: References: <4CbSaPG5DEt47GreYegAKBIbf8TJyTio7-y2gBg0WN8=.419560f1-cbda-41da-b68f-76d72ee9c489@github.com> Message-ID: On Tue, 5 Mar 2024 16:59:03 GMT, Christian Hagedorn wrote: >> This is a regression fix from https://github.com/openjdk/jdk/pull/17657. >> >> I had never encountered an example where a data node in the loop body did not have any input node in the loop. >> My assumption was that this should never happen, such a node should move out of the loop itself. >> >> I now encountered such an example. But I think it shows that there are cases where we compute the ctrl wrong. >> >> https://github.com/openjdk/jdk/blob/8835f786b8dc7db1ebff07bbb3dbb61a6c42f6c8/test/hotspot/jtreg/compiler/loopopts/superword/TestNoInputInLoop.java#L65-L73 >> >> I now had a few options: >> 1. Revert to the code before https://github.com/openjdk/jdk/pull/17657: handle such cases with the extra `data_entry` logic. But this would just be extra complexity for patterns that should not exist in the first place. >> 2. Fix the computation of ctrl. But we know that there are many edge cases that are currently wrong, and I am working on verification and fixing these issues in https://github.com/openjdk/jdk/pull/16558. So I would rather fix those pre-existing issues separately. >> 3. Just create a silent bailout from vectorization, with `VStatus::make_failure`.
>> >> I chose option 3, since it allows simple logic, and only prevents vectorization in cases that are already otherwise broken. > That looks reasonable. I agree to fix the ctrl issues separately and go with a bailout solution for now. Maybe you want to add a note at [JDK-8307982](https://bugs.openjdk.org/browse/JDK-8307982) to not forget about this case here. Thanks @chhagedorn @vnkozlov for the reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/18123#issuecomment-1982793118 From epeter at openjdk.org Thu Mar 7 07:51:59 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 07:51:59 GMT Subject: Integrated: 8327172: C2 SuperWord: data node in loop has no input in loop: replace assert with bailout In-Reply-To: <4CbSaPG5DEt47GreYegAKBIbf8TJyTio7-y2gBg0WN8=.419560f1-cbda-41da-b68f-76d72ee9c489@github.com> References: <4CbSaPG5DEt47GreYegAKBIbf8TJyTio7-y2gBg0WN8=.419560f1-cbda-41da-b68f-76d72ee9c489@github.com> Message-ID: On Tue, 5 Mar 2024 14:53:33 GMT, Emanuel Peter wrote: > This is a regression fix from https://github.com/openjdk/jdk/pull/17657. > > I had never encountered an example where a data node in the loop body did not have any input node in the loop. > My assumption was that this should never happen, such a node should move out of the loop itself. > > I now encountered such an example. But I think it shows that there are cases where we compute the ctrl wrong. > > https://github.com/openjdk/jdk/blob/8835f786b8dc7db1ebff07bbb3dbb61a6c42f6c8/test/hotspot/jtreg/compiler/loopopts/superword/TestNoInputInLoop.java#L65-L73 > > I now had a few options: > 1. Revert to the code before https://github.com/openjdk/jdk/pull/17657: handle such cases with the extra `data_entry` logic. But this would just be extra complexity for patterns that should not exist in the first place. > 2. Fix the computation of ctrl.
But we know that there are many edge cases that are currently wrong, and I am working on verification and fixing these issues in https://github.com/openjdk/jdk/pull/16558. So I would rather fix those pre-existing issues separately. > 3. Just create a silent bailout from vectorization, with `VStatus::make_failure`. > > I chose option 3, since it allows simple logic, and only prevents vectorization in cases that are already otherwise broken. This pull request has now been integrated. Changeset: f54e5983 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/f54e59835492e86b9178b2050901579707f41100 Stats: 103 lines in 3 files changed: 100 ins; 1 del; 2 mod 8327172: C2 SuperWord: data node in loop has no input in loop: replace assert with bailout Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/18123 From epeter at openjdk.org Thu Mar 7 07:56:03 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 07:56:03 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> Message-ID: <0RKnLUgc6UBtyxSyezCMWsSbP50hu6fQ6UJPHpGlgSU=.9fafa10f-62ee-4ec8-9093-4e204fcbe504@github.com> On Wed, 6 Mar 2024 13:48:36 GMT, Roland Westrelin wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> 32 bit build fix > > I pushed a new set of changes that: > 1) address most of your comments > 2) fix the merge conflict. > I didn't make the change you suggested to the comments because, for pattern matching, I use the actual java code from `ScopedValue.get()`. I think it's easier that way to see what's being pattern matched. @rwestrel nice! I'll run our testing again, now that it is merged. 
FYI: you have some whitespace issues in: `src/hotspot/share/opto/callGenerator.cpp` ------------- PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-1982809402 From rehn at openjdk.org Thu Mar 7 07:56:54 2024 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 7 Mar 2024 07:56:54 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 [v4] In-Reply-To: References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Thu, 7 Mar 2024 06:19:03 GMT, Gui Cao wrote: >> Hi, please review this patch that fixes the minimal build failure for riscv. >> >> Error log for minimal build: >> >> Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 >> gmake[3]: *** Waiting for unfinished jobs.... >> gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 >> gmake[2]: *** Waiting for unfinished jobs....
>> >> ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) >> >> === Output from failing command(s) repeated here === >> * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': >> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? >> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 >> | ^~~~~~~~~~~~~ >> | MaxNewSize >> >> * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. >> === End of repeated output === >> >> No indication of failed target found. >> HELP: Try searching the build log for '] Error'. >> HELP: Run 'make doctor' to diagnose build problems. >> >> make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 >> make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 >> >> >> The root cause is that MaxVectorSize is only defined under COMPILER2, so we should use VM_Version::_initial_vector_length instead of MaxVectorSize. >> >> Testing: >> >> - [... > Gui Cao has updated the pull request incrementally with two additional commits since the last revision: > - Move the sha256/512 part together with code for md5, chacha20 and sha1 and put them into a single #if COMPILER2_OR_JVMCI block > - Revert use VM_Version::_initial_vector_length instead of MaxVectorSize Thanks! ------------- Marked as reviewed by rehn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/18114#pullrequestreview-1921686927 From roland at openjdk.org Thu Mar 7 08:15:17 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 7 Mar 2024 08:15:17 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v9] In-Reply-To: <0RKnLUgc6UBtyxSyezCMWsSbP50hu6fQ6UJPHpGlgSU=.9fafa10f-62ee-4ec8-9093-4e204fcbe504@github.com> References: <5H6XV7Agl6ZNfGWT-bCbIPsimFTYM0pyIGiAHDQUUyA=.168e21cc-6cd8-42d8-ab59-d5e02e241ea2@github.com> <0RKnLUgc6UBtyxSyezCMWsSbP50hu6fQ6UJPHpGlgSU=.9fafa10f-62ee-4ec8-9093-4e204fcbe504@github.com> Message-ID: On Thu, 7 Mar 2024 07:53:02 GMT, Emanuel Peter wrote: > FYI: you have some whitespace issues in: `src/hotspot/share/opto/callGenerator.cpp` Thanks. I missed it. Fixed now. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-1982870654 From roland at openjdk.org Thu Mar 7 08:15:17 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 7 Mar 2024 08:15:17 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v11] In-Reply-To: References: Message-ID: > This change implements C2 optimizations for calls to > ScopedValue.get(). Indeed, in: > > > v1 = scopedValue.get(); > ... > v2 = scopedValue.get(); > > > `v2` can be replaced by `v1` and the second call to `get()` can be > optimized out. That's true whatever is between the 2 calls unless a > new mapping for `scopedValue` is created in between (when that happens > no optimization is performed for the method being compiled). Hoisting > a `get()` call out of a loop for a loop invariant `scopedValue` should > also be legal in most cases. > > `ScopedValue.get()` is implemented in java code as a 2 step process. A > cache is attached to the current thread object. If the `ScopedValue` > object is in the cache then the result from `get()` is read from > there. Otherwise a slow call is performed that also inserts the > mapping in the cache. The cache itself is lazily allocated.
One > `ScopedValue` can be hashed to 2 different indexes in the cache. On a > cache probe, both indexes are checked. As a consequence, the process > of probing the cache is a multi step process (check if the cache is > present, check first index, check second index if first index > failed). If the cache is populated early on, then when the method that > calls `ScopedValue.get()` is compiled, profile reports the slow path > as never taken and only the read from the cache is compiled. > > To perform the optimizations, I added 3 new node types to C2: > > - the pair > ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for > the cache probe > > - a cfg node ScopedValueGetResultNode to help locate the result of the > `get()` call in the IR graph. > > In pseudo code, once the nodes are inserted, the code of a `get()` is: > > > hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue) > if (hits_in_the_cache) { > res = ScopedValueGetLoadFromCache(hits_in_the_cache); > } else { > res = ..; //slow call possibly inlined. Subgraph can be arbitrarily complex > } > res = ScopedValueGetResult(res) > > > In the snippet: > > > v1 = scopedValue.get(); > ... > v2 = scopedValue.get(); > > > Replacing `v2` by `v1` is then done by starting from the > `ScopedValueGetResult` node for the second `get()` and looking for a > dominating `ScopedValueGetResult` for the same `ScopedValue` > object. When one is found, it is used as a replacement. Eliminating > the second `get()` call is achieved by making > `ScopedValueGetHitsInCache` always successful if there's a dominating > `ScopedValueGetResult` and replacing its companion > `ScopedValueGetLoadFromCache` by the dominating > `ScopedValueGetResult`. > > Hoisting a `g...
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: whitespaces ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16966/files - new: https://git.openjdk.org/jdk/pull/16966/files/57592601..361a6ab7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16966&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16966&range=09-10 Stats: 24 lines in 1 file changed: 0 ins; 0 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/16966.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16966/head:pull/16966 PR: https://git.openjdk.org/jdk/pull/16966 From epeter at openjdk.org Thu Mar 7 08:26:53 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 08:26:53 GMT Subject: RFR: 8325674: Constant fold across compares [v3] In-Reply-To: References: Message-ID: <-CsTGyIK4TUjYf3tHEdBQqivYrm3oju-J7rFgb9IvEw=.968e0f43-02eb-4491-9a37-31fc72be2445@github.com> On Mon, 26 Feb 2024 23:23:57 GMT, Joshua Cao wrote: >> For example, `x + 1 < 2` -> `x < 2 - 1` iff we can prove that `x + 1` does not overflow and `2 - 1` does not overflow. We can always fold if it is an `==` or `!=` since overflow will not affect the result of the comparison. >> >> Consider this more practical example: >> >> >> public void foo(int[] arr) { >> for (i = arr.length - 1; i >= 0; --i) { >> blackhole(arr[i]); >> } >> } >> >> >> C2 emits a loop guard that looks like `arr.length - 1 < 0`. We know `arr.length - 1` does not overflow because `arr.length` is positive. We can fold the comparison into `arr.length < 1`. We have to compute `arr.length - 1` if we enter the loop anyway, but we can avoid the subtraction computation if we never enter the loop. I believe the simplification can also help with stronger integer range analysis in https://bugs.openjdk.org/browse/JDK-8275202. >> >> Some additional notes: >> * there is various overflow checking code across `src/hotspot/share/opto`.
I separated out some of the functions from convertnode.cpp into `type.hpp`. Maybe the functions belong somewhere else? >> * there is a change in Parse::do_if() to repeatedly apply GVN until the test is canonical. We need multiple iterations in the case of `C1 > C2 - X` -> `C2 - X < C1` -> `C2 < X` -> `X > C2`. This fails the assertion if `BoolTest(btest).is_canonical()`. We can avoid this by applying GVN one more time to get `C2 < X`. >> * we should not transform loop backedge conditions. For example, if we have `for (i = 0; i < 10; ++i) {}`, the backedge condition is `i + 1 < 10`. If we transform it into `i < 9`, it messes with CountedLoop's recognition of induction variables and strides. >> * this change optimizes some of the equality checks in `TestUnsignedComparison.java` and breaks the IR checks. I removed those tests. > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > comments with explanations and style changes Ok. I discussed it quickly with @vnkozlov. He said we should be careful, and I'll have to run high tier testing on our side and some performance testing as well. But we can go ahead. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17853#issuecomment-1982905864 From epeter at openjdk.org Thu Mar 7 08:31:54 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 08:31:54 GMT Subject: RFR: 8325674: Constant fold across compares [v3] In-Reply-To: References: Message-ID: On Mon, 26 Feb 2024 23:23:57 GMT, Joshua Cao wrote: >> For example, `x + 1 < 2` -> `x < 2 - 1` iff we can prove that `x + 1` does not overflow and `2 - 1` does not overflow. We can always fold if it is an `==` or `!=` since overflow will not affect the result of the comparison. >> >> Consider this more practical example: >> >> >> public void foo(int[] arr) { >> for (i = arr.length - 1; i >= 0; --i) { >> blackhole(arr[i]); >> } >> } >> >> >> C2 emits a loop guard that looks like `arr.length - 1 < 0`.
>> We know `arr.length - 1` does not overflow because `arr.length` is positive. We can fold the comparison into `arr.length < 1`. We have to compute `arr.length - 1` if we enter the loop anyway, but we can avoid the subtraction computation if we never enter the loop. I believe the simplification can also help with stronger integer range analysis in https://bugs.openjdk.org/browse/JDK-8275202. >> >> Some additional notes: >> * there is various overflow checking code across `src/hotspot/share/opto`. I separated out some of the functions from convertnode.cpp into `type.hpp`. Maybe the functions belong somewhere else? >> * there is a change in Parse::do_if() to repeatedly apply GVN until the test is canonical. We need multiple iterations in the case of `C1 > C2 - X` -> `C2 - X < C1` -> `C2 < X` -> `X > C2`. This fails the assertion if `BoolTest(btest).is_canonical()`. We can avoid this by applying GVN one more time to get `C2 < X`. >> * we should not transform loop backedge conditions. For example, if we have `for (i = 0; i < 10; ++i) {}`, the backedge condition is `i + 1 < 10`. If we transform it into `i < 9`, it messes with CountedLoop's recognition of induction variables and strides. >> * this change optimizes some of the equality checks in `TestUnsignedComparison.java` and breaks the IR checks. I removed those tests. > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > comments with explanations and style changes src/hotspot/share/opto/subnode.cpp line 1586: > 1584: } > 1585: } > 1586: } This looks like heavy code duplication. Can you refactor this? Maybe a helper method? src/hotspot/share/opto/type.cpp line 1761: > 1759: } > 1760: return true; > 1761: } Do you maybe want to assert that no other opcode comes in? Or is there a need for non add/sub opcodes to be passed in?
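As a side note on the overflow condition from the quoted description, a tiny sketch (plain Java, illustrative names, not taken from the patch) shows why folding `x + 1 < 2` into `x < 1` is only sound when `x + 1` is known not to overflow:

```java
public class FoldOverflowSketch {
    // The original comparison: "x + 1" is computed in 32-bit two's complement,
    // so x = Integer.MAX_VALUE wraps around to Integer.MIN_VALUE.
    static boolean unfolded(int x) { return x + 1 < 2; }

    // The folded comparison "x < 2 - 1": agrees with the form above
    // only on inputs where x + 1 does not overflow.
    static boolean folded(int x) { return x < 1; }
}
```

This is why the fold is always legal for `==` and `!=` (wrap-around shifts both sides by the same amount) but needs an overflow proof for ordered comparisons.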
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17853#discussion_r1515739956 PR Review Comment: https://git.openjdk.org/jdk/pull/17853#discussion_r1515741084 From epeter at openjdk.org Thu Mar 7 08:51:57 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 08:51:57 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v6] In-Reply-To: References: Message-ID: <0FmAElIq0aU0FRlcmmkflJN3Xtq4ZlgqfiBIm0L0Hko=.457cb674-37c3-4059-b13c-59e4421dd1b2@github.com> On Wed, 6 Mar 2024 06:13:02 GMT, Jasmine Karthikeyan wrote: >> Hi all, I've created this patch which aims to convert common integer minimum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`. While this transform doesn't generally improve code generation on its own, it simplifies control flow and creates new opportunities for vectorization. >> >> I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine:
>>
>> Baseline Patch Improvement
>> Benchmark Mode Cnt Score Error Units Score Error Units
>> IfMinMax.testReductionInt avgt 15 500.307 ± 16.687 ns/op 509.383 ± 32.645 ns/op (no change)*
>> IfMinMax.testReductionLong avgt 15 493.184 ± 17.596 ns/op 513.587 ± 28.339 ns/op (no change)*
>> IfMinMax.testSingleInt avgt 15 3.588 ± 0.540 ns/op 2.965 ± 1.380 ns/op (no change)
>> IfMinMax.testSingleLong avgt 15 3.673 ± 0.128 ns/op 3.506 ± 0.590 ns/op (no change)
>> IfMinMax.testVectorInt avgt 15 340.425 ± 13.123 ns/op 59.689 ± 7.509 ns/op + 5.7x
>> IfMinMax.testVectorLong avgt 15 326.420 ± 15.554 ns/op 117.190 ± 5.622 ns/op + 2.8x
>>
>> * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this?
>> >> The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Change transform to work on CMoves Nice work, I think this looks much better now! I'm currently a bit tight on time, I'll run the benchmark on my next pass ;) src/hotspot/share/opto/movenode.cpp line 189: > 187: > 188: // Try to identify min/max patterns in CMoves > 189: static Node* is_minmax(PhaseGVN* phase, Node* cmov) { I'm not a fan of `is_...` methods that do more than a check, but actually have a side-effect. I also suggest that `cmov` should already have a `CMovNode` type, and there should be an assert here. I would probably do it similarly to `AddNode::IdealIL` and `AddPNode::Ideal_base_and_offset`: call it `CMoveNode::IdealIL_minmax`. But add an assert to check for int or long. src/hotspot/share/opto/movenode.cpp line 322: > 320: if (phase->C->post_loop_opts_phase()) { > 321: return nullptr; > 322: } Putting the condition here would prevent any future optimization further down from being executed. I think you should rather put this into the `is_minmax` method. Maybe this condition is now only relevant for `long`, but I think it would not hurt to also have it for `int`, right? test/hotspot/jtreg/compiler/c2/irTests/TestIfMinMax.java line 139: > 137: public long testMaxL2E(long a, long b) { > 138: return a <= b ? b : a; > 139: } I assume some of the `long` patterns should also have become MaxL/MinL in some phase, right? Is there maybe some phase where the IR would actually show that? You can target the IR rule to a phase, I think. Would be worth a try. ------------- Changes requested by epeter (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/17574#pullrequestreview-1921796112 PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1515754879 PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1515765294 PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1515768176 From epeter at openjdk.org Thu Mar 7 08:51:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 08:51:58 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v6] In-Reply-To: <0FmAElIq0aU0FRlcmmkflJN3Xtq4ZlgqfiBIm0L0Hko=.457cb674-37c3-4059-b13c-59e4421dd1b2@github.com> References: <0FmAElIq0aU0FRlcmmkflJN3Xtq4ZlgqfiBIm0L0Hko=.457cb674-37c3-4059-b13c-59e4421dd1b2@github.com> Message-ID: <-qQPtIEWrm7eliUVSp6Jhzk4MQMrHQ8Y1zVViYKQ7w8=.6d7a66f4-9cf1-43e4-bbc3-482ee79f77dd@github.com> On Thu, 7 Mar 2024 08:38:42 GMT, Emanuel Peter wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Change transform to work on CMoves > > src/hotspot/share/opto/movenode.cpp line 189: > >> 187: >> 188: // Try to identify min/max patterns in CMoves >> 189: static Node* is_minmax(PhaseGVN* phase, Node* cmov) { > > I'm not a fan of `is_...` methods that do more than a check, but actually have a side-effect. > I also suggest that `cmov` should already have a `CMovNode` type, and there should be an assert here. > > I would probably do it similar to `AddNode::IdealIL` and `AddPNode::Ideal_base_and_offset`: call it `CMoveNode::IdealIL_minmax`. But add an assert to check for int or long. And then you could actually move the call to `CMoveNode::Ideal`. 
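For readers skimming the thread: the source-level shapes this transform recognizes look roughly like the following (a plain Java sketch with hypothetical method names; the actual transform works on CMove nodes in the C2 IR, not on source code):

```java
// The two if-based max shapes discussed in the RFR: a ternary and a
// conditional store. Both compute the same value as Math.max, which
// is what the proposed transform turns them into at the IR level.
public class IfMinMaxDemo {
    static int maxTernary(int a, int b) {
        return a > b ? a : b;        // `a > b ? a : b` pattern
    }

    static int maxIfStore(int a, int b) {
        int r = b;
        if (a > b) r = a;            // `if (a > b) b = a;` pattern
        return r;
    }

    public static void main(String[] args) {
        int[][] cases = {{1, 2}, {2, 1}, {-5, -5}, {Integer.MIN_VALUE, 7}};
        for (int[] c : cases) {
            int expected = Math.max(c[0], c[1]);
            if (maxTernary(c[0], c[1]) != expected
                    || maxIfStore(c[0], c[1]) != expected) {
                throw new AssertionError(c[0] + ", " + c[1]);
            }
        }
        System.out.println("ok");
    }
}
```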
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1515761304 From epeter at openjdk.org Thu Mar 7 08:51:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 08:51:58 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v6] In-Reply-To: <-qQPtIEWrm7eliUVSp6Jhzk4MQMrHQ8Y1zVViYKQ7w8=.6d7a66f4-9cf1-43e4-bbc3-482ee79f77dd@github.com> References: <0FmAElIq0aU0FRlcmmkflJN3Xtq4ZlgqfiBIm0L0Hko=.457cb674-37c3-4059-b13c-59e4421dd1b2@github.com> <-qQPtIEWrm7eliUVSp6Jhzk4MQMrHQ8Y1zVViYKQ7w8=.6d7a66f4-9cf1-43e4-bbc3-482ee79f77dd@github.com> Message-ID: <7vRLiWJ_2IIkKnFbdwNqg_fKT3WYuvj7YZCXcKx1cFE=.d4c1fd64-24cc-40e0-8707-4eed20a26135@github.com> On Thu, 7 Mar 2024 08:42:42 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/movenode.cpp line 189: >> >>> 187: >>> 188: // Try to identify min/max patterns in CMoves >>> 189: static Node* is_minmax(PhaseGVN* phase, Node* cmov) { >> >> I'm not a fan of `is_...` methods that do more than a check, but actually have a side-effect. >> I also suggest that `cmov` should already have a `CMovNode` type, and there should be an assert here. >> >> I would probably do it similar to `AddNode::IdealIL` and `AddPNode::Ideal_base_and_offset`: call it `CMoveNode::IdealIL_minmax`. But add an assert to check for int or long. > > And then you could actualy move the call to `CMoveNode::Ideal`. 
Who knows, maybe we one day extend this to other types. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1515761693 From gcao at openjdk.org Thu Mar 7 09:06:53 2024 From: gcao at openjdk.org (Gui Cao) Date: Thu, 7 Mar 2024 09:06:53 GMT Subject: RFR: 8327283: RISC-V: Minimal build failed after JDK-8319716 In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 12:20:43 GMT, Robbin Ehn wrote: >> The SHA intrinsics are only used in "LibraryCallKit::inline_digestBase_implCompress" and JVMCI. >> So I think these (plus md5 and chacha) should be put into an ifdef COMPILER2_OR_JVMCI block. (I was going to do that but it slipped my mind) >> >> The MaxVectorSize is defined if JVMCI and/or C2 is defined: >> `NOT_COMPILER2(product(intx, MaxVectorSize, 64,` > >> I agree with @robehn ! We can put those function definitions into an ifdef COMPILER2_OR_JVMCI block to avoid such a problem. I don't see other uses of them for now. > > This also makes it clear that C1/interpreter don't use them, hence if someone needs a speed up there they could try to make use of them. @robehn @RealFYang : Thanks all for the review ------------- PR Comment: https://git.openjdk.org/jdk/pull/18114#issuecomment-1983032275 From gcao at openjdk.org Thu Mar 7 09:16:57 2024 From: gcao at openjdk.org (Gui Cao) Date: Thu, 7 Mar 2024 09:16:57 GMT Subject: Integrated: 8327283: RISC-V: Minimal build failed after JDK-8319716 In-Reply-To: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> References: <7yhCYgpNLnSZ6P3nOaX5zPZ10oTxBLlcojF3aKifsIo=.470666a9-372e-447a-93af-80a3b651ff15@github.com> Message-ID: On Tue, 5 Mar 2024 07:41:05 GMT, Gui Cao wrote: > Hi, please review this patch that fixes the minimal build failure for riscv. 
> 
> Error log for minimal build: 
> 
> Creating support/modules_libs/java.base/minimal/libjvm.so from 591 file(s) 
> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': 
> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? 
> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 
> | ^~~~~~~~~~~~~ 
> | MaxNewSize 
> gmake[3]: *** [lib/CompileJvm.gmk:165: /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/hotspot/variant-minimal/libjvm/objs/stubGenerator_riscv.o] Error 1 
> gmake[3]: *** Waiting for unfinished jobs.... 
> gmake[2]: *** [make/Main.gmk:253: hotspot-minimal-libs] Error 2 
> gmake[2]: *** Waiting for unfinished jobs.... 
> 
> ERROR: Build failed for target 'images' in configuration 'linux-riscv64-minimal-fastdebug' (exit code 2) 
> 
> === Output from failing command(s) repeated here === 
> * For target hotspot_variant-minimal_libjvm_objs_stubGenerator_riscv.o: 
> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp: In member function 'u_char* StubGenerator::Sha2Generator::generate_sha2_implCompress(Assembler::SEW, bool)': 
> /home/zifeihan/jdk/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3938:41: error: 'MaxVectorSize' was not declared in this scope; did you mean 'MaxNewSize'? 
> 3938 | if (vset_sew == Assembler::e64 && MaxVectorSize == 16) { // SHA512 and VLEN = 128 
> | ^~~~~~~~~~~~~ 
> | MaxNewSize 
> 
> * All command lines available in /home/zifeihan/jdk/build/linux-riscv64-minimal-fastdebug/make-support/failure-logs. 
> === End of repeated output === 
> 
> No indication of failed target found. 
> HELP: Try searching the build log for '] Error'. 
> HELP: Run 'make doctor' to diagnose build problems. 
> 
> make[1]: *** [/home/zifeihan/jdk/make/Init.gmk:323: main] Error 2 
> make: *** [/home/zifeihan/jdk/make/Init.gmk:189: images] Error 2 
> 
> 
> The root cause is that MaxVectorSize is only defined under COMPILER2, we should use VM_Version::_initial_vector_length instead of MaxVectorSize. 
> 
> Testing: 
> 
> - [x] linux-riscv minimal fastdebug native build This pull request has now been integrated. Changeset: 12617405 Author: Gui Cao Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/1261740521e364cf40ca7ee160fc10c608d9ab71 Stats: 506 lines in 1 file changed: 253 ins; 252 del; 1 mod 8327283: RISC-V: Minimal build failed after JDK-8319716 Reviewed-by: fyang, rehn ------------- PR: https://git.openjdk.org/jdk/pull/18114 
Oussama Louati has updated the pull request incrementally with one additional commit since the last revision: Fix bytecode length calculation in GenFullCP.java and add new imports in ClassWriterExt.java ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17834/files - new: https://git.openjdk.org/jdk/pull/17834/files/03a5e325..c8315dea Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=01-02 Stats: 6 lines in 2 files changed: 3 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/17834.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834 PR: https://git.openjdk.org/jdk/pull/17834 From enikitin at openjdk.org Thu Mar 7 12:36:11 2024 From: enikitin at openjdk.org (Evgeny Nikitin) Date: Thu, 7 Mar 2024 12:36:11 GMT Subject: RFR: 8327390: JitTester: Implement temporary folder functionality Message-ID: The JITTester relies on standard OS / Java library functionality to create temporary folders and never cleans them. This creates problems in CI machines and also complicates problems investigation. We need to have a dedicated TempDir entity that we could adjust during problems investigations and development. It can also be a good place for various file-related activities, like executing FailureHandler. 
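To make the TempDir idea above concrete, here is a minimal sketch of such a helper using only `java.nio.file` (hypothetical code with made-up names, not the actual JitTester implementation):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Sketch of a temp-directory entity that cleans up after itself, so
// CI machines are not littered with leftover folders.
public class TempDirDemo {
    public static Path create(String prefix) throws IOException {
        Path dir = Files.createTempDirectory(prefix);
        // Best-effort cleanup at VM exit; during a problem investigation
        // this hook could be disabled to keep the folder for inspection.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> deleteRecursively(dir)));
        return dir;
    }

    static void deleteRecursively(Path root) {
        // Delete children before parents (reverse order of the walk).
        try (Stream<Path> walk = Files.walk(root)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try { Files.delete(p); } catch (IOException ignored) { }
            });
        } catch (IOException ignored) { }
    }

    public static void main(String[] args) throws IOException {
        Path dir = create("jittester-demo-");
        Files.writeString(dir.resolve("seed.txt"), "42");
        if (!Files.exists(dir.resolve("seed.txt"))) throw new AssertionError();
        deleteRecursively(dir);
        if (Files.exists(dir)) throw new AssertionError("not cleaned up");
        System.out.println("ok");
    }
}
```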
------------- Commit messages: - 8327390: JitTester: Implement temporary folder functionality Changes: https://git.openjdk.org/jdk/pull/18128/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18128&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327390 Stats: 76 lines in 4 files changed: 63 ins; 4 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/18128.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18128/head:pull/18128 PR: https://git.openjdk.org/jdk/pull/18128 From gli at openjdk.org Thu Mar 7 12:36:11 2024 From: gli at openjdk.org (Guoxiong Li) Date: Thu, 7 Mar 2024 12:36:11 GMT Subject: RFR: 8327390: JitTester: Implement temporary folder functionality In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 19:58:04 GMT, Evgeny Nikitin wrote: > The JITTester relies on standard OS / Java library functionality to create temporary folders and never cleans them. > > This creates problems in CI machines and also complicates problems investigation. We need to have a dedicated TempDir entity that we could adjust during problems investigations and development. It can also be a good place for various file-related activities, like executing FailureHandler. Looks good. And the issue is not a publicly visible issue. Please mark the issue as non-secret. ------------- Marked as reviewed by gli (Committer). PR Review: https://git.openjdk.org/jdk/pull/18128#pullrequestreview-1919323835 From lmesnik at openjdk.org Thu Mar 7 12:36:11 2024 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Thu, 7 Mar 2024 12:36:11 GMT Subject: RFR: 8327390: JitTester: Implement temporary folder functionality In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 19:58:04 GMT, Evgeny Nikitin wrote: > The JITTester relies on standard OS / Java library functionality to create temporary folders and never cleans them. > > This creates problems in CI machines and also complicates problems investigation. 
We need to have a dedicated TempDir entity that we could adjust during problems investigations and development. It can also be a good place for various file-related activities, like executing FailureHandler. Marked as reviewed by lmesnik (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/18128#pullrequestreview-1920425205 From duke at openjdk.org Thu Mar 7 13:56:07 2024 From: duke at openjdk.org (Oussama Louati) Date: Thu, 7 Mar 2024 13:56:07 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v4] In-Reply-To: References: Message-ID: > Completion of the first version of the migration for several tests. > > These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: > > - Generate constant pool entries filled with method handles and method types. > - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. > - Produce many invokedynamic instructions with a specific constant pool entry. 
Oussama Louati has updated the pull request incrementally with one additional commit since the last revision: Update imports in GenManyIndyCorrectBootstrap.java ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17834/files - new: https://git.openjdk.org/jdk/pull/17834/files/c8315dea..0ef0b28f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=02-03 Stats: 6 lines in 1 file changed: 0 ins; 3 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/17834.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834 PR: https://git.openjdk.org/jdk/pull/17834 From duke at openjdk.org Thu Mar 7 14:04:07 2024 From: duke at openjdk.org (Oussama Louati) Date: Thu, 7 Mar 2024 14:04:07 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v5] In-Reply-To: References: Message-ID: > Completion of the first version of the migration for several tests. > > These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: > > - Generate constant pool entries filled with method handles and method types. > - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. > - Produce many invokedynamic instructions with a specific constant pool entry. 
Oussama Louati has updated the pull request incrementally with one additional commit since the last revision: Fix typo in error message in GenManyIndyIncorrectBootstrap.java ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17834/files - new: https://git.openjdk.org/jdk/pull/17834/files/0ef0b28f..89292423 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=03-04 Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/17834.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834 PR: https://git.openjdk.org/jdk/pull/17834 From duke at openjdk.org Thu Mar 7 14:07:56 2024 From: duke at openjdk.org (Oussama Louati) Date: Thu, 7 Mar 2024 14:07:56 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v5] In-Reply-To: References: Message-ID: On Thu, 7 Mar 2024 14:04:07 GMT, Oussama Louati wrote: >> Completion of the first version of the migration for several tests. >> >> These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: >> >> - Generate constant pool entries filled with method handles and method types. >> - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. >> - Produce many invokedynamic instructions with a specific constant pool entry. > > Oussama Louati has updated the pull request incrementally with one additional commit since the last revision: > > Fix typo in error message in GenManyIndyIncorrectBootstrap.java I ran the JTreg test on this PR Head after full conversion of these tests, and nothing unusual happened; no failures were explicitly related to these changes. 
------------- PR Review: https://git.openjdk.org/jdk/pull/17834#pullrequestreview-1922528839 From rkennke at openjdk.org Thu Mar 7 14:39:53 2024 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 7 Mar 2024 14:39:53 GMT Subject: RFR: 8327361: Update some comments after JDK-8139457 [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 11:33:56 GMT, Galder Zamarreño wrote: >> Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: >> >> RISCV changes > > I think the changes look fine, but looking closer to the original PR, src/hotspot/cpu/riscv/c1_MacroAssembler_riscv.hpp might also need adjusting. s390 and ppc are probably just fine. @galderz is it ok now? I assume it counts as trivial, too? ------------- PR Comment: https://git.openjdk.org/jdk/pull/18120#issuecomment-1983641324 From roland at openjdk.org Thu Mar 7 14:51:58 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 7 Mar 2024 14:51:58 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v6] In-Reply-To: References: Message-ID: <8W7bn8q19_y3Jan9YSHlX_pvi6q_jllpLpTuHXhSjFw=.0b9fceb7-0a3b-4587-8969-23a997a2dd74@github.com> On Tue, 30 Jan 2024 14:43:41 GMT, Roland Westrelin wrote: >> @shipilev >> You are right, I need to guard the optimization with `UseUnalignedAccesses`. Just added it. Thank you! >> Probably my tests would have run into the `SIGBUS` you mentioned. >> >> About `InitializeNode::coalesce_subword_stores`: >> It only works on raw-stores, which write fields before the initialization of an object. It only works with constants. >> Hence, the pattern is quite different. >> Merging the two would be a lot of work. Too much for me for now. >> But maybe one day we can cover all these cases in a single optimization, that merges/coalesces all sorts of loads and stores, and essentially vectorizes any straight-line code, at least for loads and stores. 
>> For now, I just wanted to add the feature that @cl4es and @RogerRiggs were specifically asking for, which is merging array stores for constants and variables (using shift to split). >> >> @rwestrel >> Ok. Well in that case I might have to make a more intelligent pointer-analysis, and parse past `ConvI2L` and `CastII` nodes. > >> Ok. Well in that case I might have to make a more intelligent pointer-analysis, and parse past ConvI2L and CastII nodes. > > Do you still need a traversal of the graph to find the Stores or can you enqueue them for post loop opts then? > @rwestrel > > > Do you intend to add an IR test case? > > I already have IR tests that also do result verification: `test/hotspot/jtreg/compiler/c2/TestMergeStores.java` I missed it. I expected it in `irTests` subdirectory. Why isn't that the case BTW? ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1983668344 From roland at openjdk.org Thu Mar 7 14:58:57 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 7 Mar 2024 14:58:57 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: On Thu, 7 Mar 2024 07:45:13 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/memnode.cpp line 2971: >> >>> 2969: // The goal is to check if two such ArrayPointers are adjacent for a load or store. >>> 2970: // >>> 2971: // Note: we accumulate all constant offsets into constant_offset, even the int constant behind >> >> Is this really needed? For the patterns of interest, aren't the constant pushed down the chain of `AddP` nodes so the address is `(AddP base (AddP ...) constant)`? > > No, they are not pushed down. 
> Consider the access on an int array: > `a[invar + 1]` -> `adr = base + ARRAY_INT_BASE_OFFSET + 4 * ConvI2L(invar + 1)` > We cannot just push the constant `1` out of the `ConvI2L`, after all `invar + 1` could overflow in the int domain ;) That's not quite right, I think. For instance, in this method: private static int test(int[] array, int i) { return array[i + 1]; } the final IR will have the `(AddP base (AddP ...) constant)` because `ConvI2LNode::Ideal` does more than checking for overflow. The actual transformation to that final shape must be delayed until after the CastII nodes are removed though. Why that's the case is puzzling actually because `CastIINode::Ideal()` has logic to push the AddI thru the `CastII` but it's disabled for range check `CastII` nodes. I noticed this while working on 8324517. My recollection was that `ConvI2LNode::Ideal` would push thru both the `CastII` and `ConvI2L` in one go so I wonder if it got broken at some point. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1516290785 From epeter at openjdk.org Thu Mar 7 15:41:20 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 15:41:20 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v3] In-Reply-To: References: Message-ID: <7T93QS_MjoovUHDvfnq9az88QJ64dRcdZPpE9HUj5sw=.1d9563bb-e010-4b32-b635-f13f88c4f683@github.com> > Subtask of https://github.com/openjdk/jdk/pull/16620. > > **Goal** > > - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. > - Refactoring: replace linked-list edges with a compact array for each node. > - No behavioral change to vectorization. > > **Benchmark** > > I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). > All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, > ensuring that we spend a lot of time on the dependency graph compared to other components. 
> > Measured on `linux-x64` and turbo disabled. > > Measuring Compile time difference: > `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` > > TestGraph.java > > public class TestGraph { > static int RANGE = 100_000; > > public static void main(String[] args) { > int[] a = new int[RANGE]; > int[] b = new int[RANGE]; > for (int i = 0; i < 10_000; i++) { > test1(a, b, i % 100); > } > } > > static void test1(int[] a, int[] b, int offset) { > for (int i = 0; i < RANGE/16-200; i++) { > a[i * 16 + 0] = b[i * 16 + 0 + offset]; > a[i * 16 + 1] = b[i * 16 + 1 + offset]; > a[i * 16 + 2] = b[i * 16 + 2 + offset]; > a[i * 16 + 3] = b[i * 16 + 3 + offset]; > a[i * 16 + 4] = b[i * 16 + 4 + offset]; > a[i * 16 + 5] = b[i * 16 + 5 + offset]; > a[i * 16 + 6] = b[i * 16 + 6 + offset]; > a[i * 16 + 7] = b[i * 16 + 7 + offset]; > a[i * 16 + 8] = b[i * 16 + 8 + offset]; > a[i * 16 + 9] = b[i * 16 + 9 + offset]; > a[i * 16 + 10] = b[i * 16 + 10 + offset]; > a[i * 16 + 11] = b[i * 16 + 11 + offset]; > a[i * 16 + 12] = b[i * 16 + 12 + offset]; > a[i * 16 + 13] = b[i * 16 + 13 + offset]; > a[i * 16 + 14] = b[i * 16 + 14 + offset]; > a[i * 16 + 15] = b[i * 16 + 15 + offset]; > } > } > } > > > > Before: > > C2 Compile Time: 14.588 s > ... > IdealLoop: 13.670 s > AutoVectorize: 11.703 s``` > > After: > > C2 Compile Time: 14.468 s > ... > IdealLoop: 13.595 s > AutoVectorize: 11.539 s > > > Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print` > Before: `8_225_680 - 8_258_408` byt... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: missing string Extra -> Memory change ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17812/files - new: https://git.openjdk.org/jdk/pull/17812/files/67def8d2..31b65c6c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/17812.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17812/head:pull/17812 PR: https://git.openjdk.org/jdk/pull/17812 From roland at openjdk.org Thu Mar 7 15:29:00 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 7 Mar 2024 15:29:00 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: <4xUS7qBreZ6-cAbHSsVRB0u8Nr_MQa3SdrGiG33Nkw4=.6dad9868-ca3d-4617-bfc1-911df4ed7c2d@github.com> References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> <4xUS7qBreZ6-cAbHSsVRB0u8Nr_MQa3SdrGiG33Nkw4=.6dad9868-ca3d-4617-bfc1-911df4ed7c2d@github.com> Message-ID: On Thu, 7 Mar 2024 06:53:21 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/memnode.cpp line 3146: >> >>> 3144: Node* ctrl_s1 = s1->in(MemNode::Control); >>> 3145: Node* ctrl_s2 = s2->in(MemNode::Control); >>> 3146: if (ctrl_s1 != ctrl_s2) { >> >> Do you need to check that `ctrl_s1` and `ctrl_s2` are not null? I suppose this could be called on a dying part of the graph during igvn. > > @rwestrel but then would they not be `TOP` rather than `nullptr`? Maybe. I think the current practice is to be extra careful and assume any input can be null during igvn. What do you think @vnkozlov ? 
>> src/hotspot/share/opto/memnode.cpp line 3154: >> >>> 3152: } >>> 3153: ProjNode* other_proj = ctrl_s1->as_IfProj()->other_if_proj(); >>> 3154: if (other_proj->is_uncommon_trap_proj(Deoptimization::Reason_range_check) == nullptr || >> >> This could be a range check for an unrelated array I suppose. Does it matter? > > I don't think it matters, no. Do you see a scenario where it would matter? > > My argument: > It is safe to do the stores after the RC rather than before it. And if the RC trap relies on the memory state of the stores that were before the RC, then those stores simply don't lose all their uses, and stay in the graph. > After all, we only remove the "last" store by replacing it with the merged store, so the other stores only disappear if they have no other use. Is there a chance then that we store to the same element twice (once with the store that we wanted to remove but haven't and the merged store)? I don't think repeated stores like this happen anywhere else as a result of some transformation. Would it be legal wrt the java specs? Can it be observed from some other thread? I think it would be better to not have to answer these questions and find a way to do the transformation in a way that guarantees the same element is not stored to twice. Can the transformation be delayed until range check smearing has done its job? 
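For readers following the merge-stores discussion: the source-level shape the optimization targets is a run of adjacent small stores of one split value, roughly like this (a hedged Java sketch with hypothetical helper names; little-endian byte order is chosen here purely for illustration, and the PR performs the merge on the IR, not on source):

```java
// Sketch of the pattern targeted by 8318446: four adjacent byte
// stores of a split int, which C2 could replace with one 4-byte
// store. The Java-level semantics are unchanged either way.
public class MergeStoresDemo {
    static void putIntLE(byte[] a, int off, int v) {
        a[off]     = (byte)  v;          // four small stores that the
        a[off + 1] = (byte) (v >>  8);   // optimization would merge
        a[off + 2] = (byte) (v >> 16);   // into a single wider store
        a[off + 3] = (byte) (v >> 24);
    }

    static int getIntLE(byte[] a, int off) {
        return (a[off] & 0xFF)
             | (a[off + 1] & 0xFF) <<  8
             | (a[off + 2] & 0xFF) << 16
             | (a[off + 3] & 0xFF) << 24;
    }

    public static void main(String[] args) {
        byte[] a = new byte[8];
        putIntLE(a, 2, 0xCAFEBABE);
        if (getIntLE(a, 2) != 0xCAFEBABE) throw new AssertionError();
        System.out.println("ok");
    }
}
```

The concern raised above is precisely about this shape: if a range check sits between two of the byte stores, merging must not introduce a write that the original program would not have performed on that path.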
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1516355973 PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1516352600 From chagedorn at openjdk.org Thu Mar 7 15:57:02 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 7 Mar 2024 15:57:02 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v3] In-Reply-To: <7T93QS_MjoovUHDvfnq9az88QJ64dRcdZPpE9HUj5sw=.1d9563bb-e010-4b32-b635-f13f88c4f683@github.com> References: <7T93QS_MjoovUHDvfnq9az88QJ64dRcdZPpE9HUj5sw=.1d9563bb-e010-4b32-b635-f13f88c4f683@github.com> Message-ID: On Thu, 7 Mar 2024 15:41:20 GMT, Emanuel Peter wrote: >> Subtask of https://github.com/openjdk/jdk/pull/16620. >> >> **Goal** >> >> - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. >> - Refactoring: replace linked-list edges with a compact array for each node. >> - No behavioral change to vectorization. >> >> **Benchmark** >> >> I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). >> All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, >> ensuring that we spend a lot of time on the dependency graph compared to other components. >> >> Measured on `linux-x64` and turbo disabled. 
>> >> Measuring Compile time difference: >> `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` >> >> TestGraph.java >> >> public class TestGraph { >> static int RANGE = 100_000; >> >> public static void main(String[] args) { >> int[] a = new int[RANGE]; >> int[] b = new int[RANGE]; >> for (int i = 0; i < 10_000; i++) { >> test1(a, b, i % 100); >> } >> } >> >> static void test1(int[] a, int[] b, int offset) { >> for (int i = 0; i < RANGE/16-200; i++) { >> a[i * 16 + 0] = b[i * 16 + 0 + offset]; >> a[i * 16 + 1] = b[i * 16 + 1 + offset]; >> a[i * 16 + 2] = b[i * 16 + 2 + offset]; >> a[i * 16 + 3] = b[i * 16 + 3 + offset]; >> a[i * 16 + 4] = b[i * 16 + 4 + offset]; >> a[i * 16 + 5] = b[i * 16 + 5 + offset]; >> a[i * 16 + 6] = b[i * 16 + 6 + offset]; >> a[i * 16 + 7] = b[i * 16 + 7 + offset]; >> a[i * 16 + 8] = b[i * 16 + 8 + offset]; >> a[i * 16 + 9] = b[i * 16 + 9 + offset]; >> a[i * 16 + 10] = b[i * 16 + 10 + offset]; >> a[i * 16 + 11] = b[i * 16 + 11 + offset]; >> a[i * 16 + 12] = b[i * 16 + 12 + offset]; >> a[i * 16 + 13] = b[i * 16 + 13 + offset]; >> a[i * 16 + 14] = b[i * 16 + 14 + offset]; >> a[i * 16 + 15] = b[i * 16 + 15 + offset]; >> } >> } >> } >> >> >> >> Before: >> >> C2 Compile Time: 14.588 s >> ... >> IdealLoop: 13.670 s >> AutoVectorize: 11.703 s``` >> >> After: >> >> C2 Compile Time: 14.468 s >> ... >> IdealLoop: 13.595 s >> ... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > missing string Extra -> Memory change That's a nice refactoring. I only have some small comments. src/hotspot/share/opto/vectorization.hpp line 456: > 454: class VLoopDependencyGraph : public StackObj { > 455: private: > 456: class DependencyNode; I'm not sure if we should declare classes in the middle of another class. 
Should we move the forward declaration to the top of the file as done in other places as well? src/hotspot/share/opto/vectorization.hpp line 467: > 465: > 466: // Node depth in DAG: bb_idx -> depth > 467: GrowableArray _depth; Suggestion: GrowableArray _depths; src/hotspot/share/opto/vectorization.hpp line 469: > 467: GrowableArray _depth; > 468: > 469: protected: Why is this protected? src/hotspot/share/opto/vectorization.hpp line 545: > 543: void next(); > 544: bool done() const { return _current == nullptr; } > 545: Node* current() const { assert(!done(), "not done yet"); return _current; } For two statements, I suggest going with multiple lines: Suggestion: Node* current() const { assert(!done(), "not done yet"); return _current; } ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17812#pullrequestreview-1922553044 PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1516224804 PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1516230789 PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1516231365 PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1516412346 From epeter at openjdk.org Thu Mar 7 15:56:04 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 15:56:04 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: <1j-xpH8yy_BR50jGFKAU1bGQP2M8nlnN4kTQ70xq-7M=.bb619092-7db8-4cbd-bed9-8ebfa92a8fb4@github.com> On Tue, 5 Mar 2024 15:55:12 GMT, Emanuel Peter wrote: >> This is a feature requested by @RogerRiggs and @cl4es . >> >> **Idea** >> >> Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. 
one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code. >> >> This patch here supports a few simple use-cases, like these: >> >> Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395 >> >> Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456 >> >> The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian): >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338 >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472 >> >> **Details** >> >> This draft currently implements the optimization in an additional special IGVN phase: >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485 >> >> We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. 
We essentially try to establish a chain of mergable stores: >> >> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806 >> >> Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either bot... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > a little bit of casting for debug printing code > > Is there a chance then that we store to the same element twice ... Would it be legal wrt the java specs? > > AFAIU, introducing writes that do not exist in original program is an easy way to break JMM conformance. If we merge the writes, we have to make sure the old writes are not done. You _need_ to run jcstress on this change, at very least. Ok. I have never heard of jcstress. But will look into it. Maybe I need to do some more careful checks to ensure that on the merged path there is only the merged store, and the other stores sink into the other paths. More complicated than I thought... ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1983815904 From chagedorn at openjdk.org Thu Mar 7 15:57:04 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 7 Mar 2024 15:57:04 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v2] In-Reply-To: <0ngHbfu0p0-3CdGMe9393YGxCsR9w2vpuqa4WdtZc3s=.ec178db2-87de-47c0-aa8f-2bd1d2e818ef@github.com> References: <0ngHbfu0p0-3CdGMe9393YGxCsR9w2vpuqa4WdtZc3s=.ec178db2-87de-47c0-aa8f-2bd1d2e818ef@github.com> Message-ID: On Thu, 7 Mar 2024 15:33:21 GMT, Emanuel Peter wrote: >> Subtask of https://github.com/openjdk/jdk/pull/16620. >> >> **Goal** >> >> - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. 
>> - Refactoring: replace linked-list edges with a compact array for each node. >> - No behavioral change to vectorization. >> >> **Benchmark** >> >> I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). >> All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, >> ensuring that we spend a lot of time on the dependency graph compared to other components. >> >> Measured on `linux-x64` and turbo disabled. >> >> Measuring Compile time difference: >> `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` >> >> TestGraph.java >> >> public class TestGraph { >> static int RANGE = 100_000; >> >> public static void main(String[] args) { >> int[] a = new int[RANGE]; >> int[] b = new int[RANGE]; >> for (int i = 0; i < 10_000; i++) { >> test1(a, b, i % 100); >> } >> } >> >> static void test1(int[] a, int[] b, int offset) { >> for (int i = 0; i < RANGE/16-200; i++) { >> a[i * 16 + 0] = b[i * 16 + 0 + offset]; >> a[i * 16 + 1] = b[i * 16 + 1 + offset]; >> a[i * 16 + 2] = b[i * 16 + 2 + offset]; >> a[i * 16 + 3] = b[i * 16 + 3 + offset]; >> a[i * 16 + 4] = b[i * 16 + 4 + offset]; >> a[i * 16 + 5] = b[i * 16 + 5 + offset]; >> a[i * 16 + 6] = b[i * 16 + 6 + offset]; >> a[i * 16 + 7] = b[i * 16 + 7 + offset]; >> a[i * 16 + 8] = b[i * 16 + 8 + offset]; >> a[i * 16 + 9] = b[i * 16 + 9 + offset]; >> a[i * 16 + 10] = b[i * 16 + 10 + offset]; >> a[i * 16 + 11] = b[i * 16 + 11 + offset]; >> a[i * 16 + 12] = b[i * 16 + 12 + offset]; >> a[i * 16 + 13] = b[i * 16 + 13 + offset]; >> a[i * 16 + 14] = b[i * 16 + 14 + offset]; >> a[i * 16 + 15] = b[i * 16 + 15 + offset]; >> } >> } >> } >> >> >> >> Before: >> >> C2 Compile Time: 14.588 s >> ... >> IdealLoop: 13.670 s >> AutoVectorize: 11.703 s``` >> >> After: >> >> C2 Compile Time: 14.468 s >> ... 
>> IdealLoop: 13.595 s >> ... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > rename extra -> memory src/hotspot/share/opto/vectorization.cpp line 225: > 223: DependencyNode* dn = new (_arena) DependencyNode(n, memory_pred_edges, _arena); > 224: _dependency_nodes.at_put_grow(_body.bb_idx(n), dn, nullptr); > 225: } The call to `add_node()` suggests that we add a node no matter what. I therefore suggest to either change `add_node` to something like `maybe_add_node` or do the check like that: if (memory_pred_edges.is_nonempty()) { add_node(n1, memory_pred_edges); } src/hotspot/share/opto/vectorization.cpp line 285: > 283: _memory_pred_edges(nullptr) > 284: { > 285: assert(memory_pred_edges.length() > 0, "not empty"); Suggestion: assert(memory_pred_edges.is_nonempty(), "not empty"); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1516366524 PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1516376416 From epeter at openjdk.org Thu Mar 7 15:33:21 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 15:33:21 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v2] In-Reply-To: References: Message-ID: <0ngHbfu0p0-3CdGMe9393YGxCsR9w2vpuqa4WdtZc3s=.ec178db2-87de-47c0-aa8f-2bd1d2e818ef@github.com> > Subtask of https://github.com/openjdk/jdk/pull/16620. > > **Goal** > > - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. > - Refactoring: replace linked-list edges with a compact array for each node. > - No behavioral change to vectorization. > > **Benchmark** > > I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). > All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, > ensuring that we spend a lot of time on the dependency graph compared to other components. 
> > Measured on `linux-x64` and turbo disabled. > > Measuring Compile time difference: > `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` > > TestGraph.java > > public class TestGraph { > static int RANGE = 100_000; > > public static void main(String[] args) { > int[] a = new int[RANGE]; > int[] b = new int[RANGE]; > for (int i = 0; i < 10_000; i++) { > test1(a, b, i % 100); > } > } > > static void test1(int[] a, int[] b, int offset) { > for (int i = 0; i < RANGE/16-200; i++) { > a[i * 16 + 0] = b[i * 16 + 0 + offset]; > a[i * 16 + 1] = b[i * 16 + 1 + offset]; > a[i * 16 + 2] = b[i * 16 + 2 + offset]; > a[i * 16 + 3] = b[i * 16 + 3 + offset]; > a[i * 16 + 4] = b[i * 16 + 4 + offset]; > a[i * 16 + 5] = b[i * 16 + 5 + offset]; > a[i * 16 + 6] = b[i * 16 + 6 + offset]; > a[i * 16 + 7] = b[i * 16 + 7 + offset]; > a[i * 16 + 8] = b[i * 16 + 8 + offset]; > a[i * 16 + 9] = b[i * 16 + 9 + offset]; > a[i * 16 + 10] = b[i * 16 + 10 + offset]; > a[i * 16 + 11] = b[i * 16 + 11 + offset]; > a[i * 16 + 12] = b[i * 16 + 12 + offset]; > a[i * 16 + 13] = b[i * 16 + 13 + offset]; > a[i * 16 + 14] = b[i * 16 + 14 + offset]; > a[i * 16 + 15] = b[i * 16 + 15 + offset]; > } > } > } > > > > Before: > > C2 Compile Time: 14.588 s > ... > IdealLoop: 13.670 s > AutoVectorize: 11.703 s``` > > After: > > C2 Compile Time: 14.468 s > ... > IdealLoop: 13.595 s > AutoVectorize: 11.539 s > > > Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print` > Before: `8_225_680 - 8_258_408` byt... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: rename extra -> memory ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17812/files - new: https://git.openjdk.org/jdk/pull/17812/files/08b8df3f..67def8d2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=00-01 Stats: 31 lines in 2 files changed: 0 ins; 0 del; 31 mod Patch: https://git.openjdk.org/jdk/pull/17812.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17812/head:pull/17812 PR: https://git.openjdk.org/jdk/pull/17812 From epeter at openjdk.org Thu Mar 7 16:05:58 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 16:05:58 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: On Thu, 7 Mar 2024 14:55:53 GMT, Roland Westrelin wrote: >> No, they are not pushed down. >> Consider the access on an int array: >> `a[invar + 1]` -> `adr = base + ARRAY_INT_BASE_OFFSET + 4 * ConvI2L(invar + 1)` >> We cannot just push the constant `1` out of the `ConvI2L`, after all `invar + 1` could overflow in the int domain ;) > > That's not quite right, I think. For instance, in this method: > > private static int test(int[] array, int i) { > return array[i + 1]; > } > > the final IR will have the `(AddP base (AddP ...) constant)` because `ConvI2LNode::Ideal` does more than checking for overflow. The actual transformation to that final shape must be delayed until after the CastII nodes are removed though. Why that's the case is puzzling actually because `CastIINode::Ideal()` has logic to push the AddI thru the `CastII` but it's disabled for range check `CastII` nodes. I noticed this while working on 8324517. 
My recollection was that `ConvI2LNode::Ideal` would push thru both the `CastII` and `ConvI2L` in one go so I wonder if it got broken at some point. Thanks for info, I'll look into this :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1516427023 From shade at openjdk.org Thu Mar 7 15:38:57 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 7 Mar 2024 15:38:57 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> <4xUS7qBreZ6-cAbHSsVRB0u8Nr_MQa3SdrGiG33Nkw4=.6dad9868-ca3d-4617-bfc1-911df4ed7c2d@github.com> Message-ID: On Thu, 7 Mar 2024 15:23:45 GMT, Roland Westrelin wrote: >> I don't think it matters, no. Do you see a scenario where it would matter? >> >> My argument: >> It is safe to do the stores after the RC rather than before it. And if the RC trap relies on the memory state of the stores that were before the RC, then those stores simply don't lose all their uses, and stay in the graph. >> After all, we only remove the "last" store by replacing it with the merged store, so the other stores only disappear if they have no other use. > > Is there a chance then that we store to the same element twice (once with the store that we wanted to remove but haven't and the merged store)? I don't think repeated stores like this happen anywhere else as a result of some transformation. Would it be legal wrt the java specs? Can it be observed from some other thread? I think it would be better to not have to answer these questions and find a way to do the transformation in a way that guarantees the same element is not stored to twice. > Can the transformation be delayed until range check smearing has done its job? > Is there a chance then that we store to the same element twice ... Would it be legal wrt the java specs? 
AFAIU, introducing writes that do not exist in original program is an easy way to break JMM conformance. If we merge the writes, we have to make sure the old writes are not done. You _need_ to run jcstress on this change, at very least. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1516374382 From chagedorn at openjdk.org Thu Mar 7 16:17:59 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 7 Mar 2024 16:17:59 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v2] In-Reply-To: References: <0ngHbfu0p0-3CdGMe9393YGxCsR9w2vpuqa4WdtZc3s=.ec178db2-87de-47c0-aa8f-2bd1d2e818ef@github.com> Message-ID: On Thu, 7 Mar 2024 15:31:23 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> rename extra -> memory > > src/hotspot/share/opto/vectorization.cpp line 225: > >> 223: DependencyNode* dn = new (_arena) DependencyNode(n, memory_pred_edges, _arena); >> 224: _dependency_nodes.at_put_grow(_body.bb_idx(n), dn, nullptr); >> 225: } > > The call to `add_node()` suggests that we add a node no matter what. I therefore suggest to either change `add_node` to something like `maybe_add_node` or do the check like that: > > if (memory_pred_edges.is_nonempty()) { > add_node(n1, memory_pred_edges); > } For completeness, should we also add a comment here or/and at `DependencyNode` that such a node is only created when there is no direct connection in the C2 memory graph since we would visit direct connections in `PredsIterator` anyways? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1516443249 From epeter at openjdk.org Thu Mar 7 17:07:59 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Mar 2024 17:07:59 GMT Subject: RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v17] In-Reply-To: References: <--46Po8FyZv3RkkRIoWUSrWAnCP-9LilMkYuyZj8jyk=.1d6a0c62-f8b9-4ffe-b0f2-bd6cba298f55@github.com> Message-ID: On Thu, 7 Mar 2024 16:03:20 GMT, Emanuel Peter wrote: >> That's not quite right, I think. For instance, in this method: >> >> private static int test(int[] array, int i) { >> return array[i + 1]; >> } >> >> the final IR will have the `(AddP base (AddP ...) constant)` because `ConvI2LNode::Ideal` does more than checking for overflow. The actual transformation to that final shape must be delayed until after the CastII nodes are removed though. Why that's the case is puzzling actually because `CastIINode::Ideal()` has logic to push the AddI thru the `CastII` but it's disabled for range check `CastII` nodes. I noticed this while working on 8324517. My recollection was that `ConvI2LNode::Ideal` would push thru both the `CastII` and `ConvI2L` in one go so I wonder if it got broken at some point. > > Thanks for info, I'll look into this :) Ah, I see what you are saying. The `AddI` can be pushed through the `ConvI2L`, but only because we know that the types are constrained. The types are constrained because of the `CastII` after the `RangeCheck`. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1516525702 From enikitin at openjdk.org Thu Mar 7 17:15:02 2024 From: enikitin at openjdk.org (Evgeny Nikitin) Date: Thu, 7 Mar 2024 17:15:02 GMT Subject: Integrated: 8327390: JitTester: Implement temporary folder functionality In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 19:58:04 GMT, Evgeny Nikitin wrote: > The JITTester relies on standard OS / Java library functionality to create temporary folders and never cleans them. > > This creates problems in CI machines and also complicates problems investigation. We need to have a dedicated TempDir entity that we could adjust during problems investigations and development. It can also be a good place for various file-related activities, like executing FailureHandler. This pull request has now been integrated. Changeset: 5aae8030 Author: Evgeny Nikitin Committer: Leonid Mesnik URL: https://git.openjdk.org/jdk/commit/5aae80304c0b1b49341777b9da103638183877d5 Stats: 76 lines in 4 files changed: 63 ins; 4 del; 9 mod 8327390: JitTester: Implement temporary folder functionality Reviewed-by: gli, lmesnik ------------- PR: https://git.openjdk.org/jdk/pull/18128 From duke at openjdk.org Thu Mar 7 17:25:06 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 7 Mar 2024 17:25:06 GMT Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v3] In-Reply-To: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: > The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs. > > Below is the performance data on an Intel Tiger Lake machine. 
>
> Benchmark | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
> -- | -- | -- | --
> MathBench.ceilDouble | 547979 | 2170198 | 3.96
> MathBench.floorDouble | 547979 | 2167459 | 3.96
> MathBench.rintDouble | 547962 | 2130499 | 3.89

Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:

  update implementation for avx=0

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/18089/files
  - new: https://git.openjdk.org/jdk/pull/18089/files/0401e18e..15b36013

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18089&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18089&range=01-02

Stats: 8 lines in 2 files changed: 8 ins; 0 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/18089.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/18089/head:pull/18089

PR: https://git.openjdk.org/jdk/pull/18089

From sviswanathan at openjdk.org Thu Mar 7 18:12:54 2024
From: sviswanathan at openjdk.org (Sandhya Viswanathan)
Date: Thu, 7 Mar 2024 18:12:54 GMT
Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v3]
In-Reply-To: References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
Message-ID: 

On Thu, 7 Mar 2024 17:25:06 GMT, Srinivas Vamsi Parasa wrote:

>> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs.
>>
>> Below is the performance data on an Intel Tiger Lake machine.
>>
>> Benchmark (UseAVX=3) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
>> -- | -- | -- | --
>> MathBench.ceilDouble | 547979 | 2170198 | 3.96
>> MathBench.floorDouble | 547979 | 2167459 | 3.96
>> MathBench.rintDouble | 547962 | 2130499 | 3.89
>> MathBench.addCeilFloorDouble | 501366 | 1754260 | 3.50
>>
>> Benchmark (UseAVX=0) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
>> -- | -- | -- | --
>> MathBench.ceilDouble | 548492 | 2193497 | 4.00
>> MathBench.floorDouble | 548485 | 2192813 | 4.00
>> MathBench.rintDouble | 548488 | 2192578 | 4.00
>> MathBench.addCeilFloorDouble | 501761 | 1644714 | 3.28

> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   update implementation for avx=0

src/hotspot/cpu/x86/x86.ad line 3878:

> 3876: assert(UseSSE >= 4, "required");
> 3877: if ((UseAVX == 0) && ($dst$$XMMRegister != $src$$XMMRegister)) {
> 3878: __ pxor($dst$$XMMRegister, $dst$$XMMRegister);

Please fix the indentation here.
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18089#discussion_r1516612031

From jkarthikeyan at openjdk.org Thu Mar 7 18:17:55 2024
From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan)
Date: Thu, 7 Mar 2024 18:17:55 GMT
Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v6]
In-Reply-To: <0FmAElIq0aU0FRlcmmkflJN3Xtq4ZlgqfiBIm0L0Hko=.457cb674-37c3-4059-b13c-59e4421dd1b2@github.com>
References: <0FmAElIq0aU0FRlcmmkflJN3Xtq4ZlgqfiBIm0L0Hko=.457cb674-37c3-4059-b13c-59e4421dd1b2@github.com>
Message-ID: 

On Thu, 7 Mar 2024 08:45:39 GMT, Emanuel Peter wrote:

>> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Change transform to work on CMoves

> src/hotspot/share/opto/movenode.cpp line 322:
>
>> 320: if (phase->C->post_loop_opts_phase()) {
>> 321: return nullptr;
>> 322: }
>
> Putting the condition here would prevent any future optimization further down from being executed. I think you should rather put this into the `is_minmax` method. Maybe this condition is now only relevant for `long`, but I think it would not hurt to also have it for `int`, right?

That's a good point, I think this will make the logic cleaner. I don't think it'll hurt to have it for int either.

> test/hotspot/jtreg/compiler/c2/irTests/TestIfMinMax.java line 139:
>
>> 137: public long testMaxL2E(long a, long b) {
>> 138: return a <= b ? b : a;
>> 139: }
>
> I assume some of the `long` patterns should also have become MaxL/MinL in some phase, right? Is there maybe some phase where the IR would actually show that? You can target the IR rule to a phase, I think. Would be worth a try.

Oh true, I think we can identify MinL/MaxL before macro expansion is done.
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1516615194
PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1516616923

From duke at openjdk.org Thu Mar 7 18:25:23 2024
From: duke at openjdk.org (Srinivas Vamsi Parasa)
Date: Thu, 7 Mar 2024 18:25:23 GMT
Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v4]
In-Reply-To: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
Message-ID: 

> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs.
>
> Below is the performance data on an Intel Tiger Lake machine.
>
> Benchmark (UseAVX=3) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
> -- | -- | -- | --
> MathBench.ceilDouble | 547979 | 2170198 | 3.96
> MathBench.floorDouble | 547979 | 2167459 | 3.96
> MathBench.rintDouble | 547962 | 2130499 | 3.89
> MathBench.addCeilFloorDouble | 501366 | 1754260 | 3.50
>
> Benchmark (UseAVX=0) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
> -- | -- | -- | --
> MathBench.ceilDouble | 548492 | 2193497 | 4.00
> MathBench.floorDouble | 548485 | 2192813 | 4.00
> MathBench.rintDouble | 548488 | 2192578 | 4.00
> MathBench.addCeilFloorDouble | 501761 | 1644714 | 3.28

Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:

  fix indendation

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/18089/files
  - new: https://git.openjdk.org/jdk/pull/18089/files/15b36013..d35951f6

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18089&range=03
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18089&range=02-03

Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/18089.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/18089/head:pull/18089

PR: https://git.openjdk.org/jdk/pull/18089

From jkarthikeyan at openjdk.org Thu Mar 7 18:31:00 2024
From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan)
Date: Thu, 7 Mar 2024 18:31:00 GMT
Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v6]
In-Reply-To: <7vRLiWJ_2IIkKnFbdwNqg_fKT3WYuvj7YZCXcKx1cFE=.d4c1fd64-24cc-40e0-8707-4eed20a26135@github.com>
References: <0FmAElIq0aU0FRlcmmkflJN3Xtq4ZlgqfiBIm0L0Hko=.457cb674-37c3-4059-b13c-59e4421dd1b2@github.com> <-qQPtIEWrm7eliUVSp6Jhzk4MQMrHQ8Y1zVViYKQ7w8=.6d7a66f4-9cf1-43e4-bbc3-482ee79f77dd@github.com> <7vRLiWJ_2IIkKnFbdwNqg_fKT3WYuvj7YZCXcKx1cFE=.d4c1fd64-24cc-40e0-8707-4eed20a26135@github.com>
Message-ID: 

On Thu, 7 Mar 2024 08:43:00 GMT, Emanuel Peter wrote:

>> And then you could actually move the call to `CMoveNode::Ideal`.
>
> Who knows, maybe we one day extend this to other types

I think moving the call to `CMoveNode::Ideal` would be a good idea, since it de-duplicates the call site. Would it still assert on non-supported types, then? I think it may make more sense if it simply filtered out the cmov types that it doesn't (currently) support.
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17574#discussion_r1516631304

From sviswanathan at openjdk.org Thu Mar 7 19:06:54 2024
From: sviswanathan at openjdk.org (Sandhya Viswanathan)
Date: Thu, 7 Mar 2024 19:06:54 GMT
Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v4]
In-Reply-To: References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
Message-ID: 

On Thu, 7 Mar 2024 18:25:23 GMT, Srinivas Vamsi Parasa wrote:

>> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs.
>>
>> Below is the performance data on an Intel Tiger Lake machine.
>>
>> Benchmark (UseAVX=3) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
>> -- | -- | -- | --
>> MathBench.ceilDouble | 547979 | 2170198 | 3.96
>> MathBench.floorDouble | 547979 | 2167459 | 3.96
>> MathBench.rintDouble | 547962 | 2130499 | 3.89
>> MathBench.addCeilFloorDouble | 501366 | 1754260 | 3.50
>>
>> Benchmark (UseAVX=0) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
>> -- | -- | -- | --
>> MathBench.ceilDouble | 548492 | 2193497 | 4.00
>> MathBench.floorDouble | 548485 | 2192813 | 4.00
>> MathBench.rintDouble | 548488 |
2192578 | 4.00
>> MathBench.addCeilFloorDouble | 501761 | 1644714 | 3.28

> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   fix indendation

@vamsi-parasa Thanks for these additional changes, it looks good to me.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18089#issuecomment-1984235237

From dlong at openjdk.org Thu Mar 7 19:32:55 2024
From: dlong at openjdk.org (Dean Long)
Date: Thu, 7 Mar 2024 19:32:55 GMT
Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v4]
In-Reply-To: References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
Message-ID: 

On Thu, 7 Mar 2024 18:25:23 GMT, Srinivas Vamsi Parasa wrote:

>> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs.
>>
>> Below is the performance data on an Intel Tiger Lake machine.
>>
>> Benchmark (UseAVX=3) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
>> -- | -- | -- | --
>> MathBench.ceilDouble | 547979 | 2170198 | 3.96
>> MathBench.floorDouble | 547979 | 2167459 | 3.96
>> MathBench.rintDouble | 547962 | 2130499 | 3.89
>> MathBench.addCeilFloorDouble | 501366 | 1754260 | 3.50
>>
>> Benchmark (UseAVX=0) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup
>> -- | -- | -- | --
>> MathBench.ceilDouble | 548492 | 2193497 | 4.00
>> MathBench.floorDouble | 548485 | 2192813 | 4.00
>> MathBench.rintDouble | 548488 | 2192578 | 4.00
>> MathBench.addCeilFloorDouble | 501761 | 1644714 | 3.28

> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   fix indendation

Marked as reviewed by dlong (Reviewer).

-------------

PR Review: https://git.openjdk.org/jdk/pull/18089#pullrequestreview-1923352759

From duke at openjdk.org Thu Mar 7 21:47:57 2024
From: duke at openjdk.org (Srinivas Vamsi Parasa)
Date: Thu, 7 Mar 2024 21:47:57 GMT
Subject: Integrated: 8327147: Improve performance of Math ceil, floor, and rint for x86
In-Reply-To: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com>
Message-ID: 

On Fri, 1 Mar 2024 19:11:58 GMT, Srinivas Vamsi Parasa wrote:

> The goal of this PR is to provide ~4x faster implementation of Math.ceil, Math.floor and Math.rint for x86_64 CPUs.
>
> Below is the performance data on an Intel Tiger Lake machine.
> > Benchmark (UseAVX=3) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup > -- | -- | -- | -- > MathBench.ceilDouble | 547979 | 2170198 | 3.96 > MathBench.floorDouble | 547979 | 2167459 | 3.96 > MathBench.rintDouble | 547962 | 2130499 | 3.89 > MathBench.addCeilFloorDouble | 501366 | 1754260 | 3.50 > > Benchmark (UseAVX=0) | Stock JDK (ops/ms) | This PR (ops/ms) | Speedup > -- | -- | -- | -- > MathBench.ceilDouble | 548492 | 2193497 | 4.00 > MathBench.floorDouble | 548485 | 2192813 | 4.00 > MathBench.rintDouble | 548488 | 2192578 | 4.00 > MathBench.addCeilFloorDouble | 501761 | 1644714 | 3.28 This pull request has now been integrated.
Changeset: 7c5e6e74 Author: vamsi-parasa Committer: Sandhya Viswanathan URL: https://git.openjdk.org/jdk/commit/7c5e6e74c8f559be919cea63ebf7004cda80ae75 Stats: 20 lines in 3 files changed: 8 ins; 11 del; 1 mod 8327147: Improve performance of Math ceil, floor, and rint for x86 Reviewed-by: jbhateja, sviswanathan, dlong ------------- PR: https://git.openjdk.org/jdk/pull/18089 From jbhateja at openjdk.org Fri Mar 8 01:14:58 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 8 Mar 2024 01:14:58 GMT Subject: RFR: 8327147: Improve performance of Math ceil, floor, and rint for x86 [v2] In-Reply-To: References: <1qAVM7lB1zwTbePizeC4TPK-VqOrmp6W8TLItA3OL5c=.9ff52797-fbb9-4814-bc3f-fe3227f1e5e3@github.com> Message-ID: On Tue, 5 Mar 2024 22:37:49 GMT, Dean Long wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> unify the implementation > > So if we can still generate the non-AVX encoding of > > `roundsd dst, src, mode` > > isn't there still a false dependency problem with `dst`? > @dean-long You bring up a very good point. The SSE instruction (roundsd dst, src, mode) also has a false dependency problem. 
This can be demonstrated by adding the following benchmark to MathBench.java: > > ``` > diff --git a/test/micro/org/openjdk/bench/java/lang/MathBench.java b/test/micro/org/openjdk/bench/java/lang/MathBench.java > index c7dde019154..feb472bba3d 100644 > --- a/test/micro/org/openjdk/bench/java/lang/MathBench.java > +++ b/test/micro/org/openjdk/bench/java/lang/MathBench.java > @@ -141,6 +141,11 @@ public double ceilDouble() { > return Math.ceil(double4Dot1); > } > > + @Benchmark > + public double useAfterCeilDouble() { > + return Math.ceil(double4Dot1) + Math.floor(double4Dot1); > + } > + > @Benchmark > public double copySignDouble() { > return Math.copySign(double81, doubleNegative12); > ``` > > The fix would be to do a pxor on dst before the SSE roundsd instruction, something like below: > > ``` > diff --git a/src/hotspot/cpu/x86/x86.ad b/src/hotspot/cpu/x86/x86.ad > index cf4aef83df2..eb6701f82a7 100644 > --- a/src/hotspot/cpu/x86/x86.ad > +++ b/src/hotspot/cpu/x86/x86.ad > @@ -3874,6 +3874,9 @@ instruct roundD_reg(legRegD dst, legRegD src, immU8 rmode) %{ > ins_cost(150); > ins_encode %{ > assert(UseSSE >= 4, "required"); > + if ((UseAVX == 0) && ($dst$$XMMRegister != $src$$XMMRegister)) { > + __ pxor($dst$$XMMRegister, $dst$$XMMRegister); > + } > __ roundsd($dst$$XMMRegister, $src$$XMMRegister, $rmode$$constant); > %} > ins_pipe(pipe_slow); > ``` FTR following link for more details on above issue https://github.com/openjdk/jdk/pull/16701#issuecomment-1815645570 ------------- PR Comment: https://git.openjdk.org/jdk/pull/18089#issuecomment-1984873081 From dlong at openjdk.org Fri Mar 8 01:44:54 2024 From: dlong at openjdk.org (Dean Long) Date: Fri, 8 Mar 2024 01:44:54 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v6] In-Reply-To: References: Message-ID: <7-xJ8ujbaK_K90zgAFMgdWGkpnN6u8o088Wdt-YCh88=.230e0470-32e0-4da6-a185-65682d4713bb@github.com> On Mon, 4 Mar 2024 09:12:12 GMT, Galder Zamarreño wrote: >>
Adding C1 intrinsic for primitive array clone invocations for aarch64 and x86 architectures. >> >> The intrinsic includes a change to avoid zeroing the newly allocated array because its contents are copied over within the same intrinsic with arraycopy. This means that the performance of primitive array clone exceeds that of primitive array copy. As an example, here are the microbenchmark results on darwin/aarch64: >> >> >> $ make test TEST="micro:java.lang.ArrayClone" MICRO="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 3.476 ± 0.018 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 3.740 ± 0.017 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 7.124 ± 0.010 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 39.301 ± 0.106 ns/op >> ArrayClone.byteClone 0 avgt 15 3.478 ± 0.008 ns/op >> ArrayClone.byteClone 10 avgt 15 3.562 ± 0.007 ns/op >> ArrayClone.byteClone 100 avgt 15 5.888 ± 0.206 ns/op >> ArrayClone.byteClone 1000 avgt 15 25.762 ± 0.203 ns/op >> ArrayClone.intArraycopy 0 avgt 15 3.199 ± 0.016 ns/op >> ArrayClone.intArraycopy 10 avgt 15 4.521 ± 0.008 ns/op >> ArrayClone.intArraycopy 100 avgt 15 17.429 ± 0.039 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 178.432 ± 0.777 ns/op >> ArrayClone.intClone 0 avgt 15 3.406 ± 0.016 ns/op >> ArrayClone.intClone 10 avgt 15 4.272 ± 0.006 ns/op >> ArrayClone.intClone 100 avgt 15 13.110 ± 0.122 ns/op >> ArrayClone.intClone 1000 avgt 15 113.196 ± 13.400 ns/op >> >> >> It also includes an optimization to avoid instantiating the array copy stub in scenarios like this. >> >> I run hotspot compiler tests successfully limiting them to C1 compilation darwin/aarch64, linux/x86_64 and linux/686. E.g. >> >> >> $ make test TEST="hotspot_compiler" JTREG="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> ...
>> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:hotspot_compiler 1234 1234 0 0 >> >> One question I had is what to do about non-primitive object arrays, see my [question](https://bugs.openjdk.org/browse/JDK-8302850?focusedId=14634879&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14634879) on the issue. @cl4es any thoughts? >> >>... > > Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits: > > - Merge branch 'master' into topic.0131.c1-array-clone > - Reserve necessary frame map space for clone use cases > - 8302850: C1 primitive array clone intrinsic in graph > > * Combine array length, new type array and arraycopy for clone in c1 graph. > * Add OmitCheckFlags to skip arraycopy checks. > * Instantiate ArrayCopyStub only if necessary. > * Avoid zeroing newly created arrays for clone. > * Add array null after c1 clone compilation test. > * Pass force reexecute to intrinsic via value stack. > This is needed to be able to deoptimize correctly this intrinsic. > * When new type array or array copy are used for the clone intrinsic, > their state needs to be based on the state before for deoptimization > to work as expected. > - Revert "8302850: Primitive array copy C1 intrinsic for aarch64 and x86" > > This reverts commit fe5d916724614391a685bbef58ea939c84197d07. > - 8302850: Link code emit infos for null check and alloc array > - 8302850: Null check array before getting its length > > * Added a jtreg test to verify the null check works. > Without the fix this test fails with a SEGV crash. > - 8302850: Force reexecuting clone in case of a deoptimization > > * Copy state including locals for clone > so that reexecution works as expected. > - 8302850: Avoid instantiating array copy stub for clone use cases > - 8302850: Primitive array copy C1 intrinsic for aarch64 and x86 > > * Clone calls that involve Phi nodes are not supported.
> * Add unimplemented stubs for other platforms. Your front-end changes require back-end changes, which are only implemented for x86 and aarch64. So you need a way to disable this for other platforms, or port the fix to all platforms. Minimizing the amount of platform-specific code required would also help. ------------- Changes requested by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17667#pullrequestreview-1923880793 From gcao at openjdk.org Fri Mar 8 02:54:56 2024 From: gcao at openjdk.org (Gui Cao) Date: Fri, 8 Mar 2024 02:54:56 GMT Subject: RFR: 8327426: RISC-V: Move alignment shim into initialize_header() in C1_MacroAssembler::allocate_array In-Reply-To: References: Message-ID: On Thu, 7 Mar 2024 02:30:20 GMT, Fei Yang wrote: >> Hi, I noticed that RISC-V missed this change from #11044 [1]: >> >> `I know @albertnetymk already touched on this but some thoughts on the unclear boundaries between the header and the data. My feeling is that the most pragmatic solution would be to have the header initialization always initialize up to the word aligned (up) header_size_in_bytes. (Similarly to how it is done for the instanceOop where the klass gap gets initialized with the header, even if it may be data.) And have the body initialization do the rest (word aligned to word aligned clear).` >> >> `This seems preferable than adding these extra alignment shims in-between the header and body/payload/data initialization. (I also tried moving the alignment fix into the body initialization, but it seems a little bit messier in the implementation.)` >> >> >> After this patch, it will be more consistent with other CPU platforms like X86 and ARM64. >> >> [1] https://github.com/openjdk/jdk/pull/11044#pullrequestreview-1894323275 >> >> ### Tests >> >> - [x] Run tier1-3 tests on SiFive unmatched (release) > > Thanks! @RealFYang Thanks for your review. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18131#issuecomment-1984954773 From gcao at openjdk.org Fri Mar 8 02:54:55 2024 From: gcao at openjdk.org (Gui Cao) Date: Fri, 8 Mar 2024 02:54:55 GMT Subject: RFR: 8327426: RISC-V: Move alignment shim into initialize_header() in C1_MacroAssembler::allocate_array In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 04:06:47 GMT, Gui Cao wrote: > Hi, I noticed that RISC-V missed this change from #11044 [1]: > > `I know @albertnetymk already touched on this but some thoughts on the unclear boundaries between the header and the data. My feeling is that the most pragmatic solution would be to have the header initialization always initialize up to the word aligned (up) header_size_in_bytes. (Similarly to how it is done for the instanceOop where the klass gap gets initialized with the header, even if it may be data.) And have the body initialization do the rest (word aligned to word aligned clear).` > > `This seems preferable than adding these extra alignment shims in-between the header and body/payload/data initialization. (I also tried moving the alignment fix into the body initialization, but it seems a little bit messier in the implementation.)` > > > After this patch, it will be more consistent with other CPU platforms like X86 and ARM64. > > [1] https://github.com/openjdk/jdk/pull/11044#pullrequestreview-1894323275 > > ### Tests > > - [x] Run tier1-3 tests on SiFive unmatched (release) linux-riscv64 builds fine locally. 
GHA failure is infrastructural: https://bugs.openjdk.org/browse/JDK-8326960 ------------- PR Comment: https://git.openjdk.org/jdk/pull/18131#issuecomment-1984954480 From gcao at openjdk.org Fri Mar 8 03:00:59 2024 From: gcao at openjdk.org (Gui Cao) Date: Fri, 8 Mar 2024 03:00:59 GMT Subject: Integrated: 8327426: RISC-V: Move alignment shim into initialize_header() in C1_MacroAssembler::allocate_array In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 04:06:47 GMT, Gui Cao wrote: > Hi, I noticed that RISC-V missed this change from #11044 [1]: > > `I know @albertnetymk already touched on this but some thoughts on the unclear boundaries between the header and the data. My feeling is that the most pragmatic solution would be to have the header initialization always initialize up to the word aligned (up) header_size_in_bytes. (Similarly to how it is done for the instanceOop where the klass gap gets initialized with the header, even if it may be data.) And have the body initialization do the rest (word aligned to word aligned clear).` > > `This seems preferable than adding these extra alignment shims in-between the header and body/payload/data initialization. (I also tried moving the alignment fix into the body initialization, but it seems a little bit messier in the implementation.)` > > > After this patch, it will be more consistent with other CPU platforms like X86 and ARM64. > > [1] https://github.com/openjdk/jdk/pull/11044#pullrequestreview-1894323275 > > ### Tests > > - [x] Run tier1-3 tests on SiFive unmatched (release) This pull request has now been integrated. 
Changeset: de428daf Author: Gui Cao Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/de428daf9adef5afe7347319f7a6f6732e9b6c4b Stats: 16 lines in 1 file changed: 6 ins; 7 del; 3 mod 8327426: RISC-V: Move alignment shim into initialize_header() in C1_MacroAssembler::allocate_array Reviewed-by: fyang ------------- PR: https://git.openjdk.org/jdk/pull/18131 From gcao at openjdk.org Fri Mar 8 03:24:55 2024 From: gcao at openjdk.org (Gui Cao) Date: Fri, 8 Mar 2024 03:24:55 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v2] In-Reply-To: References: Message-ID: On Wed, 28 Feb 2024 13:26:04 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review the patch to add support for some vector intrinsics? >> Also complement various tests on riscv. >> Thanks. >> >> ## Test >> test/hotspot/jtreg/compiler/vectorapi/ >> test/hotspot/jtreg/compiler/vectorization/ > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > modify test config src/hotspot/cpu/riscv/riscv_v.ad line 3237: > 3235: // VectorCastS2X, VectorUCastS2X > 3236: > 3237: instruct vcvtStoB(vReg dst, vReg src) %{ Hi, Should use vcvtStoX instead of vcvtStoB? src/hotspot/cpu/riscv/riscv_v.ad line 3245: > 3243: match(Set dst (VectorCastS2X src)); > 3244: effect(TEMP_DEF dst); > 3245: format %{ "vcvtStoB $dst, $src" %} And Here, vcvtStoX can be used instead of vcvtStoB. test/hotspot/jtreg/compiler/vectorapi/reshape/TestVectorCastRVV.java line 37: > 35: * @modules java.base/jdk.internal.misc > 36: * @summary Test that vector cast intrinsics work as intended on riscv (rvv). > 37: * @requires os.arch == "riscv64" & vm.cpu.features ~= ".*v,.*" is it possible to match rvc here? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1517116676 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1517117063 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1517117349 From jkarthikeyan at openjdk.org Fri Mar 8 03:26:12 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 8 Mar 2024 03:26:12 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v7] In-Reply-To: References: Message-ID: > Hi all, I've created this patch which aims to convert common integer minimum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`. While this transform doesn't generally improve code generation on its own, it simplifies control flow and creates new opportunities for vectorization. > > I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: > > Baseline Patch Improvement > Benchmark Mode Cnt Score Error Units Score Error Units > IfMinMax.testReductionInt avgt 15 500.307 ± 16.687 ns/op 509.383 ± 32.645 ns/op (no change)* > IfMinMax.testReductionLong avgt 15 493.184 ± 17.596 ns/op 513.587 ± 28.339 ns/op (no change)* > IfMinMax.testSingleInt avgt 15 3.588 ± 0.540 ns/op 2.965 ± 1.380 ns/op (no change) > IfMinMax.testSingleLong avgt 15 3.673 ± 0.128 ns/op 3.506 ± 0.590 ns/op (no change) > IfMinMax.testVectorInt avgt 15 340.425 ± 13.123 ns/op 59.689 ± 7.509 ns/op + 5.7x > IfMinMax.testVectorLong avgt 15 326.420 ± 15.554 ns/op 117.190 ± 5.622 ns/op + 2.8x > > > * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? > > The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated!
Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Move logic to CMoveNode::Ideal and improve IR test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17574/files - new: https://git.openjdk.org/jdk/pull/17574/files/2adebb73..f929239a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17574&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17574&range=05-06 Stats: 67 lines in 4 files changed: 15 ins; 32 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/17574.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17574/head:pull/17574 PR: https://git.openjdk.org/jdk/pull/17574 From jkarthikeyan at openjdk.org Fri Mar 8 03:26:12 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 8 Mar 2024 03:26:12 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v6] In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 06:13:02 GMT, Jasmine Karthikeyan wrote: >> Hi all, I've created this patch which aims to convert common integer minimum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`. While this transform doesn't generally improve code generation on its own, it simplifies control flow and creates new opportunities for vectorization. >> >> I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> IfMinMax.testReductionInt avgt 15 500.307 ± 16.687 ns/op 509.383 ± 32.645 ns/op (no change)* >> IfMinMax.testReductionLong avgt 15 493.184 ± 17.596 ns/op 513.587 ± 28.339 ns/op (no change)* >> IfMinMax.testSingleInt avgt 15 3.588 ± 0.540 ns/op 2.965 ± 1.380 ns/op (no change) >> IfMinMax.testSingleLong avgt 15 3.673 ± 0.128 ns/op 3.506 ±
0.590 ns/op (no change) >> IfMinMax.testVectorInt avgt 15 340.425 ± 13.123 ns/op 59.689 ± 7.509 ns/op + 5.7x >> IfMinMax.testVectorLong avgt 15 326.420 ± 15.554 ns/op 117.190 ± 5.622 ns/op + 2.8x >> >> >> * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? >> >> The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Change transform to work on CMoves Thanks for taking another look! I've pushed a commit that should address the points brought up in review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1984975912 From jkarthikeyan at openjdk.org Fri Mar 8 06:14:57 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 8 Mar 2024 06:14:57 GMT Subject: RFR: 8324655: Identify integer minimum and maximum patterns created with if statements [v7] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 03:26:12 GMT, Jasmine Karthikeyan wrote: >> Hi all, I've created this patch which aims to convert common integer minimum and maximum patterns created using if statements into Min and Max nodes. These patterns are usually in the form of `a > b ? a : b` and similar, as well as patterns such as `if (a > b) b = a;`. While this transform doesn't generally improve code generation on its own, it simplifies control flow and creates new opportunities for vectorization. >> >> I've created a benchmark for the PR, and I've attached some data from my (Zen 3) machine: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> IfMinMax.testReductionInt avgt 15 500.307 ± 16.687 ns/op 509.383 ± 32.645 ns/op (no change)* >> IfMinMax.testReductionLong avgt 15 493.184 ± 17.596 ns/op 513.587 ±
28.339 ns/op (no change)* >> IfMinMax.testSingleInt avgt 15 3.588 ± 0.540 ns/op 2.965 ± 1.380 ns/op (no change) >> IfMinMax.testSingleLong avgt 15 3.673 ± 0.128 ns/op 3.506 ± 0.590 ns/op (no change) >> IfMinMax.testVectorInt avgt 15 340.425 ± 13.123 ns/op 59.689 ± 7.509 ns/op + 5.7x >> IfMinMax.testVectorLong avgt 15 326.420 ± 15.554 ns/op 117.190 ± 5.622 ns/op + 2.8x >> >> >> * After writing this benchmark I discovered that the compiler doesn't seem to create some simple min/max reductions, even when using Math.min/max() directly. Is this known or should I create a followup RFE for this? >> >> The patch passes tier 1-3 testing on linux x64. Reviews or comments would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Move logic to CMoveNode::Ideal and improve IR test Actually, while experimenting with Min/Max identities I found a case that the current CMove code couldn't fully transform: private static long test(long a, long b) { return Math.max(a, Math.max(a, b)); } Currently, only the second `Math.max` is transformed into a CMove; the first one is left as-is. The issue seems to be that the CMove code is mistakenly trying to use the more conservative loop-based heuristic instead of the one for straight-line code, even though there is no loop. This seems to happen in two places, first here: https://github.com/openjdk/jdk/blob/de428daf9adef5afe7347319f7a6f6732e9b6c4b/src/hotspot/share/opto/loopopts.cpp#L701-L704 This logic seems to be inverted, as it's checking if the region's enclosing loop is the root of the loop tree, or otherwise not a loop. It seems to be `true` if it's *not* in a loop, and `false` when it *is* in a loop.
This also looks to be corroborated by the [JVM Anatomy Quarks on CMove](https://shipilev.net/jvm/anatomy-quarks/30-conditional-moves/) linked earlier where CMove only kicks in when the branch percent is >18% or <82%, which was the logic for loop CMoves before [JDK-8319451](https://bugs.openjdk.org/browse/JDK-8319451), even though `doCall` doesn't contain loops. I think this is a pretty simple fix to just invert the boolean expression. Then, there's a second place it happens, here: https://github.com/openjdk/jdk/blob/de428daf9adef5afe7347319f7a6f6732e9b6c4b/src/hotspot/share/opto/loopopts.cpp#L764-L775 Here, it sees if any consumers of the phi are a Cmp or Encode/DecodeNarrowOop, to delay the transform to split-if. In this case, the second If's Cmp consumes the phi so this code path is triggered. I'm less sure of what to do for this case, though. In this case I would say that it's being triggered in error, but there may be other cases where there is a benefit. I think the min/max transform should still be done after the CMove transform, but I think it'll be a good idea to look at this separately because it could have a widespread impact. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/17574#issuecomment-1985098406 From epeter at openjdk.org Fri Mar 8 06:18:00 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 06:18:00 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v3] In-Reply-To: References: <7T93QS_MjoovUHDvfnq9az88QJ64dRcdZPpE9HUj5sw=.1d9563bb-e010-4b32-b635-f13f88c4f683@github.com> Message-ID: <_y7_kzXP48f_sv3riZeAQ4iypp70t97a_5tRQy5VIfw=.f798f117-e84b-46ea-b894-717a6ef496bf@github.com> On Thu, 7 Mar 2024 14:15:21 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> missing string Extra -> Memory change > > src/hotspot/share/opto/vectorization.hpp line 456: > >> 454: class VLoopDependencyGraph : public StackObj { >> 455: private: >> 456: class DependencyNode; > > I'm not sure if we should declare classes in the middle of another class. Should we move the forward declaration to the top of the file as done in other places as well? I see this pattern in other places in the codebase: class ciTypeFlow : public ArenaObj { private: ciEnv* _env; ciMethod* _method; int _osr_bci; bool _has_irreducible_entry; const char* _failure_reason; public: class StateVector; class Loop; class Block; I think it makes sense to declare "internal" (private) classes at the beginning of the class. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1517208777 From epeter at openjdk.org Fri Mar 8 06:32:28 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 06:32:28 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v4] In-Reply-To: References: Message-ID: > Subtask of https://github.com/openjdk/jdk/pull/16620. > > **Goal** > > - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. > - Refactoring: replace linked-list edges with a compact array for each node. 
> - No behavioral change to vectorization. > > **Benchmark** > > I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). > All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, > ensuring that we spend a lot of time on the dependency graph compared to other components. > > Measured on `linux-x64` and turbo disabled. > > Measuring Compile time difference: > `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` > > TestGraph.java > > public class TestGraph { > static int RANGE = 100_000; > > public static void main(String[] args) { > int[] a = new int[RANGE]; > int[] b = new int[RANGE]; > for (int i = 0; i < 10_000; i++) { > test1(a, b, i % 100); > } > } > > static void test1(int[] a, int[] b, int offset) { > for (int i = 0; i < RANGE/16-200; i++) { > a[i * 16 + 0] = b[i * 16 + 0 + offset]; > a[i * 16 + 1] = b[i * 16 + 1 + offset]; > a[i * 16 + 2] = b[i * 16 + 2 + offset]; > a[i * 16 + 3] = b[i * 16 + 3 + offset]; > a[i * 16 + 4] = b[i * 16 + 4 + offset]; > a[i * 16 + 5] = b[i * 16 + 5 + offset]; > a[i * 16 + 6] = b[i * 16 + 6 + offset]; > a[i * 16 + 7] = b[i * 16 + 7 + offset]; > a[i * 16 + 8] = b[i * 16 + 8 + offset]; > a[i * 16 + 9] = b[i * 16 + 9 + offset]; > a[i * 16 + 10] = b[i * 16 + 10 + offset]; > a[i * 16 + 11] = b[i * 16 + 11 + offset]; > a[i * 16 + 12] = b[i * 16 + 12 + offset]; > a[i * 16 + 13] = b[i * 16 + 13 + offset]; > a[i * 16 + 14] = b[i * 16 + 14 + offset]; > a[i * 16 + 15] = b[i * 16 + 15 + offset]; > } > } > } > > > > Before: > > C2 Compile Time: 14.588 s > ... > IdealLoop: 13.670 s > AutoVectorize: 11.703 s``` > > After: > > C2 Compile Time: 14.468 s > ... 
> IdealLoop: 13.595 s > AutoVectorize: 11.539 s > > > Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print` > Before: `8_225_680 - 8_258_408` byt... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: add_node change for Christian ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17812/files - new: https://git.openjdk.org/jdk/pull/17812/files/31b65c6c..dd91e22e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=02-03 Stats: 6 lines in 1 file changed: 4 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/17812.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17812/head:pull/17812 PR: https://git.openjdk.org/jdk/pull/17812 From epeter at openjdk.org Fri Mar 8 06:36:22 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 06:36:22 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v5] In-Reply-To: References: Message-ID: > Subtask of https://github.com/openjdk/jdk/pull/16620. > > **Goal** > > - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. > - Refactoring: replace linked-list edges with a compact array for each node. > - No behavioral change to vectorization. > > **Benchmark** > > I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). > All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, > ensuring that we spend a lot of time on the dependency graph compared to other components. > > Measured on `linux-x64` and turbo disabled. 
> > Measuring Compile time difference: > `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` > > TestGraph.java > > public class TestGraph { > static int RANGE = 100_000; > > public static void main(String[] args) { > int[] a = new int[RANGE]; > int[] b = new int[RANGE]; > for (int i = 0; i < 10_000; i++) { > test1(a, b, i % 100); > } > } > > static void test1(int[] a, int[] b, int offset) { > for (int i = 0; i < RANGE/16-200; i++) { > a[i * 16 + 0] = b[i * 16 + 0 + offset]; > a[i * 16 + 1] = b[i * 16 + 1 + offset]; > a[i * 16 + 2] = b[i * 16 + 2 + offset]; > a[i * 16 + 3] = b[i * 16 + 3 + offset]; > a[i * 16 + 4] = b[i * 16 + 4 + offset]; > a[i * 16 + 5] = b[i * 16 + 5 + offset]; > a[i * 16 + 6] = b[i * 16 + 6 + offset]; > a[i * 16 + 7] = b[i * 16 + 7 + offset]; > a[i * 16 + 8] = b[i * 16 + 8 + offset]; > a[i * 16 + 9] = b[i * 16 + 9 + offset]; > a[i * 16 + 10] = b[i * 16 + 10 + offset]; > a[i * 16 + 11] = b[i * 16 + 11 + offset]; > a[i * 16 + 12] = b[i * 16 + 12 + offset]; > a[i * 16 + 13] = b[i * 16 + 13 + offset]; > a[i * 16 + 14] = b[i * 16 + 14 + offset]; > a[i * 16 + 15] = b[i * 16 + 15 + offset]; > } > } > } > > > > Before: > > C2 Compile Time: 14.588 s > ... > IdealLoop: 13.670 s > AutoVectorize: 11.703 s``` > > After: > > C2 Compile Time: 14.468 s > ... > IdealLoop: 13.595 s > AutoVectorize: 11.539 s > > > Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print` > Before: `8_225_680 - 8_258_408` byt... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: _depth -> _depths for Christian ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17812/files - new: https://git.openjdk.org/jdk/pull/17812/files/dd91e22e..1eeced11 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=03-04 Stats: 7 lines in 1 file changed: 0 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/17812.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17812/head:pull/17812 PR: https://git.openjdk.org/jdk/pull/17812 From epeter at openjdk.org Fri Mar 8 06:47:07 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 06:47:07 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v6] In-Reply-To: References: Message-ID: <2XwLSUCYSveLeQkqv0VynZ-UcjASyW_-jXpzOrjlGzg=.b5a6dda2-9d31-49df-a4c0-26b4f4945ef4@github.com> > Subtask of https://github.com/openjdk/jdk/pull/16620. > > **Goal** > > - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. > - Refactoring: replace linked-list edges with a compact array for each node. > - No behavioral change to vectorization. > > **Benchmark** > > I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). > All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, > ensuring that we spend a lot of time on the dependency graph compared to other components. > > Measured on `linux-x64` and turbo disabled. 
> > Measuring Compile time difference: > `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` > > TestGraph.java > > public class TestGraph { > static int RANGE = 100_000; > > public static void main(String[] args) { > int[] a = new int[RANGE]; > int[] b = new int[RANGE]; > for (int i = 0; i < 10_000; i++) { > test1(a, b, i % 100); > } > } > > static void test1(int[] a, int[] b, int offset) { > for (int i = 0; i < RANGE/16-200; i++) { > a[i * 16 + 0] = b[i * 16 + 0 + offset]; > a[i * 16 + 1] = b[i * 16 + 1 + offset]; > a[i * 16 + 2] = b[i * 16 + 2 + offset]; > a[i * 16 + 3] = b[i * 16 + 3 + offset]; > a[i * 16 + 4] = b[i * 16 + 4 + offset]; > a[i * 16 + 5] = b[i * 16 + 5 + offset]; > a[i * 16 + 6] = b[i * 16 + 6 + offset]; > a[i * 16 + 7] = b[i * 16 + 7 + offset]; > a[i * 16 + 8] = b[i * 16 + 8 + offset]; > a[i * 16 + 9] = b[i * 16 + 9 + offset]; > a[i * 16 + 10] = b[i * 16 + 10 + offset]; > a[i * 16 + 11] = b[i * 16 + 11 + offset]; > a[i * 16 + 12] = b[i * 16 + 12 + offset]; > a[i * 16 + 13] = b[i * 16 + 13 + offset]; > a[i * 16 + 14] = b[i * 16 + 14 + offset]; > a[i * 16 + 15] = b[i * 16 + 15 + offset]; > } > } > } > > > > Before: > > C2 Compile Time: 14.588 s > ... > IdealLoop: 13.670 s > AutoVectorize: 11.703 s``` > > After: > > C2 Compile Time: 14.468 s > ... > IdealLoop: 13.595 s > AutoVectorize: 11.539 s > > > Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print` > Before: `8_225_680 - 8_258_408` byt... 
Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: - Apply from Christian's suggestions Co-authored-by: Christian Hagedorn - remove body() accessor from VLoopDependencyGraph, use field directly ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17812/files - new: https://git.openjdk.org/jdk/pull/17812/files/1eeced11..c3915bd1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=04-05 Stats: 9 lines in 2 files changed: 3 ins; 3 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/17812.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17812/head:pull/17812 PR: https://git.openjdk.org/jdk/pull/17812 From epeter at openjdk.org Fri Mar 8 06:47:07 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 06:47:07 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v6] In-Reply-To: References: <7T93QS_MjoovUHDvfnq9az88QJ64dRcdZPpE9HUj5sw=.1d9563bb-e010-4b32-b635-f13f88c4f683@github.com> Message-ID: On Thu, 7 Mar 2024 14:19:23 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: >> >> - Apply from Christian's suggestions >> >> Co-authored-by: Christian Hagedorn >> - remove body() accessor from VLoopDependencyGraph, use field directly > > src/hotspot/share/opto/vectorization.hpp line 467: > >> 465: >> 466: // Node depth in DAG: bb_idx -> depth >> 467: GrowableArray<int> _depth; > > Suggestion: > > GrowableArray<int> _depths; done > src/hotspot/share/opto/vectorization.hpp line 469: > >> 467: GrowableArray<int> _depth; >> 468: >> 469: protected: > > Why is this protected? Ha. I thought I needed it for access by the inner class in `VLoopDependencyGraph::PredsIterator::next`. But I can directly access the field there. Removing the accessor.
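As background on the `_depths` bookkeeping discussed here: a node's depth in the dependency DAG is one more than the maximum depth of its predecessors. A minimal sketch in Java, with illustrative names (`DepGraph`, `preds`, `depths`) that are not the C2 code itself:

```java
// Sketch of the depth computation: each node stores a compact array of
// predecessor indices (the "compact array per node" this refactoring
// introduces), and depths are filled in one pass over the topologically
// ordered body. Illustrative only; not VLoopDependencyGraph.
public class DepGraph {
    private final int[][] preds;   // preds[i] = indices of predecessors of node i
    private final int[] depths;    // depths[i] = depth of node i in the DAG

    public DepGraph(int[][] preds) {
        this.preds = preds;
        this.depths = new int[preds.length];
        // Assumes nodes arrive in topological order, i.e. preds[i][j] < i.
        for (int i = 0; i < preds.length; i++) {
            int d = 0;
            for (int p : preds[i]) {
                d = Math.max(d, depths[p]);
            }
            depths[i] = (preds[i].length == 0) ? 0 : d + 1;
        }
    }

    public int depth(int i) { return depths[i]; }

    public static void main(String[] args) {
        // Diamond: 0 -> 1 -> 3 and 0 -> 2 -> 3.
        DepGraph g = new DepGraph(new int[][] { {}, {0}, {0}, {1, 2} });
        System.out.println(g.depth(0) + " " + g.depth(1) + " " + g.depth(3)); // 0 1 2
    }
}
```

With depths precomputed this way, a scheduling query "can node a precede node b" reduces to an integer comparison instead of a graph walk.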
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1517224320 PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1517223449 From epeter at openjdk.org Fri Mar 8 06:47:07 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 06:47:07 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v2] In-Reply-To: References: <0ngHbfu0p0-3CdGMe9393YGxCsR9w2vpuqa4WdtZc3s=.ec178db2-87de-47c0-aa8f-2bd1d2e818ef@github.com> Message-ID: On Thu, 7 Mar 2024 16:14:38 GMT, Christian Hagedorn wrote: >> src/hotspot/share/opto/vectorization.cpp line 225: >> >>> 223: DependencyNode* dn = new (_arena) DependencyNode(n, memory_pred_edges, _arena); >>> 224: _dependency_nodes.at_put_grow(_body.bb_idx(n), dn, nullptr); >>> 225: } >> >> The call to `add_node()` suggests that we add a node no matter what. I therefore suggest to either change `add_node` to something like `maybe_add_node` or do the check like that: >> >> if (memory_pred_edges.is_nonempty()) { >> add_node(n1, memory_pred_edges); >> } > > For completeness, should we also add a comment here or/and at `DependencyNode` that such a node is only created when there is no direct connection in the C2 memory graph since we would visit direct connections in `PredsIterator` anyways? Wrote this now: if (memory_pred_edges.is_nonempty()) { // Data edges are taken implicitly from the C2 graph, thus we only add // a dependency node if we have memory edges. add_node(n1, memory_pred_edges); } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17812#discussion_r1517226544 From epeter at openjdk.org Fri Mar 8 07:48:22 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 07:48:22 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v7] In-Reply-To: References: Message-ID: > Subtask of https://github.com/openjdk/jdk/pull/16620. 
> > **Goal** > > - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. > - Refactoring: replace linked-list edges with a compact array for each node. > - No behavioral change to vectorization. > > **Benchmark** > > I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). > All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, > ensuring that we spend a lot of time on the dependency graph compared to other components. > > Measured on `linux-x64` and turbo disabled. > > Measuring Compile time difference: > `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` > > TestGraph.java > > public class TestGraph { > static int RANGE = 100_000; > > public static void main(String[] args) { > int[] a = new int[RANGE]; > int[] b = new int[RANGE]; > for (int i = 0; i < 10_000; i++) { > test1(a, b, i % 100); > } > } > > static void test1(int[] a, int[] b, int offset) { > for (int i = 0; i < RANGE/16-200; i++) { > a[i * 16 + 0] = b[i * 16 + 0 + offset]; > a[i * 16 + 1] = b[i * 16 + 1 + offset]; > a[i * 16 + 2] = b[i * 16 + 2 + offset]; > a[i * 16 + 3] = b[i * 16 + 3 + offset]; > a[i * 16 + 4] = b[i * 16 + 4 + offset]; > a[i * 16 + 5] = b[i * 16 + 5 + offset]; > a[i * 16 + 6] = b[i * 16 + 6 + offset]; > a[i * 16 + 7] = b[i * 16 + 7 + offset]; > a[i * 16 + 8] = b[i * 16 + 8 + offset]; > a[i * 16 + 9] = b[i * 16 + 9 + offset]; > a[i * 16 + 10] = b[i * 16 + 10 + offset]; > a[i * 16 + 11] = b[i * 16 + 11 + offset]; > a[i * 16 + 12] = b[i * 16 + 12 + offset]; > a[i * 16 + 13] = b[i * 16 + 13 + offset]; > a[i * 16 + 14] = b[i * 16 + 14 + offset]; > a[i * 16 + 15] = b[i * 16 + 15 + offset]; > } > } > } > > > > Before: > > C2 Compile Time: 14.588 s > ... 
> IdealLoop: 13.670 s > AutoVectorize: 11.703 s``` > > After: > > C2 Compile Time: 14.468 s > ... > IdealLoop: 13.595 s > AutoVectorize: 11.539 s > > > Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print` > Before: `8_225_680 - 8_258_408` byt... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: rm trailing whitespaces from applied suggestion ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17812/files - new: https://git.openjdk.org/jdk/pull/17812/files/c3915bd1..cf4996b9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=05-06 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/17812.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17812/head:pull/17812 PR: https://git.openjdk.org/jdk/pull/17812 From epeter at openjdk.org Fri Mar 8 07:58:22 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 07:58:22 GMT Subject: RFR: 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence) [v6] In-Reply-To: References: Message-ID: > After `combine_pairs_to_longer_packs`, we sometimes create packs that are too long and cannot be vectorized. > There are multiple reason for that: > > - A pack does not "match" with a use of def pack, and we need to split it. Example: split Z: > > X X X X Y Y Y Y > Z Z Z Z Z Z Z Z > > > - A pack is not implemented in the current size, but would be implemented for a smaller size. Some operations are not implemented at max vector size, and we just need to split every pack to the smaller size that is implemented. Or we have a pack of a non-power-of-2 size, and need to split it down into smaller sizes that are power-of-2. > > - Packs can have pack internal dependence. This dependence happens at a certain "distance". 
If we split a pack to be smaller than that "distance", then the internal dependence disappears, and we have the desired mutual independence. Example: > https://github.com/openjdk/jdk/blob/9ee274b0480a6e8e399830fd40c34d99c5621c1b/test/hotspot/jtreg/compiler/loopopts/superword/TestSplitPacks.java#L657-L669 > > Note: Soon I will refactor the packset into a new `PackSet` class, and move the split / filter code there. > > **Further Work** > > [JDK-8309267](https://bugs.openjdk.org/browse/JDK-8309267) C2 SuperWord: some tests fail on KNL machines - fail to vectorize > The issue on KNL machines is that some operations only support vector widths that are smaller than the maximal vector length. Hence, we must split all other vectors that are uses/defs. This change here does exactly that, and so I will be able to put the `UseKNLSetting` in the IRFramework whitelist. I will do that in a separate RFE. Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 33 commits: - Merge branch 'master' into JDK-8309267 - Apply suggestions for comments by Vladimir - Update LoopArrayIndexComputeTest.java copyright year - Update src/hotspot/share/opto/superword.cpp - SplitStatus::Kind enum - SplitTask::Kind enum - manual merge - more fixes for TestSplitPacks.java - fix some IR rules in TestSplitPacks.java - fix MulAddS2I - ... 
and 23 more: https://git.openjdk.org/jdk/compare/de428daf...77e3d47a ------------- Changes: https://git.openjdk.org/jdk/pull/17848/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17848&range=05 Stats: 1268 lines in 5 files changed: 1206 ins; 23 del; 39 mod Patch: https://git.openjdk.org/jdk/pull/17848.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17848/head:pull/17848 PR: https://git.openjdk.org/jdk/pull/17848 From epeter at openjdk.org Fri Mar 8 07:59:20 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Mar 2024 07:59:20 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v8] In-Reply-To: References: Message-ID: > Subtask of https://github.com/openjdk/jdk/pull/16620. > > **Goal** > > - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. > - Refactoring: replace linked-list edges with a compact array for each node. > - No behavioral change to vectorization. > > **Benchmark** > > I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). > All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, > ensuring that we spend a lot of time on the dependency graph compared to other components. > > Measured on `linux-x64` and turbo disabled. 
> > Measuring Compile time difference: > `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` > > TestGraph.java > > public class TestGraph { > static int RANGE = 100_000; > > public static void main(String[] args) { > int[] a = new int[RANGE]; > int[] b = new int[RANGE]; > for (int i = 0; i < 10_000; i++) { > test1(a, b, i % 100); > } > } > > static void test1(int[] a, int[] b, int offset) { > for (int i = 0; i < RANGE/16-200; i++) { > a[i * 16 + 0] = b[i * 16 + 0 + offset]; > a[i * 16 + 1] = b[i * 16 + 1 + offset]; > a[i * 16 + 2] = b[i * 16 + 2 + offset]; > a[i * 16 + 3] = b[i * 16 + 3 + offset]; > a[i * 16 + 4] = b[i * 16 + 4 + offset]; > a[i * 16 + 5] = b[i * 16 + 5 + offset]; > a[i * 16 + 6] = b[i * 16 + 6 + offset]; > a[i * 16 + 7] = b[i * 16 + 7 + offset]; > a[i * 16 + 8] = b[i * 16 + 8 + offset]; > a[i * 16 + 9] = b[i * 16 + 9 + offset]; > a[i * 16 + 10] = b[i * 16 + 10 + offset]; > a[i * 16 + 11] = b[i * 16 + 11 + offset]; > a[i * 16 + 12] = b[i * 16 + 12 + offset]; > a[i * 16 + 13] = b[i * 16 + 13 + offset]; > a[i * 16 + 14] = b[i * 16 + 14 + offset]; > a[i * 16 + 15] = b[i * 16 + 15 + offset]; > } > } > } > > > > Before: > > C2 Compile Time: 14.588 s > ... > IdealLoop: 13.670 s > AutoVectorize: 11.703 s``` > > After: > > C2 Compile Time: 14.468 s > ... > IdealLoop: 13.595 s > AutoVectorize: 11.539 s > > > Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print` > Before: `8_225_680 - 8_258_408` byt... Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 24 additional commits since the last revision: - Merge branch 'master' into JDK-8325651 - rm trailing whitespaces from applied suggestion - Apply from Christian's suggestions Co-authored-by: Christian Hagedorn - remove body() accessor from VLoopDependencyGraph, use field directly - _depth -> _depths for Christian - add_node change for Christian - missing string Extra -> Memory change - rename extra -> memory - typo - fix depth of Phi node - ... and 14 more: https://git.openjdk.org/jdk/compare/3c412c1e...d89119e1 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17812/files - new: https://git.openjdk.org/jdk/pull/17812/files/cf4996b9..d89119e1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17812&range=06-07 Stats: 89095 lines in 1876 files changed: 10366 ins; 73809 del; 4920 mod Patch: https://git.openjdk.org/jdk/pull/17812.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17812/head:pull/17812 PR: https://git.openjdk.org/jdk/pull/17812 From duke at openjdk.org Fri Mar 8 09:56:06 2024 From: duke at openjdk.org (Oussama Louati) Date: Fri, 8 Mar 2024 09:56:06 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v6] In-Reply-To: References: Message-ID: > Completion of the first version of the migration for several tests. > > These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: > > - Generate constant pool entries filled with method handles and method types. > - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. > - Produce many invokedynamic instructions with a specific constant pool entry. 
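As background on the invokedynamic machinery these generated class files exercise, the bootstrap-method contract can be sketched with plain `java.lang.invoke` (illustrative code, not the migrated tests themselves):

```java
import java.lang.invoke.*;

// Sketch of the bootstrap-method contract behind an invokedynamic instruction:
// on first execution the JVM calls the bootstrap method, which must return a
// CallSite whose type matches the instruction's symbolic type. All names here
// are illustrative and not taken from the migrated tests.
public class IndyBootstrapSketch {
    public static CallSite bootstrap(MethodHandles.Lookup lookup, String name, MethodType type)
            throws NoSuchMethodException, IllegalAccessException {
        MethodHandle target = lookup.findStatic(IndyBootstrapSketch.class, name, type);
        return new ConstantCallSite(target);
    }

    public static int add(int a, int b) { return a + b; }

    public static void main(String[] args) throws Throwable {
        // Simulate the linkage the JVM performs for an invokedynamic whose
        // name is "add" and whose type is (int, int) -> int.
        MethodType type = MethodType.methodType(int.class, int.class, int.class);
        CallSite cs = bootstrap(MethodHandles.lookup(), "add", type);
        int result = (int) cs.dynamicInvoker().invokeExact(20, 22);
        System.out.println(result); // 42
    }
}
```

A malformed bootstrap method (wrong signature, or one that throws during linkage) surfaces as a `BootstrapMethodError` at the call site, which is the behavior the "incorrect bootstrap methods" mentioned above are generated to provoke.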
Oussama Louati has updated the pull request incrementally with one additional commit since the last revision: Refactor byte array parameter in generateBytecodes method ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17834/files - new: https://git.openjdk.org/jdk/pull/17834/files/89292423..5fd2d743 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=04-05 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/17834.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834 PR: https://git.openjdk.org/jdk/pull/17834 From mli at openjdk.org Fri Mar 8 12:06:05 2024 From: mli at openjdk.org (Hamlin Li) Date: Fri, 8 Mar 2024 12:06:05 GMT Subject: RFR: 8327689: RISC-V: adjust test filters of zfh extension Message-ID: Hi, Can you review this simple patch? Thanks FYI: test filter `vm.cpu.features ~= ".*zfh,.*"` could be adjusted to `vm.cpu.features ~= ".*zfh.*"` according to comment at https://github.com/openjdk/jdk/pull/17698#discussion_r1517349407 ------------- Commit messages: - Initial commit Changes: https://git.openjdk.org/jdk/pull/18169/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18169&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327689 Stats: 4 lines in 4 files changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/18169.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18169/head:pull/18169 PR: https://git.openjdk.org/jdk/pull/18169 From mli at openjdk.org Fri Mar 8 12:17:15 2024 From: mli at openjdk.org (Hamlin Li) Date: Fri, 8 Mar 2024 12:17:15 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v3] In-Reply-To: References: Message-ID: > Hi, > Can you help to review the patch to add support for some vector intrinsics? > Also complement various tests on riscv. > Thanks. 
> > ## Test > test/hotspot/jtreg/compiler/vectorapi/ > test/hotspot/jtreg/compiler/vectorization/ Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: fix typo ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18040/files - new: https://git.openjdk.org/jdk/pull/18040/files/594927fb..646955f0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18040.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18040/head:pull/18040 PR: https://git.openjdk.org/jdk/pull/18040 From mli at openjdk.org Fri Mar 8 12:17:15 2024 From: mli at openjdk.org (Hamlin Li) Date: Fri, 8 Mar 2024 12:17:15 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v2] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 03:20:34 GMT, Gui Cao wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> modify test config > > src/hotspot/cpu/riscv/riscv_v.ad line 3237: > >> 3235: // VectorCastS2X, VectorUCastS2X >> 3236: >> 3237: instruct vcvtStoB(vReg dst, vReg src) %{ > > Hi, Should use vcvtStoX instead of vcvtStoB? Thanks, fixed the typo. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1517640253 From fyang at openjdk.org Fri Mar 8 13:55:52 2024 From: fyang at openjdk.org (Fei Yang) Date: Fri, 8 Mar 2024 13:55:52 GMT Subject: RFR: 8327689: RISC-V: adjust test filters of zfh extension In-Reply-To: References: Message-ID: <9aXXsh96SkVbXDfscianv13ZteF0sdJ1if-NlPRwuZI=.11b73051-af20-4732-a500-6649027b1fd8@github.com> On Fri, 8 Mar 2024 12:01:02 GMT, Hamlin Li wrote: > Hi, > Can you review this simple patch? 
> Thanks > > FYI: > test filter `vm.cpu.features ~= ".*zfh,.*"` could be adjusted to `vm.cpu.features ~= ".*zfh.*"` according to comment at https://github.com/openjdk/jdk/pull/17698#discussion_r1517349407 Thanks! ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18169#pullrequestreview-1924928132 From ddong at openjdk.org Fri Mar 8 14:13:14 2024 From: ddong at openjdk.org (Denghui Dong) Date: Fri, 8 Mar 2024 14:13:14 GMT Subject: RFR: 8327693: C1: LIRGenerator::_instruction_for_operand is only read by assertion code Message-ID: Hi, Please help review this change that moves _instruction_for_operand into ASSERT block since it is only read by assertion code in c1_LinearScan.cpp. Thanks ------------- Commit messages: - 8327693: C1: LIRGenerator::_instruction_for_operand is only read by assertion code Changes: https://git.openjdk.org/jdk/pull/18170/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18170&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327693 Stats: 25 lines in 2 files changed: 12 ins; 9 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/18170.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18170/head:pull/18170 PR: https://git.openjdk.org/jdk/pull/18170 From duke at openjdk.org Fri Mar 8 17:17:19 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Fri, 8 Mar 2024 17:17:19 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v12] In-Reply-To: References: Message-ID: > The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. > > This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) > > This PR shows upto 19x speedup on buffer sizes of 1MB. 
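For readers unfamiliar with the algorithm being accelerated: the scalar Poly1305 computation (RFC 8439) can be modeled in a few lines of `BigInteger` arithmetic. This is a reference sketch for understanding what the AVX2/IFMA code vectorizes, not the JDK or ipsec-mb implementation:

```java
import java.math.BigInteger;
import java.util.Arrays;

// Reference model of Poly1305: key = r (16 bytes, clamped) || s (16 bytes).
// The message is split into 16-byte blocks, each extended with a 0x01 byte,
// and accumulated as a polynomial in r modulo 2^130 - 5; the tag is the
// low 128 bits of (acc + s). Illustrative model only.
public class Poly1305Model {
    private static final BigInteger P = BigInteger.TWO.pow(130).subtract(BigInteger.valueOf(5));
    private static final BigInteger CLAMP = new BigInteger("0ffffffc0ffffffc0ffffffc0fffffff", 16);
    private static final BigInteger MASK128 = BigInteger.TWO.pow(128).subtract(BigInteger.ONE);

    static BigInteger littleEndian(byte[] bytes) {
        byte[] be = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) be[i] = bytes[bytes.length - 1 - i];
        return new BigInteger(1, be);
    }

    public static byte[] tag(byte[] key, byte[] msg) {
        BigInteger r = littleEndian(Arrays.copyOfRange(key, 0, 16)).and(CLAMP);
        BigInteger s = littleEndian(Arrays.copyOfRange(key, 16, 32));
        BigInteger acc = BigInteger.ZERO;
        for (int off = 0; off < msg.length; off += 16) {
            byte[] block = Arrays.copyOfRange(msg, off, Math.min(off + 16, msg.length));
            byte[] padded = Arrays.copyOf(block, block.length + 1);
            padded[block.length] = 1;                    // append the 0x01 byte
            acc = acc.add(littleEndian(padded)).multiply(r).mod(P);
        }
        BigInteger t = acc.add(s).and(MASK128);          // tag = (acc + s) mod 2^128
        byte[] out = new byte[16];
        for (int i = 0; i < 16; i++) {                   // serialize little-endian
            out[i] = t.shiftRight(8 * i).byteValue();
        }
        return out;
    }
}
```

The vectorized code evaluates the same polynomial, but processes several blocks per iteration with the accumulator kept in 52-bit limbs so the partial products fit the `vpmadd52luq`/`vpmadd52huq` fused multiply-add lanes (a common IFMA implementation strategy; the PR's assembly is the authoritative version).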
Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 27 commits: - add missing avx_ifma in amd64.java - Merge branch 'master' of https://github.com/vamsi-parasa/jdk into jdk_poly - update asserts for vpmadd52l/hq - Update description of Poly1305 algo - add cpuinfo test for avx_ifma - fix checks for vpmadd52* - fix use_vl to true for vpmadd52* instrs - fix merge issues with avx_ifma - Merge branch 'master' of https://git.openjdk.java.net/jdk into jdk_poly - removed unused merge, faster and, redundant mov - ... and 17 more: https://git.openjdk.org/jdk/compare/5aae8030...35d39dc5 ------------- Changes: https://git.openjdk.org/jdk/pull/17881/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17881&range=11 Stats: 810 lines in 10 files changed: 800 ins; 0 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/17881.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17881/head:pull/17881 PR: https://git.openjdk.org/jdk/pull/17881 From duke at openjdk.org Fri Mar 8 17:19:58 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Fri, 8 Mar 2024 17:19:58 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v11] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 21:40:04 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. >> >> This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) >> >> This PR shows upto 19x speedup on buffer sizes of 1MB. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > update asserts for vpmadd52l/hq Planning to integrate this PR by Monday. 
Could you please let me know if there are any objections? ------------- PR Comment: https://git.openjdk.org/jdk/pull/17881#issuecomment-1986094404 From jbhateja at openjdk.org Fri Mar 8 17:58:55 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 8 Mar 2024 17:58:55 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: <_5z5emOe-VqjE7REHmk72wtJ-X_MUggxilrkXFUjdPo=.e30bafc3-0fc4-4872-a99c-f22e383301e3@github.com> References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> <_6dorzq67KAZsTBHBvbQRDi_xW70bFhJudnxbG88m6I=.33e06bd5-d5fc-4ba8-b740-437155d567cf@github.com> <_5z5emOe-VqjE7REHmk72wtJ-X_MUggxilrkXFUjdPo=.e30bafc3-0fc4-4872-a99c-f22e383301e3@github.com> Message-ID: An HTML attachment was scrubbed... URL: From duke at openjdk.org Fri Mar 8 18:36:55 2024 From: duke at openjdk.org (Joshua Cao) Date: Fri, 8 Mar 2024 18:36:55 GMT Subject: RFR: 8325674: Constant fold across compares [v3] In-Reply-To: References: Message-ID: On Thu, 7 Mar 2024 08:28:19 GMT, Emanuel Peter wrote: >> Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: >> >> comments with explanations and style changes > > src/hotspot/share/opto/subnode.cpp line 1586: > >> 1584: } >> 1585: } >> 1586: } > > This looks like heavy code duplication. Can you refactor this? Maybe a helper method? I can post a version of this so we can see what it looks like. I actually did this first, but the code got quite ugly. 
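For context on why this folding logic carries so many guard cases: moving a constant across a compare is only valid when overflow is excluded, which a small illustration makes concrete (illustrative Java, not the subnode.cpp code):

```java
// Why folding a constant across a compare needs an overflow check:
// "(x + 1) < 2" may NOT be rewritten to "x < 1" for all ints, because Java
// int addition wraps around. This illustrates the hazard the optimization
// must guard against; it is not the C2 code itself.
public class FoldAcrossCompare {
    public static boolean original(int x) { return (x + 1) < 2; }
    public static boolean folded(int x)   { return x < 1; }

    public static void main(String[] args) {
        // The two forms agree on ordinary values...
        System.out.println(original(0) == folded(0));          // true
        // ...but disagree when x + 1 overflows:
        int x = Integer.MAX_VALUE;                             // x + 1 wraps to MIN_VALUE
        System.out.println(original(x) + " " + folded(x));     // true false
    }
}
```

Each sign/direction combination of the add and the compare needs its own range check, which is one reason the C++ ends up with near-duplicate branches that are awkward to factor into a helper.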
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17853#discussion_r1518126089 From duke at openjdk.org Fri Mar 8 18:56:07 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Fri, 8 Mar 2024 18:56:07 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v13] In-Reply-To: References: Message-ID: > The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. > > This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) > > This PR shows upto 19x speedup on buffer sizes of 1MB. Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: make vpmadd52l/hq generic ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17881/files - new: https://git.openjdk.org/jdk/pull/17881/files/35d39dc5..4d3e0ebb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17881&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17881&range=11-12 Stats: 14 lines in 1 file changed: 14 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/17881.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17881/head:pull/17881 PR: https://git.openjdk.org/jdk/pull/17881 From sviswanathan at openjdk.org Fri Mar 8 23:44:52 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 8 Mar 2024 23:44:52 GMT Subject: RFR: 8327041: Incorrect lane size references in avx512 instructions. In-Reply-To: References: Message-ID: On Thu, 29 Feb 2024 11:09:09 GMT, Jatin Bhateja wrote: > - As per AVX-512 instruction format, a memory operand instruction can use compressed disp8*N encoding. 
> - For instructions which read/write an entire vector from/to memory, scaling factor (N) computation only takes into account vector length and is not dependent on vector lane sizes[1]. > - Patch fixes incorrect lane size references from various x86 assembler routines; this is not a functionality bug, but correcting the lane size will make the code compliant with the AVX-512 instruction format specification. > > [1] Intel SDM, Volume 2, Section 2.7.5 Table 2-35 > https://cdrdv2.intel.com/v1/dl/getContent/671200 Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18059#pullrequestreview-1926014748 From vlivanov at openjdk.org Sat Mar 9 02:33:03 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Sat, 9 Mar 2024 02:33:03 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v8] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 07:59:20 GMT, Emanuel Peter wrote: >> Subtask of https://github.com/openjdk/jdk/pull/16620. >> >> **Goal** >> >> - Make the dependency graph a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. >> - Refactoring: replace linked-list edges with a compact array for each node. >> - No behavioral change to vectorization. >> >> **Benchmark** >> >> I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). >> All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, >> ensuring that we spend a lot of time on the dependency graph compared to other components. >> >> Measured on `linux-x64` and turbo disabled.
>> >> Measuring Compile time difference: >> `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` >> >> TestGraph.java >> >> public class TestGraph { >> static int RANGE = 100_000; >> >> public static void main(String[] args) { >> int[] a = new int[RANGE]; >> int[] b = new int[RANGE]; >> for (int i = 0; i < 10_000; i++) { >> test1(a, b, i % 100); >> } >> } >> >> static void test1(int[] a, int[] b, int offset) { >> for (int i = 0; i < RANGE/16-200; i++) { >> a[i * 16 + 0] = b[i * 16 + 0 + offset]; >> a[i * 16 + 1] = b[i * 16 + 1 + offset]; >> a[i * 16 + 2] = b[i * 16 + 2 + offset]; >> a[i * 16 + 3] = b[i * 16 + 3 + offset]; >> a[i * 16 + 4] = b[i * 16 + 4 + offset]; >> a[i * 16 + 5] = b[i * 16 + 5 + offset]; >> a[i * 16 + 6] = b[i * 16 + 6 + offset]; >> a[i * 16 + 7] = b[i * 16 + 7 + offset]; >> a[i * 16 + 8] = b[i * 16 + 8 + offset]; >> a[i * 16 + 9] = b[i * 16 + 9 + offset]; >> a[i * 16 + 10] = b[i * 16 + 10 + offset]; >> a[i * 16 + 11] = b[i * 16 + 11 + offset]; >> a[i * 16 + 12] = b[i * 16 + 12 + offset]; >> a[i * 16 + 13] = b[i * 16 + 13 + offset]; >> a[i * 16 + 14] = b[i * 16 + 14 + offset]; >> a[i * 16 + 15] = b[i * 16 + 15 + offset]; >> } >> } >> } >> >> >> >> Before: >> >> C2 Compile Time: 14.588 s >> ... >> IdealLoop: 13.670 s >> AutoVectorize: 11.703 s``` >> >> After: >> >> C2 Compile Time: 14.468 s >> ... >> IdealLoop: 13.595 s >> ... > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 24 additional commits since the last revision: > > - Merge branch 'master' into JDK-8325651 > - rm trailing whitespaces from applied suggestion > - Apply from Christian's suggestions > > Co-authored-by: Christian Hagedorn > - remove body() accessor from VLoopDependencyGraph, use field directly > - _depth -> _depths for Christian > - add_node change for Christian > - missing string Extra -> Memory change > - rename extra -> memory > - typo > - fix depth of Phi node > - ... and 14 more: https://git.openjdk.org/jdk/compare/dcaa9b26...d89119e1 Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17812#pullrequestreview-1926075439 From vlivanov at openjdk.org Sat Mar 9 02:37:01 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Sat, 9 Mar 2024 02:37:01 GMT Subject: RFR: 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence) [v6] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 07:58:22 GMT, Emanuel Peter wrote: >> After `combine_pairs_to_longer_packs`, we sometimes create packs that are too long and cannot be vectorized. >> There are multiple reason for that: >> >> - A pack does not "match" with a use of def pack, and we need to split it. Example: split Z: >> >> X X X X Y Y Y Y >> Z Z Z Z Z Z Z Z >> >> >> - A pack is not implemented in the current size, but would be implemented for a smaller size. Some operations are not implemented at max vector size, and we just need to split every pack to the smaller size that is implemented. Or we have a pack of a non-power-of-2 size, and need to split it down into smaller sizes that are power-of-2. >> >> - Packs can have pack internal dependence. This dependence happens at a certain "distance". If we split a pack to be smaller than that "distance", then the internal dependence disappears, and we have the desired mutual independence. 
Example: >> https://github.com/openjdk/jdk/blob/9ee274b0480a6e8e399830fd40c34d99c5621c1b/test/hotspot/jtreg/compiler/loopopts/superword/TestSplitPacks.java#L657-L669 >> >> Note: Soon I will refactor the packset into a new `PackSet` class, and move the split / filter code there. >> >> **Further Work** >> >> [JDK-8309267](https://bugs.openjdk.org/browse/JDK-8309267) C2 SuperWord: some tests fail on KNL machines - fail to vectorize >> The issue on KNL machines is that some operations only support vector widths that are smaller than the maximal vector length. Hence, we must split all other vectors that are uses/defs. This change here does exactly that, and so I will be able to put the `UseKNLSetting` in the IRFramework whitelist. I will do that in a separate RFE. > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 33 commits: > > - Merge branch 'master' into JDK-8309267 > - Apply suggestions for comments by Vladimir > - Update LoopArrayIndexComputeTest.java copyright year > - Update src/hotspot/share/opto/superword.cpp > - SplitStatus::Kind enum > - SplitTask::Kind enum > - manual merge > - more fixes for TestSplitPacks.java > - fix some IR rules in TestSplitPacks.java > - fix MulAddS2I > - ... and 23 more: https://git.openjdk.org/jdk/compare/de428daf...77e3d47a Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17848#pullrequestreview-1926075988 From gli at openjdk.org Sat Mar 9 05:01:52 2024 From: gli at openjdk.org (Guoxiong Li) Date: Sat, 9 Mar 2024 05:01:52 GMT Subject: RFR: 8327689: RISC-V: adjust test filters of zfh extension In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 12:01:02 GMT, Hamlin Li wrote: > Hi, > Can you review this simple patch? 
> Thanks > > FYI: > test filter `vm.cpu.features ~= ".*zfh,.*"` could be adjusted to `vm.cpu.features ~= ".*zfh.*"` according to the comment at https://github.com/openjdk/jdk/pull/17698#discussion_r1517349407 Looks good. ------------- Marked as reviewed by gli (Committer). PR Review: https://git.openjdk.org/jdk/pull/18169#pullrequestreview-1926092454 From jbhateja at openjdk.org Sat Mar 9 07:13:55 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 9 Mar 2024 07:13:55 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v13] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 18:56:07 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. >> >> This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) >> >> This PR shows up to 19x speedup on buffer sizes of 1MB. > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > make vpmadd52l/hq generic Marked as reviewed by jbhateja (Reviewer).
------------- PR Review: https://git.openjdk.org/jdk/pull/17881#pullrequestreview-1926118001 From jbhateja at openjdk.org Sat Mar 9 07:13:55 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 9 Mar 2024 07:13:55 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v10] In-Reply-To: References: <9_AS9fKLjyVQzybg1915BX3xVVFwJRKjsqghjZ99hr0=.13a55d69-fc86-4109-a3be-934d1b8634d0@github.com> <_6dorzq67KAZsTBHBvbQRDi_xW70bFhJudnxbG88m6I=.33e06bd5-d5fc-4ba8-b740-437155d567cf@github.com> <_5z5emOe-VqjE7REHmk72wtJ-X_MUggxilrkXFUjdPo=.e30bafc3-0fc4-4872-a99c-f22e383301e3@github.com> Message-ID: On Fri, 8 Mar 2024 17:56:30 GMT, Jatin Bhateja wrote: > > [poly1305_spr_validation.patch](https://github.com/openjdk/jdk/files/14496404/poly1305_spr_validation.patch) > > Hi @vamsi-parasa , We do not want EVEX to VEX demotions for these newly added instructions on AVX512_IFMA targets since there are no VEX equivalent versions of these instructions. Please pick the relevant fixes for assembler routines from my above patch. As @sviswa7 mentioned, we should make these instructions generic. Thanks @vamsi-parasa ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1518500977 From jbhateja at openjdk.org Sat Mar 9 07:14:56 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 9 Mar 2024 07:14:56 GMT Subject: Integrated: 8327041: Incorrect lane size references in avx512 instructions. In-Reply-To: References: Message-ID: On Thu, 29 Feb 2024 11:09:09 GMT, Jatin Bhateja wrote: > - As per the AVX-512 instruction format, a memory operand instruction can use compressed disp8*N encoding. > - For instructions which read/write an entire vector from/to memory, the scaling factor (N) computation only takes into account the vector length and is not dependent on vector lane sizes[1].
> - Patch fixes incorrect lane size references from various x86 assembler routines. This is not a functionality bug, but correcting the lane size will make the code compliant with the AVX-512 instruction format specification. > > [1] Intel SDM, Volume 2, Section 2.7.5 Table 2-35 > https://cdrdv2.intel.com/v1/dl/getContent/671200 This pull request has now been integrated. Changeset: 2d4c757e Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/2d4c757e2e03b753135d564e9f2761052fdcb189 Stats: 81 lines in 1 file changed: 0 ins; 0 del; 81 mod 8327041: Incorrect lane size references in avx512 instructions. Reviewed-by: sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/18059 From gli at openjdk.org Sat Mar 9 09:12:56 2024 From: gli at openjdk.org (Guoxiong Li) Date: Sat, 9 Mar 2024 09:12:56 GMT Subject: RFR: 8327693: C1: LIRGenerator::_instruction_for_operand is only read by assertion code In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 14:08:57 GMT, Denghui Dong wrote: > Hi, > > Please help review this change that moves _instruction_for_operand into an ASSERT block since it is only read by assertion code in c1_LinearScan.cpp. > > Thanks Nice find. Looks good. ------------- Marked as reviewed by gli (Committer). PR Review: https://git.openjdk.org/jdk/pull/18170#pullrequestreview-1926165582 From ksakata at openjdk.org Mon Mar 11 01:39:53 2024 From: ksakata at openjdk.org (Koichi Sakata) Date: Mon, 11 Mar 2024 01:39:53 GMT Subject: RFR: 8323242: Remove vestigial DONT_USE_REGISTER_DEFINES In-Reply-To: References: Message-ID: <129mUYDenCcF6DZA7S2Ak8WkZfe8r7_VIyP2dB3VMug=.770cc17b-3de9-490f-bc71-cd77fdc973de@github.com> On Tue, 5 Mar 2024 08:07:19 GMT, Koichi Sakata wrote: > This pull request removes an unnecessary directive. > > There is no definition of DONT_USE_REGISTER_DEFINES in HotSpot or the build system, so this `#ifndef` conditional directive is always true. We can remove it. > > I built OpenJDK with Zero VM as a test. It was successful.
> > > $ ./configure --with-jvm-variants=zero --enable-debug > $ make images > $ ./build/macosx-aarch64-zero-fastdebug/jdk/bin/java -version > openjdk version "23-internal" 2024-09-17 > OpenJDK Runtime Environment (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk) > OpenJDK 64-Bit Zero VM (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk, interpreted mode) > > > It may be possible to remove the `#define noreg` as well because the CONSTANT_REGISTER_DECLARATION macro creates a variable named noreg, but I can't be sure. When I tried removing the noreg definition and building OpenJDK, the build was successful. Could someone please review this pull request? ------------- PR Comment: https://git.openjdk.org/jdk/pull/18115#issuecomment-1987477446 From gcao at openjdk.org Mon Mar 11 02:39:03 2024 From: gcao at openjdk.org (Gui Cao) Date: Mon, 11 Mar 2024 02:39:03 GMT Subject: RFR: 8327716: RISC-V: Change type of vector_length param of several assembler functions from int to uint Message-ID: Hi, we noticed that the return type of Matcher::vector_length is uint, but the type of the vector_length param of several assembler functions is int, which is not consistent. This should not affect functionality, but we should change the type of the vector_length param of several assembler functions from int to uint to make the code clean.
### Tests - [x] Run tier1-3 tests on LicheePI 4A (release) - [x] Run tier1-3 tests with -XX:+UseRVV on qemu 8.1.0 (release) ------------- Commit messages: - Merge remote-tracking branch 'upstream/master' into JDK-8327716 - 8327716: RISC-V: Change type of vector_length param of several assembler functions from int to uint Changes: https://git.openjdk.org/jdk/pull/18175/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18175&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327716 Stats: 23 lines in 3 files changed: 0 ins; 0 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/18175.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18175/head:pull/18175 PR: https://git.openjdk.org/jdk/pull/18175 From fyang at openjdk.org Mon Mar 11 02:55:52 2024 From: fyang at openjdk.org (Fei Yang) Date: Mon, 11 Mar 2024 02:55:52 GMT Subject: RFR: 8327716: RISC-V: Change type of vector_length param of several assembler functions from int to uint In-Reply-To: References: Message-ID: On Sat, 9 Mar 2024 09:39:38 GMT, Gui Cao wrote: > Hi, we noticed that the return type of Matcher::vector_length is uint, but the type of the vector_length param of several assembler functions is int, which is not consistent. This should not affect functionality, but we should change the type of the vector_length param of several assembler functions from int to uint to make the code clean. > > ### Tests > - [x] Run tier1-3 tests on LicheePI 4A (release) > - [x] Run tier1-3 tests with -XX:+UseRVV on qemu 8.1.0 (release) Looks good. Thanks! ------------- Marked as reviewed by fyang (Reviewer).
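The int-to-uint cleanup above is a C++ signedness-consistency concern. As a rough, hypothetical illustration (not code from the patch) of why mixing signed and unsigned integer views is error-prone, the same bit pattern compares differently under the two interpretations; Java has no unsigned primitives, so the unsigned view goes through `Integer`'s helper methods:

```java
public class SignednessDemo {
    public static void main(String[] args) {
        int len = -1; // same 32 bits as the unsigned value 4294967295

        // signed interpretation: -1 is less than 16
        System.out.println(len < 16);                              // true

        // unsigned interpretation: 0xFFFFFFFF is far greater than 16
        System.out.println(Integer.compareUnsigned(len, 16) > 0);  // true
        System.out.println(Integer.toUnsignedLong(len));           // 4294967295
    }
}
```

A vector length is never negative, so carrying it as `uint` end to end, as the patch does, removes exactly this kind of ambiguity at the int/uint boundaries.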
PR Review: https://git.openjdk.org/jdk/pull/18175#pullrequestreview-1926835684 From fyang at openjdk.org Mon Mar 11 04:04:53 2024 From: fyang at openjdk.org (Fei Yang) Date: Mon, 11 Mar 2024 04:04:53 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v3] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 12:17:15 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review the patch to add support for some vector intrinsics? >> Also complement various tests on riscv. >> Thanks. >> >> ## Test >> test/hotspot/jtreg/compiler/vectorapi/ >> test/hotspot/jtreg/compiler/vectorization/ > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > fix typo Hi, I have one comment after a brief look. src/hotspot/cpu/riscv/riscv_v.ad line 3397: > 3395: predicate(Matcher::vector_element_basic_type(n) == T_FLOAT); > 3396: match(Set dst (VectorCastL2X src)); > 3397: effect(TEMP_DEF dst); I see you added `TEMP_DEF dst` for some existing instructs like this one here. Do we really need it? I don't see such a need when reading the overlap constraints on vector operands from the RVV spec [1]: A destination vector register group can overlap a source vector register group only if one of the following holds:
- The destination EEW equals the source EEW.
- The destination EEW is smaller than the source EEW and the overlap is in the lowest-numbered part of the source register group (e.g., when LMUL=1, vnsrl.wi v0, v0, 3 is legal, but a destination of v1 is not).
- The destination EEW is greater than the source EEW, the source EMUL is at least 1, and the overlap is in the highest-numbered part of the destination register group (e.g., when LMUL=8, vzext.vf4 v0, v6 is legal, but a source of v0, v2, or v4 is not).
[1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#sec-vec-operands ------------- PR Review: https://git.openjdk.org/jdk/pull/18040#pullrequestreview-1926897564 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1519134205 From fyang at openjdk.org Mon Mar 11 04:11:52 2024 From: fyang at openjdk.org (Fei Yang) Date: Mon, 11 Mar 2024 04:11:52 GMT Subject: RFR: 8327689: RISC-V: adjust test filters of zfh extension In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 12:01:02 GMT, Hamlin Li wrote: > Hi, > Can you review this simple patch? > Thanks > > FYI: > test filter `vm.cpu.features ~= ".*zfh,.*"` could be adjusted to `vm.cpu.features ~= ".*zfh.*"` according to comment at https://github.com/openjdk/jdk/pull/17698#discussion_r1517349407 FYI: The GHA linux-cross-compile for linux-riscv64 is back working again. You might want to merge and retrigger the GHA. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18169#issuecomment-1987600853 From chagedorn at openjdk.org Mon Mar 11 07:09:00 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 11 Mar 2024 07:09:00 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v8] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 07:59:20 GMT, Emanuel Peter wrote: >> Subtask of https://github.com/openjdk/jdk/pull/16620. >> >> **Goal** >> >> - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. >> - Refactoring: replace linked-list edges with a compact array for each node. >> - No behavioral change to vectorization. >> >> **Benchmark** >> >> I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). >> All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, >> ensuring that we spend a lot of time on the dependency graph compared to other components. >> >> Measured on `linux-x64` and turbo disabled. 
>> >> Measuring Compile time difference: >> `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` >> >> TestGraph.java >> >> public class TestGraph { >> static int RANGE = 100_000; >> >> public static void main(String[] args) { >> int[] a = new int[RANGE]; >> int[] b = new int[RANGE]; >> for (int i = 0; i < 10_000; i++) { >> test1(a, b, i % 100); >> } >> } >> >> static void test1(int[] a, int[] b, int offset) { >> for (int i = 0; i < RANGE/16-200; i++) { >> a[i * 16 + 0] = b[i * 16 + 0 + offset]; >> a[i * 16 + 1] = b[i * 16 + 1 + offset]; >> a[i * 16 + 2] = b[i * 16 + 2 + offset]; >> a[i * 16 + 3] = b[i * 16 + 3 + offset]; >> a[i * 16 + 4] = b[i * 16 + 4 + offset]; >> a[i * 16 + 5] = b[i * 16 + 5 + offset]; >> a[i * 16 + 6] = b[i * 16 + 6 + offset]; >> a[i * 16 + 7] = b[i * 16 + 7 + offset]; >> a[i * 16 + 8] = b[i * 16 + 8 + offset]; >> a[i * 16 + 9] = b[i * 16 + 9 + offset]; >> a[i * 16 + 10] = b[i * 16 + 10 + offset]; >> a[i * 16 + 11] = b[i * 16 + 11 + offset]; >> a[i * 16 + 12] = b[i * 16 + 12 + offset]; >> a[i * 16 + 13] = b[i * 16 + 13 + offset]; >> a[i * 16 + 14] = b[i * 16 + 14 + offset]; >> a[i * 16 + 15] = b[i * 16 + 15 + offset]; >> } >> } >> } >> >> >> >> Before: >> >> C2 Compile Time: 14.588 s >> ... >> IdealLoop: 13.670 s >> AutoVectorize: 11.703 s``` >> >> After: >> >> C2 Compile Time: 14.468 s >> ... >> IdealLoop: 13.595 s >> ... > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 24 additional commits since the last revision: > > - Merge branch 'master' into JDK-8325651 > - rm trailing whitespaces from applied suggestion > - Apply from Christian's suggestions > > Co-authored-by: Christian Hagedorn > - remove body() accessor from VLoopDependencyGraph, use field directly > - _depth -> _depths for Christian > - add_node change for Christian > - missing string Extra -> Memory change > - rename extra -> memory > - typo > - fix depth of Phi node > - ... and 14 more: https://git.openjdk.org/jdk/compare/75213358...d89119e1 Thanks for the updates! Looks good. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17812#pullrequestreview-1927050290 From epeter at openjdk.org Mon Mar 11 07:15:05 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 11 Mar 2024 07:15:05 GMT Subject: RFR: 8325651: C2 SuperWord: refactor the dependency graph [v8] In-Reply-To: References: Message-ID: <1wli0OuvENorCvLhLQhsba5e81CXKuDOuBLgh_2i75U=.031681ce-4e40-48f6-8771-c318c422f88e@github.com> On Mon, 11 Mar 2024 07:06:02 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 24 additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8325651 >> - rm trailing whitespaces from applied suggestion >> - Apply from Christian's suggestions >> >> Co-authored-by: Christian Hagedorn >> - remove body() accessor from VLoopDependencyGraph, use field directly >> - _depth -> _depths for Christian >> - add_node change for Christian >> - missing string Extra -> Memory change >> - rename extra -> memory >> - typo >> - fix depth of Phi node >> - ... and 14 more: https://git.openjdk.org/jdk/compare/05cf327d...d89119e1 > > Thanks for the updates! Looks good. Thanks @chhagedorn @iwanowww for the reviews! 
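The dependency graph refactoring reviewed in this thread stores, for each node, its predecessors in one compact array instead of a linked edge list, and derives a depth per node. The following is a simplified, hypothetical sketch of that idea — `computeDepths` and the array layout are illustrative and not the actual `VLoopDependencyGraph` API:

```java
import java.util.Arrays;

public class DepGraphSketch {
    // preds[i] holds the predecessor indices of node i, kept as one
    // compact array per node instead of a linked edge list
    static int[] computeDepths(int[][] preds) {
        int[] depths = new int[preds.length];
        // nodes are assumed topologically ordered, so one forward pass suffices
        for (int i = 0; i < preds.length; i++) {
            int d = 0;
            for (int p : preds[i]) {
                d = Math.max(d, depths[p]);
            }
            depths[i] = d + 1; // depth = 1 + max depth of predecessors
        }
        return depths;
    }

    public static void main(String[] args) {
        // 0 and 1 are independent loads; 2 depends on both; 3 depends on 2
        int[][] preds = { {}, {}, {0, 1}, {2} };
        System.out.println(Arrays.toString(computeDepths(preds))); // [1, 1, 2, 3]
    }
}
```

With the loop body in topological order a single forward pass computes all depths, and per-node arrays keep the edges contiguous in memory, which is the kind of compaction the benchmark above is measuring.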
------------- PR Comment: https://git.openjdk.org/jdk/pull/17812#issuecomment-1987758567 From epeter at openjdk.org Mon Mar 11 07:15:06 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 11 Mar 2024 07:15:06 GMT Subject: Integrated: 8325651: C2 SuperWord: refactor the dependency graph In-Reply-To: References: Message-ID: <-1qOCk3kq-DdEVHi91xUNrbVkxD4IRCFlLekmfUHqRM=.4d705f00-32d6-4ce7-b7fc-1d5b8caf43bf@github.com> On Mon, 12 Feb 2024 16:24:30 GMT, Emanuel Peter wrote: > Subtask of https://github.com/openjdk/jdk/pull/16620. > > **Goal** > > - Make the dependency graph, make it a module of `VLoopAnalyzer` -> `VLoopDependencyGraph`. > - Refactoring: replace linked-list edges with a compact array for each node. > - No behavioral change to vectorization. > > **Benchmark** > > I have a large loop body (extra large with hand-unrolling and ` -XX:LoopUnrollLimit=1000`). > All stores are connected to all previous loads, effectively creating an `O(n^2)` size graph, > ensuring that we spend a lot of time on the dependency graph compared to other components. > > Measured on `linux-x64` and turbo disabled. 
> > Measuring Compile time difference: > `/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestGraph::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=20 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph.java` > > TestGraph.java > > public class TestGraph { > static int RANGE = 100_000; > > public static void main(String[] args) { > int[] a = new int[RANGE]; > int[] b = new int[RANGE]; > for (int i = 0; i < 10_000; i++) { > test1(a, b, i % 100); > } > } > > static void test1(int[] a, int[] b, int offset) { > for (int i = 0; i < RANGE/16-200; i++) { > a[i * 16 + 0] = b[i * 16 + 0 + offset]; > a[i * 16 + 1] = b[i * 16 + 1 + offset]; > a[i * 16 + 2] = b[i * 16 + 2 + offset]; > a[i * 16 + 3] = b[i * 16 + 3 + offset]; > a[i * 16 + 4] = b[i * 16 + 4 + offset]; > a[i * 16 + 5] = b[i * 16 + 5 + offset]; > a[i * 16 + 6] = b[i * 16 + 6 + offset]; > a[i * 16 + 7] = b[i * 16 + 7 + offset]; > a[i * 16 + 8] = b[i * 16 + 8 + offset]; > a[i * 16 + 9] = b[i * 16 + 9 + offset]; > a[i * 16 + 10] = b[i * 16 + 10 + offset]; > a[i * 16 + 11] = b[i * 16 + 11 + offset]; > a[i * 16 + 12] = b[i * 16 + 12 + offset]; > a[i * 16 + 13] = b[i * 16 + 13 + offset]; > a[i * 16 + 14] = b[i * 16 + 14 + offset]; > a[i * 16 + 15] = b[i * 16 + 15 + offset]; > } > } > } > > > > Before: > > C2 Compile Time: 14.588 s > ... > IdealLoop: 13.670 s > AutoVectorize: 11.703 s``` > > After: > > C2 Compile Time: 14.468 s > ... > IdealLoop: 13.595 s > AutoVectorize: 11.539 s > > > Memory usage: `-XX:CompileCommand=memstat,TestGraph::test*,print` > Before: `8_225_680 - 8_258_408` byt... This pull request has now been integrated. 
Changeset: ca5ca85d Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/ca5ca85d2408abfcb8a37f16476dba13c3b474d0 Stats: 713 lines in 5 files changed: 285 ins; 404 del; 24 mod 8325651: C2 SuperWord: refactor the dependency graph Reviewed-by: chagedorn, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/17812 From epeter at openjdk.org Mon Mar 11 07:36:11 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 11 Mar 2024 07:36:11 GMT Subject: RFR: 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence) [v7] In-Reply-To: References: Message-ID: > After `combine_pairs_to_longer_packs`, we sometimes create packs that are too long and cannot be vectorized. > There are multiple reasons for that: > > - A pack does not "match" with a use or def pack, and we need to split it. Example: split Z: > > X X X X Y Y Y Y > Z Z Z Z Z Z Z Z > > > - A pack is not implemented in the current size, but would be implemented for a smaller size. Some operations are not implemented at max vector size, and we just need to split every pack to the smaller size that is implemented. Or we have a pack of a non-power-of-2 size, and need to split it down into smaller sizes that are power-of-2. > > - Packs can have pack internal dependence. This dependence happens at a certain "distance". If we split a pack to be smaller than that "distance", then the internal dependence disappears, and we have the desired mutual independence. Example: > https://github.com/openjdk/jdk/blob/9ee274b0480a6e8e399830fd40c34d99c5621c1b/test/hotspot/jtreg/compiler/loopopts/superword/TestSplitPacks.java#L657-L669 > > Note: Soon I will refactor the packset into a new `PackSet` class, and move the split / filter code there.
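The third bullet of the description above — splitting until the pack-internal dependence "distance" is no longer reached — can be sketched in plain Java. This is an illustrative model, not SuperWord code: `mutuallyIndependent` is a stand-in for the real independence check, and packs are represented only by their sizes:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

public class SplitDistanceDemo {
    // Stand-in for the real check: a pack of `size` consecutive iterations
    // has no internal dependency as long as it stays below the dependence
    // "distance" of the loop.
    static boolean mutuallyIndependent(int size, int distance) {
        return size <= distance;
    }

    // Repeatedly split a pack in half until every piece is mutually
    // independent (packs of size 1 are trivially independent).
    static List<Integer> splitUntilIndependent(int size, int distance) {
        List<Integer> result = new ArrayList<>();
        ArrayDeque<Integer> work = new ArrayDeque<>();
        work.push(size);
        while (!work.isEmpty()) {
            int s = work.pop();
            if (s == 1 || mutuallyIndependent(s, distance)) {
                result.add(s);
            } else {
                work.push(s - s / 2); // right half, processed later
                work.push(s / 2);     // left half, processed first
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A pack of 8 with dependence distance 3 ends up as four packs of 2.
        System.out.println(splitUntilIndependent(8, 3)); // [2, 2, 2, 2]
    }
}
```

Splitting in half mirrors the strategy discussed in the review thread: each halving either reaches mutual independence or is repeated on the halves.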
> > **Further Work** > > [JDK-8309267](https://bugs.openjdk.org/browse/JDK-8309267) C2 SuperWord: some tests fail on KNL machines - fail to vectorize > The issue on KNL machines is that some operations only support vector widths that are smaller than the maximal vector length. Hence, we must split all other vectors that are uses/defs. This change here does exactly that, and so I will be able to put the `UseKNLSetting` in the IRFramework whitelist. I will do that in a separate RFE. Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 34 commits: - manual merge master - Merge branch 'master' into JDK-8309267 - Apply suggestions for comments by Vladimir - Update LoopArrayIndexComputeTest.java copyright year - Update src/hotspot/share/opto/superword.cpp - SplitStatus::Kind enum - SplitTask::Kind enum - manual merge - more fixes for TestSplitPacks.java - fix some IR rules in TestSplitPacks.java - ... and 24 more: https://git.openjdk.org/jdk/compare/ca5ca85d...efab8718 ------------- Changes: https://git.openjdk.org/jdk/pull/17848/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17848&range=06 Stats: 1268 lines in 5 files changed: 1206 ins; 23 del; 39 mod Patch: https://git.openjdk.org/jdk/pull/17848.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17848/head:pull/17848 PR: https://git.openjdk.org/jdk/pull/17848 From fyang at openjdk.org Mon Mar 11 07:45:58 2024 From: fyang at openjdk.org (Fei Yang) Date: Mon, 11 Mar 2024 07:45:58 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v3] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 12:17:15 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review the patch to add support for some vector intrinsics? >> Also complement various tests on riscv. >> Thanks. 
>> >> ## Test >> test/hotspot/jtreg/compiler/vectorapi/ >> test/hotspot/jtreg/compiler/vectorization/ > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > fix typo src/hotspot/cpu/riscv/riscv_v.ad line 3220: > 3218: ins_encode %{ > 3219: BasicType bt = Matcher::vector_element_basic_type(this); > 3220: if (is_floating_point_type(bt)) { Could `bt` (the vector element basic type) be a floating point type for the `VectorUCastB2X` node? I see our aarch64 counterpart has this assertion: `assert(bt == T_SHORT || bt == T_INT || bt == T_LONG, "must be");` [1]. Same question for `VectorUCastS2X` and `VectorUCastI2X` nodes. [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L3752 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1519263293 From chagedorn at openjdk.org Mon Mar 11 10:10:00 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 11 Mar 2024 10:10:00 GMT Subject: RFR: 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence) [v7] In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 07:36:11 GMT, Emanuel Peter wrote: >> After `combine_pairs_to_longer_packs`, we sometimes create packs that are too long and cannot be vectorized. >> There are multiple reasons for that: >> >> - A pack does not "match" with a use or def pack, and we need to split it. Example: split Z: >> >> X X X X Y Y Y Y >> Z Z Z Z Z Z Z Z >> >> >> - A pack is not implemented in the current size, but would be implemented for a smaller size. Some operations are not implemented at max vector size, and we just need to split every pack to the smaller size that is implemented. Or we have a pack of a non-power-of-2 size, and need to split it down into smaller sizes that are power-of-2. >> >> - Packs can have pack internal dependence. This dependence happens at a certain "distance".
If we split a pack to be smaller than that "distance", then the internal dependence disappears, and we have the desired mutual independence. Example: >> https://github.com/openjdk/jdk/blob/9ee274b0480a6e8e399830fd40c34d99c5621c1b/test/hotspot/jtreg/compiler/loopopts/superword/TestSplitPacks.java#L657-L669 >> >> Note: Soon I will refactor the packset into a new `PackSet` class, and move the split / filter code there. >> >> **Further Work** >> >> [JDK-8309267](https://bugs.openjdk.org/browse/JDK-8309267) C2 SuperWord: some tests fail on KNL machines - fail to vectorize >> The issue on KNL machines is that some operations only support vector widths that are smaller than the maximal vector length. Hence, we must split all other vectors that are uses/defs. This change here does exactly that, and so I will be able to put the `UseKNLSetting` in the IRFramework whitelist. I will do that in a separate RFE. > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 34 commits: > > - manual merge master > - Merge branch 'master' into JDK-8309267 > - Apply suggestions for comments by Vladimir > - Update LoopArrayIndexComputeTest.java copyright year > - Update src/hotspot/share/opto/superword.cpp > - SplitStatus::Kind enum > - SplitTask::Kind enum > - manual merge > - more fixes for TestSplitPacks.java > - fix some IR rules in TestSplitPacks.java > - ... and 24 more: https://git.openjdk.org/jdk/compare/ca5ca85d...efab8718 Apart from some minor comments, looks good to me, too! src/hotspot/share/opto/superword.cpp line 1582: > 1580: } > 1581: > 1582: // Split packs that have mutual dependence, until all packs are mutually_independent. Suggestion: // Split packs that have a mutual dependency, until all packs are mutually_independent. src/hotspot/share/opto/superword.cpp line 1590: > 1588: if (!is_marked_reduction(pack->at(0)) && > 1589: !mutually_independent(pack)) { > 1590: // Split in half. 
Maybe you could add a comment here that splitting in half is a best guess/intuitive way to continue. src/hotspot/share/opto/superword.cpp line 3017: > 3015: if (!is_reduction_pack && > 3016: (!has_use_pack_superset(n0, n1) || > 3017: !has_use_pack_superset(n1, n0))) { I was first tricked by missing the inversion of the result. Maybe you can flip it and rename it to `has_no_use_pack_superset()`? src/hotspot/share/opto/superword.hpp line 339: > 337: const char* message() const { return _message; } > 338: > 339: int split_size() const { Should be `uint`: Suggestion: uint split_size() const { src/hotspot/share/opto/superword.hpp line 393: > 391: void split_packs(const char* split_name, SplitStrategy strategy); > 392: > 393: // Split packs at boundaries where left and right have different use or def packs. Just a general note: I'm not sure if you need to repeat the comments here when they are identical to the ones found at the definition in the source file. But I guess it does not hurt either. If you only want to keep one, I'd prefer to have the comments in the source file. ------------- Marked as reviewed by chagedorn (Reviewer).
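The header comment quoted above — "Split packs at boundaries where left and right have different use or def packs" — can be illustrated with a small stand-alone model. This is hypothetical code, not the SuperWord implementation; use-pack membership is reduced to a plain label per pack member:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitByUseDemo {
    // Split a pack wherever two adjacent members feed different use packs,
    // e.g. use ids [X, X, Y, Y] split the pack into two halves.
    static List<List<String>> splitAtUseBoundaries(List<String> pack, List<String> useIds) {
        List<List<String>> result = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (int i = 0; i < pack.size(); i++) {
            if (i > 0 && !useIds.get(i).equals(useIds.get(i - 1))) {
                result.add(current); // boundary: close the current pack
                current = new ArrayList<>();
            }
            current.add(pack.get(i));
        }
        result.add(current);
        return result;
    }

    public static void main(String[] args) {
        List<String> pack = List.of("Z0", "Z1", "Z2", "Z3");
        List<String> uses = List.of("X", "X", "Y", "Y");
        System.out.println(splitAtUseBoundaries(pack, uses)); // [[Z0, Z1], [Z2, Z3]]
    }
}
```

This reproduces the `X X Y Y` over `Z Z Z Z` picture from the PR description: the Z pack is split exactly at the X/Y boundary so each half matches a single use pack.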
PR Review: https://git.openjdk.org/jdk/pull/17848#pullrequestreview-1927154552 PR Review Comment: https://git.openjdk.org/jdk/pull/17848#discussion_r1519309037 PR Review Comment: https://git.openjdk.org/jdk/pull/17848#discussion_r1519459587 PR Review Comment: https://git.openjdk.org/jdk/pull/17848#discussion_r1519294500 PR Review Comment: https://git.openjdk.org/jdk/pull/17848#discussion_r1519296397 PR Review Comment: https://git.openjdk.org/jdk/pull/17848#discussion_r1519306208 From epeter at openjdk.org Mon Mar 11 10:32:10 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 11 Mar 2024 10:32:10 GMT Subject: RFR: 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence) [v8] In-Reply-To: References: Message-ID: > After `combine_pairs_to_longer_packs`, we sometimes create packs that are too long and cannot be vectorized. > There are multiple reasons for that: > > - A pack does not "match" with a use or def pack, and we need to split it. Example: split Z: > > X X X X Y Y Y Y > Z Z Z Z Z Z Z Z > > > - A pack is not implemented in the current size, but would be implemented for a smaller size. Some operations are not implemented at max vector size, and we just need to split every pack to the smaller size that is implemented. Or we have a pack of a non-power-of-2 size, and need to split it down into smaller sizes that are power-of-2. > > - Packs can have pack internal dependence. This dependence happens at a certain "distance". If we split a pack to be smaller than that "distance", then the internal dependence disappears, and we have the desired mutual independence. Example: > https://github.com/openjdk/jdk/blob/9ee274b0480a6e8e399830fd40c34d99c5621c1b/test/hotspot/jtreg/compiler/loopopts/superword/TestSplitPacks.java#L657-L669 > > Note: Soon I will refactor the packset into a new `PackSet` class, and move the split / filter code there.
> > **Further Work** > > [JDK-8309267](https://bugs.openjdk.org/browse/JDK-8309267) C2 SuperWord: some tests fail on KNL machines - fail to vectorize > The issue on KNL machines is that some operations only support vector widths that are smaller than the maximal vector length. Hence, we must split all other vectors that are uses/defs. This change here does exactly that, and so I will be able to put the `UseKNLSetting` in the IRFramework whitelist. I will do that in a separate RFE. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: for Christian ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17848/files - new: https://git.openjdk.org/jdk/pull/17848/files/efab8718..747e2f03 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17848&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17848&range=06-07 Stats: 10 lines in 2 files changed: 1 ins; 4 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/17848.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17848/head:pull/17848 PR: https://git.openjdk.org/jdk/pull/17848 From rcastanedalo at openjdk.org Mon Mar 11 10:47:15 2024 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 11 Mar 2024 10:47:15 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect Message-ID: This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`. 
Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015. #### Testing - tier1-7 (linux-aarch64 and macosx-x64) with `-XX:LockingMode=2`. ------------- Commit messages: - Add additional temporary register Changes: https://git.openjdk.org/jdk/pull/18183/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18183&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8326385 Stats: 8 lines in 1 file changed: 0 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/18183.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18183/head:pull/18183 PR: https://git.openjdk.org/jdk/pull/18183 From ddong at openjdk.org Mon Mar 11 11:21:18 2024 From: ddong at openjdk.org (Denghui Dong) Date: Mon, 11 Mar 2024 11:21:18 GMT Subject: RFR: 8327661: C1: Make RBP allocatable on x64 when PreserveFramePointer is disabled Message-ID: Hi, Could I have a review of this change that makes RBP allocatable in C1 register allocation when PreserveFramePointer is not enabled. There seems to be no reason that RBP cannot be used. Although the performance of C1 JIT code is not very critical, in my opinion, this change will not add compilation overhead. So maybe it is acceptable. I am not very sure if I have changed all the places that should be changed. Performance: I wrote a simple JMH benchmark included in this patch. On Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz Before this change: Benchmark Mode Cnt Score Error Units C1PreserveFramePointer.WithPreserveFramePointer.calculate avgt 16 15.270 ± 0.011 ns/op C1PreserveFramePointer.WithoutPreserveFramePointer.calculate avgt 16 14.479 ±
0.012 ns/op After this change: Benchmark Mode Cnt Score Error Units C1PreserveFramePointer.WithPreserveFramePointer.calculate avgt 16 15.264 ± 0.006 ns/op C1PreserveFramePointer.WithoutPreserveFramePointer.calculate avgt 16 14.057 ± 0.005 ns/op Testing: fastdebug tier1-4 on Linux x64 ------------- Commit messages: - add a jmh test - update comment - fix failure and update header - update comment - 8327661: C1: Make RBP allocatable on x64 when PreserveFramePointer is disabled Changes: https://git.openjdk.org/jdk/pull/18167/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18167&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327661 Stats: 134 lines in 5 files changed: 109 ins; 1 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/18167.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18167/head:pull/18167 PR: https://git.openjdk.org/jdk/pull/18167 From chagedorn at openjdk.org Mon Mar 11 11:25:55 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 11 Mar 2024 11:25:55 GMT Subject: RFR: 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence) [v8] In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 10:32:10 GMT, Emanuel Peter wrote: >> After `combine_pairs_to_longer_packs`, we sometimes create packs that are too long and cannot be vectorized. >> There are multiple reasons for that: >> >> - A pack does not "match" with a use or def pack, and we need to split it. Example: split Z: >> >> X X X X Y Y Y Y >> Z Z Z Z Z Z Z Z >> >> >> - A pack is not implemented in the current size, but would be implemented for a smaller size. Some operations are not implemented at max vector size, and we just need to split every pack to the smaller size that is implemented. Or we have a pack of a non-power-of-2 size, and need to split it down into smaller sizes that are power-of-2. >> >> - Packs can have pack internal dependence. This dependence happens at a certain "distance".
If we split a pack to be smaller than that "distance", then the internal dependence disappears, and we have the desired mutual independence. Example: >> https://github.com/openjdk/jdk/blob/9ee274b0480a6e8e399830fd40c34d99c5621c1b/test/hotspot/jtreg/compiler/loopopts/superword/TestSplitPacks.java#L657-L669 >> >> Note: Soon I will refactor the packset into a new `PackSet` class, and move the split / filter code there. >> >> **Further Work** >> >> [JDK-8309267](https://bugs.openjdk.org/browse/JDK-8309267) C2 SuperWord: some tests fail on KNL machines - fail to vectorize >> The issue on KNL machines is that some operations only support vector widths that are smaller than the maximal vector length. Hence, we must split all other vectors that are uses/defs. This change here does exactly that, and so I will be able to put the `UseKNLSetting` in the IRFramework whitelist. I will do that in a separate RFE. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > for Christian Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/17848#pullrequestreview-1927573214 From mli at openjdk.org Mon Mar 11 12:16:01 2024 From: mli at openjdk.org (Hamlin Li) Date: Mon, 11 Mar 2024 12:16:01 GMT Subject: RFR: 8327689: RISC-V: adjust test filters of zfh extension In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 04:09:03 GMT, Fei Yang wrote: > FYI: The GHA linux-cross-compile for linux-riscv64 is back working again. You might want to merge and retrigger the GHA. Thanks for reminding. Just FYI, it still failed, https://github.com/Hamlin-Li/jdk/actions/runs/8202918492/job/22510046373 Thanks @RealFYang @lgxbslgx for your reviewing. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18169#issuecomment-1988299203 PR Comment: https://git.openjdk.org/jdk/pull/18169#issuecomment-1988299670 From mli at openjdk.org Mon Mar 11 12:16:02 2024 From: mli at openjdk.org (Hamlin Li) Date: Mon, 11 Mar 2024 12:16:02 GMT Subject: Integrated: 8327689: RISC-V: adjust test filters of zfh extension In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 12:01:02 GMT, Hamlin Li wrote: > Hi, > Can you review this simple patch? > Thanks > > FYI: > test filter `vm.cpu.features ~= ".*zfh,.*"` could be adjusted to `vm.cpu.features ~= ".*zfh.*"` according to comment at https://github.com/openjdk/jdk/pull/17698#discussion_r1517349407 This pull request has now been integrated. Changeset: 680ac2ce Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/680ac2cebecf93e5924a441a5de6918cd7adf118 Stats: 4 lines in 4 files changed: 0 ins; 0 del; 4 mod 8327689: RISC-V: adjust test filters of zfh extension Reviewed-by: fyang, gli ------------- PR: https://git.openjdk.org/jdk/pull/18169 From aboldtch at openjdk.org Mon Mar 11 13:04:52 2024 From: aboldtch at openjdk.org (Axel Boldt-Christmas) Date: Mon, 11 Mar 2024 13:04:52 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 09:26:48 GMT, Roberto Castañeda Lozano wrote: > This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`.
> > Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015. > > #### Testing > > - tier1-7 (linux-aarch64 and macosx-x64) with `-XX:LockingMode=2`. lgtm. > * tier1-7 (linux-aarch64 and macosx-x64) with `-XX:LockingMode=2`. Guess it meant to say `macosx-aarch64` ------------- Marked as reviewed by aboldtch (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18183#pullrequestreview-1927766977 From dnsimon at openjdk.org Mon Mar 11 13:10:16 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 11 Mar 2024 13:10:16 GMT Subject: RFR: 8327790: Improve javadoc for ResolvedJavaType.hasFinalizableSubclass Message-ID: This PR fixes and clarifies the javadoc for `ResolvedJavaType.hasFinalizableSubclass()`. 
------------- Commit messages: - fix javadoc for return type of ResolvedJavaType.hasFinalizableSubclass Changes: https://git.openjdk.org/jdk/pull/18192/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18192&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327790 Stats: 5 lines in 1 file changed: 2 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18192.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18192/head:pull/18192 PR: https://git.openjdk.org/jdk/pull/18192 From rcastanedalo at openjdk.org Mon Mar 11 13:31:51 2024 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 11 Mar 2024 13:31:51 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 13:02:29 GMT, Axel Boldt-Christmas wrote: > lgtm. Thanks for reviewing, Axel! > > * tier1-7 (linux-aarch64 and macosx-x64) with `-XX:LockingMode=2`. > > Guess it meant to say `macosx-aarch64` Right, good catch, updated in the description. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18183#issuecomment-1988446387 From gdub at openjdk.org Mon Mar 11 13:40:55 2024 From: gdub at openjdk.org (Gilles Duboscq) Date: Mon, 11 Mar 2024 13:40:55 GMT Subject: RFR: 8327790: Improve javadoc for ResolvedJavaType.hasFinalizableSubclass In-Reply-To: References: Message-ID: <2JC64ehgjieSzyJgc42twIZ47NKvPm7WlYM08Jxl-hU=.009e98f2-9e4b-4e6f-ac99-a8562395b0d1@github.com> On Mon, 11 Mar 2024 13:02:00 GMT, Doug Simon wrote: > This PR fixes and clarifies the javadoc for `ResolvedJavaType.hasFinalizableSubclass()`. Marked as reviewed by gdub (Committer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/18192#pullrequestreview-1927851330 From rkennke at openjdk.org Mon Mar 11 14:03:52 2024 From: rkennke at openjdk.org (Roman Kennke) Date: Mon, 11 Mar 2024 14:03:52 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: References: Message-ID: <7ug7UA5M1tHuFRwxPJtKcQ4hntLKlj08Tfb8QyLKFRE=.6bddd1e2-e6fb-4e0e-84d1-5dbd7be5d226@github.com> On Mon, 11 Mar 2024 09:26:48 GMT, Roberto Castañeda Lozano wrote: > This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`. > > Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015. > > #### Testing > > - tier1-7 (linux-aarch64 and macosx-aarch64) with `-XX:LockingMode=2`. I've got a question. Also, what about the other arches? src/hotspot/cpu/aarch64/aarch64.ad line 16022: > 16020: %} > 16021: > 16022: instruct cmpFastLockLightweight(rFlagsReg cr, iRegP object, iRegP box, iRegPNoSp tmp, iRegPNoSp tmp2, iRegPNoSp tmp3) Do we need to specify the box register at all, if we never use it? It means that the register allocator assigns an actual register to it, right? This could be a problem in workloads that are both locking-intensive *and* with high register pressure.
You may just not see it with dacapo, etc, because aarch64 has so many registers to begin with. ------------- PR Review: https://git.openjdk.org/jdk/pull/18183#pullrequestreview-1927918533 PR Review Comment: https://git.openjdk.org/jdk/pull/18183#discussion_r1519772909 From aboldtch at openjdk.org Mon Mar 11 16:32:17 2024 From: aboldtch at openjdk.org (Axel Boldt-Christmas) Date: Mon, 11 Mar 2024 16:32:17 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: <7ug7UA5M1tHuFRwxPJtKcQ4hntLKlj08Tfb8QyLKFRE=.6bddd1e2-e6fb-4e0e-84d1-5dbd7be5d226@github.com> References: <7ug7UA5M1tHuFRwxPJtKcQ4hntLKlj08Tfb8QyLKFRE=.6bddd1e2-e6fb-4e0e-84d1-5dbd7be5d226@github.com> Message-ID: On Mon, 11 Mar 2024 14:01:42 GMT, Roman Kennke wrote: > Also, what about the other arches? RISC-V and x64/x86 both bind the box to a specific register so it can be (and is) specified as `USE_KILL`. PPC64 (and aarch64 after this PR) uses an extra register allocation, and does not kill the box. > Do we need to specify the box register at all, if we never use it? I believe that would require rewriting large parts of the C2 FastLockNode. It is modelled as a CmpNode.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18183#issuecomment-1988596188 From rcastanedalo at openjdk.org Mon Mar 11 16:32:22 2024 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 11 Mar 2024 16:32:22 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: <7ug7UA5M1tHuFRwxPJtKcQ4hntLKlj08Tfb8QyLKFRE=.6bddd1e2-e6fb-4e0e-84d1-5dbd7be5d226@github.com> References: <7ug7UA5M1tHuFRwxPJtKcQ4hntLKlj08Tfb8QyLKFRE=.6bddd1e2-e6fb-4e0e-84d1-5dbd7be5d226@github.com> Message-ID: On Mon, 11 Mar 2024 14:01:10 GMT, Roman Kennke wrote: >> This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`. >> >> Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015. >> >> #### Testing >> >> - tier1-7 (linux-aarch64 and macosx-aarch64) with `-XX:LockingMode=2`. > > src/hotspot/cpu/aarch64/aarch64.ad line 16022: > >> 16020: %} >> 16021: >> 16022: instruct cmpFastLockLightweight(rFlagsReg cr, iRegP object, iRegP box, iRegPNoSp tmp, iRegPNoSp tmp2, iRegPNoSp tmp3) > > Do we need to specify the box register at all, if we never use it? It means that the register allocator assigns an actual register to it, right? 
This could be a problem in workloads that are both locking-intensive *and* with high register pressure. You may just not see it with dacapo, etc, because aarch64 has so many registers to begin with. There is, unfortunately, no ADL construction to specify that an operand such as `box` is not used at all. Yes, the register allocator will assign a register to `box`. Avoiding this would require fairly intrusive changes in C2 (add lightweight locking-specific, single-input versions of the `FastLock` and `FastUnlock` nodes and adapt all the C2 logic that deals with them), which I think would be best addressed in a separate RFE (assuming the additional register pressure is a problem in practice). Furthermore, to my understanding, the box operand is likely to be needed again in the context of Lilliput's [OMWorld](https://bugs.openjdk.org/browse/JDK-8326750) sub-project. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18183#discussion_r1519942947 From kxu at openjdk.org Mon Mar 11 16:42:17 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Mon, 11 Mar 2024 16:42:17 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value Message-ID: This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) Currently the transformation of expressions with the patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in the `BoolNode::Ideal` function, with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that is better suited to the `BoolNode::Value` function. New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing.
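[Editorial note] The identity behind the transformation discussed above is easy to check in plain Java. The sketch below is purely illustrative and is not the jtreg IR test added by the PR; the class name `UnsignedMaskCompareDemo` is made up here. The point: `x & m` can only clear bits of `m`, so the result is never unsigned-greater than `m`.

```java
import java.util.Random;

// Illustrative check of the identity C2 folds: for any x and m,
// (x & m) keeps only a subset of m's bits, so unsigned (x & m) <= m.
public class UnsignedMaskCompareDemo {
    static boolean holds(int x, int m) {
        return Integer.compareUnsigned(x & m, m) <= 0;
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            if (!holds(r.nextInt(), r.nextInt())) {
                throw new AssertionError("identity violated");
            }
        }
        // Probe the edge cases explicitly as well.
        int[] edges = {0, -1, 1, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int x : edges) {
            for (int m : edges) {
                if (!holds(x, m)) {
                    throw new AssertionError("identity violated at edge case");
                }
            }
        }
        System.out.println("identity holds");
    }
}
```

Note that `Integer.compareUnsigned` is what makes this match the `u<=` (unsigned) comparison in the C2 pattern; with signed `<=` the identity would fail for negative masks.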
------------- Commit messages: - add license header - also test for correctness - exclude x86 from tests - refactor (x & m) u<= m transformation and add test Changes: https://git.openjdk.org/jdk/pull/18198/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327381 Stats: 111 lines in 2 files changed: 93 ins; 17 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18198.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18198/head:pull/18198 PR: https://git.openjdk.org/jdk/pull/18198 From epeter at openjdk.org Mon Mar 11 16:42:37 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 11 Mar 2024 16:42:37 GMT Subject: RFR: 8327423: C2 remove_main_post_loops: check if main-loop belongs to pre-loop, not just assert Message-ID: The assert was added in [JDK-8085832](https://bugs.openjdk.org/browse/JDK-8085832) (JDK9), by @rwestrel . And in [JDK-8297724](https://bugs.openjdk.org/browse/JDK-8297724) (JDK21), he made more empty loops be removed, and since then the attached regression test fails. ---------- **Problem** By the time we get to the assert, we already have had a series of Pre-Main-Post, unroll and empty-loop removal: the PURPLE main and post loops are already previously removed as empty-loops. At the time of the assert, the graph looks like this: ![image](https://github.com/openjdk/jdk/assets/32593061/cb36eda4-0684-4b79-8557-0fdd5973ab50) We are in `IdealLoopTree::remove_main_post_loops` with the PURPLE `298 CountedLoop` as the `cl` pre-loop. 
The loop-tree looks essentially like this: (rr) p _ltree_root->dump() Loop: N0/N0 has_sfpt Loop: N425/N431 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre sfpts={ 429 } Loop: N298/N301 profile_predicated predicated counted [0,int),+1 (4 iters) pre Loop: N200/N179 counted [int,100),+1 (2147483648 iters) main sfpts={ 171 } Loop: N398/N404 counted [int,100),+1 (4 iters) post sfpts={ 402 } This is basically: 415 pre orange 298 pre PURPLE 200 main orange 398 post orange From `298 pre PURPLE`, we try to find its main-loop, by looking at the `_next` info in the loop-tree. There, we find `200 main orange`: it is a main-loop that still has a pre-loop... ...but not the same pre-loop as `cl` -> the `assert` fires. It seems that we assume in the code, that we can check the `_next->_head`, and if: 1) it is a main-loop and 2) that main-loop still has a pre-loop then the current pre-loop "cl" must be the pre-loop of that found main-loop, i.e. `locate_pre_from_main(main_head)`. But this is NOT generally guaranteed by "PhaseIdealLoop::build_loop_tree". The loop-tree is correct here, and this is how it was arrived at: "415 CountedLoop" (pre orange) is visited, and its body traversed. "427 If" is traversed. Now the path splits. If we first took the "428 IfFalse" path, then we would visit "200 CountedLoop" (main orange), and "398 CountedLoop" (post orange) first. But we instead take "432 IfTrue" first, and hence visit "298 CountedLoop" (pre PURPLE) first. So depending on what turn we take at this "427 If", we either get the order: 415 pre orange 298 pre PURPLE 200 main orange 398 post orange (the one we get, and assert with) OR 415 pre orange 200 main orange 398 post orange 298 pre PURPLE (assert would not trigger, since we would have "_next == nullptr" and return) -------- **Solution** We need to convert the `assert` into a condition. If the condition fails, we have no main-loop, and can just return from `remove_main_post_loops`.
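[Editorial note] The sibling-order pitfall described above can be made concrete with a toy model. The sketch below is not HotSpot code and all names (`Loop`, `canRemoveMainPost`, `link`) are invented for illustration; it only models the idea that whether a pre-loop's `_next` sibling is its own main-loop depends on the order the tree-building traversal happened to produce, so the check must bail out instead of asserting.

```java
import java.util.List;

// Toy model of the loop-tree sibling check; names are hypothetical and
// deliberately do not match the HotSpot sources.
public class LoopTreeSketch {
    static final class Loop {
        final String name;
        final boolean isMain;
        Loop pre;   // for a main-loop: its pre-loop
        Loop next;  // next sibling in the loop tree
        Loop(String name, boolean isMain) { this.name = name; this.isMain = isMain; }
    }

    // The fixed logic: return false (rather than asserting) when the next
    // sibling is missing, is not a main-loop, or belongs to another pre-loop.
    static boolean canRemoveMainPost(Loop cl) {
        Loop sib = cl.next;
        if (sib == null || !sib.isMain) return false;
        return sib.pre == cl; // this condition was previously an assert
    }

    static void link(List<Loop> order) {
        for (int i = 0; i + 1 < order.size(); i++) order.get(i).next = order.get(i + 1);
    }

    public static void main(String[] args) {
        Loop preOrange = new Loop("415 pre orange", false);
        Loop prePurple = new Loop("298 pre PURPLE", false);
        Loop mainOrange = new Loop("200 main orange", true);
        Loop postOrange = new Loop("398 post orange", false);
        mainOrange.pre = preOrange;

        // The order from the bug report: the PURPLE pre-loop is followed by
        // the *orange* main-loop, so we must bail out, not assert.
        link(List.of(preOrange, prePurple, mainOrange, postOrange));
        System.out.println(canRemoveMainPost(prePurple)); // false: not its main-loop
        System.out.println(canRemoveMainPost(preOrange)); // false: next sibling is a pre-loop

        // The other traversal order: the main-loop directly follows its pre-loop.
        preOrange.next = mainOrange;
        System.out.println(canRemoveMainPost(preOrange)); // true
    }
}
```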
Note: if we have an empty pre-loop that still has a main-loop, then the loop-tree must have the pre-loop and main-loop adjacent, i.e. you can get from the pre-loop to its main-loop via `_next`. That is because the `build_loop_tree` traversal cannot take any other path. ------------- Commit messages: - 8327423 Changes: https://git.openjdk.org/jdk/pull/18200/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18200&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327423 Stats: 64 lines in 2 files changed: 61 ins; 2 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18200.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18200/head:pull/18200 PR: https://git.openjdk.org/jdk/pull/18200 From duke at openjdk.org Mon Mar 11 16:45:03 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 11 Mar 2024 16:45:03 GMT Subject: Integrated: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions In-Reply-To: References: Message-ID: On Thu, 15 Feb 2024 18:42:49 GMT, Srinivas Vamsi Parasa wrote: > The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. > > This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) > > This PR shows upto 19x speedup on buffer sizes of 1MB. This pull request has now been integrated. 
Changeset: 18de9321 Author: vamsi-parasa Committer: Sandhya Viswanathan URL: https://git.openjdk.org/jdk/commit/18de9321ce8722f244594b1ed3b62cd1421a7994 Stats: 824 lines in 10 files changed: 814 ins; 0 del; 10 mod 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions Reviewed-by: sviswanathan, jbhateja ------------- PR: https://git.openjdk.org/jdk/pull/17881 From never at openjdk.org Mon Mar 11 16:48:53 2024 From: never at openjdk.org (Tom Rodriguez) Date: Mon, 11 Mar 2024 16:48:53 GMT Subject: RFR: 8327790: Improve javadoc for ResolvedJavaType.hasFinalizableSubclass In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 13:02:00 GMT, Doug Simon wrote: > This PR fixes and clarifies the javadoc for `ResolvedJavaType.hasFinalizableSubclass()`. Marked as reviewed by never (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/18192#pullrequestreview-1928410210 From kxu at openjdk.org Mon Mar 11 16:52:07 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Mon, 11 Mar 2024 16:52:07 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v2] In-Reply-To: References: Message-ID: > This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) > > Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. > > New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. 
Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: fix test by adding the missing inversion also excluding negative values for unsigned comparison ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18198/files - new: https://git.openjdk.org/jdk/pull/18198/files/aa7fafb8..17a9dc37 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=00-01 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/18198.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18198/head:pull/18198 PR: https://git.openjdk.org/jdk/pull/18198 From never at openjdk.org Mon Mar 11 16:59:52 2024 From: never at openjdk.org (Tom Rodriguez) Date: Mon, 11 Mar 2024 16:59:52 GMT Subject: RFR: 8327790: Improve javadoc for ResolvedJavaType.hasFinalizableSubclass In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 13:02:00 GMT, Doug Simon wrote: > This PR fixes and clarifies the javadoc for `ResolvedJavaType.hasFinalizableSubclass()`. Fix is a trivial improvement to JavaDoc. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18192#issuecomment-1988953716 From dnsimon at openjdk.org Mon Mar 11 17:08:13 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 11 Mar 2024 17:08:13 GMT Subject: RFR: 8327790: Improve javadoc for ResolvedJavaType.hasFinalizableSubclass [v2] In-Reply-To: References: Message-ID: > This PR fixes and clarifies the javadoc for `ResolvedJavaType.hasFinalizableSubclass()`. 
Doug Simon has updated the pull request incrementally with one additional commit since the last revision: update year ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18192/files - new: https://git.openjdk.org/jdk/pull/18192/files/0df2978b..241c9b9c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18192&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18192&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18192.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18192/head:pull/18192 PR: https://git.openjdk.org/jdk/pull/18192 From dnsimon at openjdk.org Mon Mar 11 17:08:13 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 11 Mar 2024 17:08:13 GMT Subject: RFR: 8327790: Improve javadoc for ResolvedJavaType.hasFinalizableSubclass In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 13:02:00 GMT, Doug Simon wrote: > This PR fixes and clarifies the javadoc for `ResolvedJavaType.hasFinalizableSubclass()`. Thanks for the reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18192#issuecomment-1988972833 From dnsimon at openjdk.org Mon Mar 11 17:08:13 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 11 Mar 2024 17:08:13 GMT Subject: Integrated: 8327790: Improve javadoc for ResolvedJavaType.hasFinalizableSubclass In-Reply-To: References: Message-ID: <55yo2J9eifZi68SObJT7bF2R9jt2SpSdeAVdbJeNVQI=.e18cfd13-bcbf-4272-a15e-bee3941887a3@github.com> On Mon, 11 Mar 2024 13:02:00 GMT, Doug Simon wrote: > This PR fixes and clarifies the javadoc for `ResolvedJavaType.hasFinalizableSubclass()`. This pull request has now been integrated. 
Changeset: b9bc31f7 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/b9bc31f7206bfde3d27be01adec9a658e086b86e Stats: 6 lines in 1 file changed: 2 ins; 0 del; 4 mod 8327790: Improve javadoc for ResolvedJavaType.hasFinalizableSubclass Reviewed-by: gdub, never ------------- PR: https://git.openjdk.org/jdk/pull/18192 From dlong at openjdk.org Mon Mar 11 19:22:12 2024 From: dlong at openjdk.org (Dean Long) Date: Mon, 11 Mar 2024 19:22:12 GMT Subject: RFR: 8327661: C1: Make RBP allocatable on x64 when PreserveFramePointer is disabled In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 11:12:53 GMT, Denghui Dong wrote: > Hi, > > Could I have a review of this change that makes RBP allocatable in c1 register allocation when PreserveFramePointer is not enabled. > > There seems no reason that RBP cannot be used. Although the performance of c1 jit code is not very critical, in my opinion, this change will not add overhead of compilation. So maybe it is acceptable. > > I am not very sure if I have changed all the places that should be. > > Performance: > > I wrote a simple JMH included in this patch. > > On Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz > > Before this change: > > > Benchmark Mode Cnt Score Error Units > C1PreserveFramePointer.WithPreserveFramePointer.calculate avgt 16 15.270 ? 0.011 ns/op > C1PreserveFramePointer.WithoutPreserveFramePointer.calculate avgt 16 14.479 ? 0.012 ns/op > > > After this change: > > > Benchmark Mode Cnt Score Error Units > C1PreserveFramePointer.WithPreserveFramePointer.calculate avgt 16 15.264 ? 0.006 ns/op > C1PreserveFramePointer.WithoutPreserveFramePointer.calculate avgt 16 14.057 ? 0.005 ns/op > > > > Testing: fastdebug tier1-4 on Linux x64 src/hotspot/cpu/x86/c1_Defs_x86.hpp line 47: > 45: > 46: #ifdef _LP64 > 47: #define UNALLOCATED 3 // rsp, r15, r10 This affects pd_nof_caller_save_cpu_regs_frame_map below, but RBP is callee-saved, not caller-saved. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18167#discussion_r1520305676 From kvn at openjdk.org Mon Mar 11 20:04:12 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 11 Mar 2024 20:04:12 GMT Subject: RFR: 8323242: Remove vestigial DONT_USE_REGISTER_DEFINES In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 08:07:19 GMT, Koichi Sakata wrote: > This pull request removes an unnecessary directive. > > There is no definition of DONT_USE_REGISTER_DEFINES in HotSpot or the build system, so this `#ifndef` conditional directive is always true. We can remove it. > > I built OpenJDK with Zero VM as a test. It was successful. > > > $ ./configure --with-jvm-variants=zero --enable-debug > $ make images > $ ./build/macosx-aarch64-zero-fastdebug/jdk/bin/java -version > openjdk version "23-internal" 2024-09-17 > OpenJDK Runtime Environment (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk) > OpenJDK 64-Bit Zero VM (fastdebug build 23-internal-adhoc.jyukutyo.jyukutyo-jdk, interpreted mode) > > > It may be possible to remove the `#define noreg` as well because the CONSTANT_REGISTER_DECLARATION macro creates a variable named noreg, but I can't be sure. When I tried removing the noreg definition and building the OpenJDK, the build was successful. This was from these changes [JDK-8000780](https://github.com/openjdk/jdk/commit/e184d5cc4ec66640366d2d30d8dfaba74a1003a7) Maybe @rkennke remembers why he added it. Maybe it was for some debugging purpose.
------------- PR Comment: https://git.openjdk.org/jdk/pull/18115#issuecomment-1989328275 From jkarthikeyan at openjdk.org Mon Mar 11 21:16:15 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 11 Mar 2024 21:16:15 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v2] In-Reply-To: References: Message-ID: <7b1BIvQpmoLhSzWqQ7haDBTQU1NDuddEm1TK7AgWnwY=.0e5222cc-b20d-4a19-94db-9cad00c6dbff@github.com> On Mon, 11 Mar 2024 16:52:07 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. > > Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: > > fix test by adding the missing inversion > > also excluding negative values for unsigned comparison I think the cleanup looks good! I have mostly stylistic suggestions here. Also, the copyright header in `subnode.cpp` should be updated to read 2024. src/hotspot/share/opto/subnode.cpp line 1808: > 1806: // based on local information. If the input is constant, do it. > 1807: const Type* BoolNode::Value(PhaseGVN* phase) const { > 1808: Node *cmp = in(1); It's preferred to use `Type*` for pointer types, so this `Node *var` (and the others below) should be `Node* var`. 
src/hotspot/share/opto/subnode.cpp line 1809: > 1807: const Type* BoolNode::Value(PhaseGVN* phase) const { > 1808: Node *cmp = in(1); > 1809: if (cmp && cmp->is_Sub()) { Suggestion: if (cmp != nullptr && cmp->is_Sub()) { The `cmp` condition should be `cmp != nullptr`, to make it more clear what is being compared. test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java line 38: > 36: * @summary > 37: * @library /test/lib / > 38: * @run driver compiler.c2.TestBoolNodeGvn I think it'd be better to move the test to `c2.irTests`, so that it's grouped with other IR tests. Also, it would be good to add a `@bug` tag and fill out the `@summary` tag. ------------- PR Review: https://git.openjdk.org/jdk/pull/18198#pullrequestreview-1929183892 PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1520397799 PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1520394190 PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1520411065 From duke at openjdk.org Mon Mar 11 21:57:31 2024 From: duke at openjdk.org (Oussama Louati) Date: Mon, 11 Mar 2024 21:57:31 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v7] In-Reply-To: References: Message-ID: > Completion of the first version of the migration for several tests. > > These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: > > - Generate constant pool entries filled with method handles and method types. > - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. > - Produce many invokedynamic instructions with a specific constant pool entry. 
Oussama Louati has updated the pull request incrementally with one additional commit since the last revision: Converted regression Test for "MethodType leaks memory" to use Classfile API ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17834/files - new: https://git.openjdk.org/jdk/pull/17834/files/5fd2d743..74c14dd4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=05-06 Stats: 20 lines in 2 files changed: 5 ins; 5 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/17834.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834 PR: https://git.openjdk.org/jdk/pull/17834 From duke at openjdk.org Mon Mar 11 22:04:24 2024 From: duke at openjdk.org (Oussama Louati) Date: Mon, 11 Mar 2024 22:04:24 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v8] In-Reply-To: References: Message-ID: <_ZWd6EXhs0Wu4aHS48eHuC0M-Hy-ZabVdRMjH-KbT0Y=.95ff4c3e-85c4-4102-96f2-dba7a52e7251@github.com> > Completion of the first version of the migration for several tests. > > These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: > > - Generate constant pool entries filled with method handles and method types. > - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. > - Produce many invokedynamic instructions with a specific constant pool entry.
Oussama Louati has updated the pull request incrementally with one additional commit since the last revision: Use Java 11 version for class generation in regression test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17834/files - new: https://git.openjdk.org/jdk/pull/17834/files/74c14dd4..be3f49b6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=06-07 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/17834.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834 PR: https://git.openjdk.org/jdk/pull/17834 From dlong at openjdk.org Mon Mar 11 23:18:13 2024 From: dlong at openjdk.org (Dean Long) Date: Mon, 11 Mar 2024 23:18:13 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 09:26:48 GMT, Roberto Casta?eda Lozano wrote: > This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`. > > Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015. > > #### Testing > > - tier1-7 (linux-aarch64 and macosx-aarch64) with `-XX:LockingMode=2`. 
src/hotspot/cpu/aarch64/aarch64.ad line 16026: > 16024: predicate(LockingMode == LM_LIGHTWEIGHT); > 16025: match(Set cr (FastLock object box)); > 16026: effect(TEMP tmp, TEMP tmp2, TEMP tmp3); Why not use `box` as the temp instead of introducing a separate temp? Suggestion: effect(TEMP tmp, TEMP tmp2, TEMP box); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18183#discussion_r1520532174 From gcao at openjdk.org Tue Mar 12 01:29:24 2024 From: gcao at openjdk.org (Gui Cao) Date: Tue, 12 Mar 2024 01:29:24 GMT Subject: RFR: 8327716: RISC-V: Change type of vector_length param of several assembler functions from int to uint In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 02:53:12 GMT, Fei Yang wrote: >> Hi, we noticed that the return type of Matcher::vector_length is uint, but the type of the vector_length param of several assembler functions is int, which is not consistent. This should not affect functionality, but we should change the type of the vector_length param of several assembler functions from int to uint to make the code clean. >> >> ### Tests >> - [x] Run tier1-3 tests on LicheePI 4A (release) >> - [x] Run tier1-3 tests with -XX:+UseRVV on qemu 8.1.0 (release) > > Looks good. Thanks! @RealFYang : Thanks for your review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18175#issuecomment-1989739973 From gcao at openjdk.org Tue Mar 12 01:32:20 2024 From: gcao at openjdk.org (Gui Cao) Date: Tue, 12 Mar 2024 01:32:20 GMT Subject: Integrated: 8327716: RISC-V: Change type of vector_length param of several assembler functions from int to uint In-Reply-To: References: Message-ID: On Sat, 9 Mar 2024 09:39:38 GMT, Gui Cao wrote: > Hi, we noticed that the return type of Matcher::vector_length is uint, but the type of the vector_length param of several assembler functions is int, which is not consistent.
This should not affect functionality, but we should change the type of the vector_length param of several assembler functions from int to uint to make the code clean. > > ### Tests > - [x] Run tier1-3 tests on LicheePI 4A (release) > - [x] Run tier1-3 tests with -XX:+UseRVV on qemu 8.1.0 (release) This pull request has now been integrated. Changeset: 4d6235ed Author: Gui Cao Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/4d6235ed111178d31814763b0d23e372db2b3e1b Stats: 23 lines in 3 files changed: 0 ins; 0 del; 23 mod 8327716: RISC-V: Change type of vector_length param of several assembler functions from int to uint Reviewed-by: fyang ------------- PR: https://git.openjdk.org/jdk/pull/18175 From ksakata at openjdk.org Tue Mar 12 01:49:37 2024 From: ksakata at openjdk.org (Koichi Sakata) Date: Tue, 12 Mar 2024 01:49:37 GMT Subject: RFR: 8320404: Double whitespace in SubTypeCheckNode::dump_spec output Message-ID: This is a trivial change to remove an extra whitespace. A double whitespace is printed because method->print_short_name already adds a whitespace before the name. ### Test For testing, I modified the ProfileAtTypeCheck class to fail a test case and display the message. Specifically, I changed the number of the count element in the IR annotation below. @Test @IR(phase = { CompilePhase.AFTER_PARSING }, counts = { IRNode.SUBTYPE_CHECK, "1" }) @IR(phase = { CompilePhase.AFTER_MACRO_EXPANSION }, counts = { IRNode.CMP_P, "5", IRNode.LOAD_KLASS_OR_NKLASS, "2", IRNode.PARTIAL_SUBTYPE_CHECK, "1" }) public static void test15(Object o) { This change was only for testing, so I reverted back to the original code after the test. #### Execution Result Before the change: $ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java" ... 
Failed IR Rules (1) of Methods (1) ---------------------------------- 1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]: * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" > Phase "After Parsing": - counts: Graph contains wrong number of nodes: * Constraint 1: "(\d+(\s){2}(SubTypeCheck.*)+(\s){2}===.*)" - Failed comparison: [found] 1 = 11 [given] - Matched node: * 53 SubTypeCheck === _ 44 35 [[ 58 ]] profiled at: compiler.c2.irTests.ProfileAtTypeCheck::test15:5 !jvms: ProfileAtTypeCheck::test15 @ bci:5 (line 399) After the change: $ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java" ... Failed IR Rules (1) of Methods (1) ---------------------------------- 1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]: * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" > Phase "After Parsing": - counts: Graph contains wrong number of nodes: * Constraint 1: "(\d+(\s){2}(SubTypeCheck.*)+(\s){2}===.*)" - Failed comparison: [found] 1 = 11 [given] - Matched node: * 53 SubTypeCheck === _ 44 35 [[ 58 ]] profiled at: compiler.c2.irTests.ProfileAtTypeCheck::test15:5 !jvms: ProfileAtTypeCheck::test15 @ bci:5 (line 399) I was able to confirm that the output has been corrected. 
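The whitespace matters because the IR framework matches node dumps against exact-count whitespace regexes. A small stand-alone sketch of that sensitivity, reusing the constraint regex from the failure output above (the demo class itself is my own illustration, not part of the test):

```java
import java.util.regex.Pattern;

public class IrConstraintSpacing {
    // The counts constraint from the IR failure report above: the two
    // (\s){2} groups require exactly two whitespace characters on each
    // side of the node name in the dump.
    static final Pattern CONSTRAINT =
            Pattern.compile("(\\d+(\\s){2}(SubTypeCheck.*)+(\\s){2}===.*)");

    static boolean matches(String nodeDump) {
        return CONSTRAINT.matcher(nodeDump).find();
    }

    public static void main(String[] args) {
        // Double spaces around the node name, as C2 prints them: matches.
        System.out.println(matches("53  SubTypeCheck  === _ 44 35")); // true
        // Only a single space before "===": the constraint no longer matches.
        System.out.println(matches("53  SubTypeCheck === _ 44 35"));  // false
    }
}
```

So a stray extra (or missing) space in `dump_spec` output can silently change which IR constraints match.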
------------- Commit messages: - Remove an extra whitespace Changes: https://git.openjdk.org/jdk/pull/18181/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18181&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8320404 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18181.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18181/head:pull/18181 PR: https://git.openjdk.org/jdk/pull/18181 From kvn at openjdk.org Tue Mar 12 02:52:12 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 12 Mar 2024 02:52:12 GMT Subject: RFR: 8327423: C2 remove_main_post_loops: check if main-loop belongs to pre-loop, not just assert In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 15:56:38 GMT, Emanuel Peter wrote: > The assert was added in [JDK-8085832](https://bugs.openjdk.org/browse/JDK-8085832) (JDK9), by @rwestrel . And in [JDK-8297724](https://bugs.openjdk.org/browse/JDK-8297724) (JDK21), he made more empty loops be removed, and since then the attached regression test fails. > > ---------- > > **Problem** > > By the time we get to the assert, we already have had a series of Pre-Main-Post, unroll and empty-loop removal: > the PURPLE main and post loops are already previously removed as empty-loops. > > At the time of the assert, the graph looks like this: > ![image](https://github.com/openjdk/jdk/assets/32593061/cb36eda4-0684-4b79-8557-0fdd5973ab50) > > We are in `IdealLoopTree::remove_main_post_loops` with the PURPLE `298 CountedLoop` as the `cl` pre-loop. 
> > The loop-tree looks essentially like this: > > (rr) p _ltree_root->dump() > Loop: N0/N0 has_sfpt > Loop: N425/N431 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre sfpts={ 429 } > Loop: N298/N301 profile_predicated predicated counted [0,int),+1 (4 iters) pre > Loop: N200/N179 counted [int,100),+1 (2147483648 iters) main sfpts={ 171 } > Loop: N398/N404 counted [int,100),+1 (4 iters) post sfpts={ 402 } > > > This is basically: > > 415 pre orange > 298 pre PURPLE > 200 main orange > 398 post orange > > > From `298 pre PURPLE`, we try to find its main-loop, by looking at the `_next` info in the loop-tree. > There, we find `200 main orange`, it is a main-loop that still has a pre-loop... > ...but not the same pre-loop as `cl` -> the `assert` fires. > > It seems that we assume in the code, that we can check the `_next->_head`, and if: > 1) it is a main-loop and > 2) that main-loop still has a pre-loop > then the current pre-loop "cl" must be the pre-loop of that found main-loop, located via `_pre_from_main(main_head)`. > But this is NOT generally guaranteed by "PhaseIdealLoop::build_loop_tree". > > The loop-tree is correct here, and this is how it was arrived at: > "415 CountedLoop" (pre orange) is visited, and its body traversed. "427 If" is traversed. Now the path splits. > If we first took the "428 IfFalse" path, then we would visit "200 CountedLoop" (main orange), and "398 CountedLoop" (post orange) first. > But we instead take "432 IfTrue" first, and hence visit "298 CountedLoop" (pre PURPLE) first. > > So depending on what turn we take at this "427 If", we either get the order: > > > 415 pre orange > 298 pre PURPLE > 200 main orange > 398 post orange > > (the one we get, and assert with) > > OR > > > 415 pre orange > 200 main orange > 398 post orange > 298 pre PURPLE > > (assert would not tr... I agree with the fix. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/18200#pullrequestreview-1929992678 From galder at openjdk.org Tue Mar 12 05:35:12 2024 From: galder at openjdk.org (Galder Zamarreño) Date: Tue, 12 Mar 2024 05:35:12 GMT Subject: RFR: 8327361: Update some comments after JDK-8139457 [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 11:41:20 GMT, Roman Kennke wrote: >> A follow-up to [8139457: Relax alignment of array elements](https://bugs.openjdk.org/browse/JDK-8139457) ([PR](https://git.openjdk.org/jdk/pull/11044)), some comments need an update. > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > RISCV changes Marked as reviewed by galder (Author). ------------- PR Review: https://git.openjdk.org/jdk/pull/18120#pullrequestreview-1930137852 From galder at openjdk.org Tue Mar 12 05:35:13 2024 From: galder at openjdk.org (Galder Zamarreño) Date: Tue, 12 Mar 2024 05:35:13 GMT Subject: RFR: 8327361: Update some comments after JDK-8139457 [v2] In-Reply-To: References: Message-ID: On Thu, 7 Mar 2024 14:37:19 GMT, Roman Kennke wrote: >> I think the changes look fine, but looking closer to the original PR, src/hotspot/cpu/riscv/c1_MacroAssembler_riscv.hpp might also need adjusting. s390 and ppc are probably just fine. > > @galderz is it ok now? I assume it counts as trivial, too? @rkennke Yeah, ok now. Trivial too. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18120#issuecomment-1990584589 From aboldtch at openjdk.org Tue Mar 12 07:28:13 2024 From: aboldtch at openjdk.org (Axel Boldt-Christmas) Date: Tue, 12 Mar 2024 07:28:13 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 23:15:50 GMT, Dean Long wrote: >> This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`. >> >> Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015. >> >> #### Testing >> >> - tier1-7 (linux-aarch64 and macosx-aarch64) with `-XX:LockingMode=2`. > > src/hotspot/cpu/aarch64/aarch64.ad line 16026: > >> 16024: predicate(LockingMode == LM_LIGHTWEIGHT); >> 16025: match(Set cr (FastLock object box)); >> 16026: effect(TEMP tmp, TEMP tmp2, TEMP tmp3); > > Why not use `box` as the temp instead of introducing a separate temp? > Suggestion: > > effect(TEMP tmp, TEMP tmp2, TEMP box); Making an input TEMP will crash inside C2 when doing register allocation. 
assert(opcnt < numopnds) failed: Accessing non-existent operand V [libjvm.dylib+0xcc650c] MachNode::in_RegMask(unsigned int) const+0x1f0 V [libjvm.dylib+0x3d136c] PhaseChaitin::gather_lrg_masks(bool)+0x1130 V [libjvm.dylib+0x3cef9c] PhaseChaitin::Register_Allocate()+0x150 V [libjvm.dylib+0x4c3860] Compile::Code_Gen()+0x1f4 V [libjvm.dylib+0x4c17a4] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1388 V [libjvm.dylib+0x38abf0] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1e0 V [libjvm.dylib+0x4df028] CompileBroker::invoke_compiler_on_method(CompileTask*)+0x854 V [libjvm.dylib+0x4de46c] CompileBroker::compiler_thread_loop()+0x348 V [libjvm.dylib+0x8c10e4] JavaThread::thread_main_inner()+0x1dc V [libjvm.dylib+0x117f7f8] Thread::call_run()+0xf4 V [libjvm.dylib+0xe53724] thread_native_entry(Thread*)+0x138 C [libsystem_pthread.dylib+0x7034] _pthread_start+0x88 Maybe this can be resolved and support for TEMP input registers can be added. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18183#discussion_r1520969175 From epeter at openjdk.org Tue Mar 12 07:30:23 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 07:30:23 GMT Subject: RFR: 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence) [v6] In-Reply-To: <3F6Lk5fgyufLqS9TwuZrNkiIuSCRzuSHKB2BD0cf2Ws=.c4958e55-8384-451a-a505-a082c9c3ed7e@github.com> References: <3ELCyRCBgVgCAApIxalvVinXQLbSdv-UK8_aHgbWLhA=.f2908c00-cdbb-4b82-a36d-8a1a21f2647b@github.com> <3F6Lk5fgyufLqS9TwuZrNkiIuSCRzuSHKB2BD0cf2Ws=.c4958e55-8384-451a-a505-a082c9c3ed7e@github.com> Message-ID: <42HQ9nFhr5bgo6gTtOms4H9H-C9K5FhoB0qrQFA1Hzo=.e6f06954-fcaa-4ed3-8865-092dbd5fb35a@github.com> On Wed, 28 Feb 2024 18:11:28 GMT, Vladimir Kozlov wrote: >> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 33 commits: >> >> - Merge branch 'master' into JDK-8309267 >> - Apply suggestions for comments by Vladimir >> - Update LoopArrayIndexComputeTest.java copyright year >> - Update src/hotspot/share/opto/superword.cpp >> - SplitStatus::Kind enum >> - SplitTask::Kind enum >> - manual merge >> - more fixes for TestSplitPacks.java >> - fix some IR rules in TestSplitPacks.java >> - fix MulAddS2I >> - ... and 23 more: https://git.openjdk.org/jdk/compare/de428daf...77e3d47a > > Looks good. Thanks @vnkozlov @iwanowww @chhagedorn for the reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/17848#issuecomment-1990939531 From epeter at openjdk.org Tue Mar 12 07:30:24 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 07:30:24 GMT Subject: Integrated: 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence) In-Reply-To: References: Message-ID: On Wed, 14 Feb 2024 15:10:18 GMT, Emanuel Peter wrote: > After `combine_pairs_to_longer_packs`, we sometimes create packs that are too long and cannot be vectorized. > There are multiple reasons for that: > > - A pack does not "match" with a use or def pack, and we need to split it. Example: split Z: > > X X X X Y Y Y Y > Z Z Z Z Z Z Z Z > > > - A pack is not implemented in the current size, but would be implemented for a smaller size. Some operations are not implemented at max vector size, and we just need to split every pack to the smaller size that is implemented. Or we have a pack of a non-power-of-2 size, and need to split it down into smaller sizes that are power-of-2. > > - Packs can have pack internal dependence. This dependence happens at a certain "distance". If we split a pack to be smaller than that "distance", then the internal dependence disappears, and we have the desired mutual independence. 
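The dependence-at-"distance" point above can be illustrated with a small Java loop (my own sketch, not taken from the PR or its tests): each iteration writes `a[i + 2]` from `a[i]`, so operations within a pack of width at most 2 are mutually independent, while a wider pack would group a producer with its own consumer and must be split.

```java
import java.util.Arrays;

public class PackDistance {
    // Loop with a pack-internal dependence at distance 2: a[i + 2] reads a[i].
    // Splitting packs down to width 2 (below the distance) removes the
    // internal dependence; an 8-wide pack over this loop body would not be
    // mutually independent.
    static int[] run(int n) {
        int[] a = new int[n];
        a[0] = 1;
        a[1] = 1;
        for (int i = 0; i < n - 2; i++) {
            a[i + 2] = a[i] + 1;
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(8))); // [1, 1, 2, 2, 3, 3, 4, 4]
    }
}
```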
Example: > https://github.com/openjdk/jdk/blob/9ee274b0480a6e8e399830fd40c34d99c5621c1b/test/hotspot/jtreg/compiler/loopopts/superword/TestSplitPacks.java#L657-L669 > > Note: Soon I will refactor the packset into a new `PackSet` class, and move the split / filter code there. > > **Further Work** > > [JDK-8309267](https://bugs.openjdk.org/browse/JDK-8309267) C2 SuperWord: some tests fail on KNL machines - fail to vectorize > The issue on KNL machines is that some operations only support vector widths that are smaller than the maximal vector length. Hence, we must split all other vectors that are uses/defs. This change here does exactly that, and so I will be able to put the `UseKNLSetting` in the IRFramework whitelist. I will do that in a separate RFE. This pull request has now been integrated. Changeset: 251347bd Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/251347bd7e589b51354a2318bfac0c71cd71bf5f Stats: 1265 lines in 5 files changed: 1203 ins; 23 del; 39 mod 8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence) Reviewed-by: kvn, vlivanov, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/17848 From chagedorn at openjdk.org Tue Mar 12 08:05:14 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 12 Mar 2024 08:05:14 GMT Subject: RFR: 8327423: C2 remove_main_post_loops: check if main-loop belongs to pre-loop, not just assert In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 15:56:38 GMT, Emanuel Peter wrote: > The assert was added in [JDK-8085832](https://bugs.openjdk.org/browse/JDK-8085832) (JDK9), by @rwestrel . And in [JDK-8297724](https://bugs.openjdk.org/browse/JDK-8297724) (JDK21), he made more empty loops be removed, and since then the attached regression test fails. 
> > ---------- > > **Problem** > > By the time we get to the assert, we already have had a series of Pre-Main-Post, unroll and empty-loop removal: > the PURPLE main and post loops are already previously removed as empty-loops. > > At the time of the assert, the graph looks like this: > ![image](https://github.com/openjdk/jdk/assets/32593061/cb36eda4-0684-4b79-8557-0fdd5973ab50) > > We are in `IdealLoopTree::remove_main_post_loops` with the PURPLE `298 CountedLoop` as the `cl` pre-loop. > > The loop-tree looks essentially like this: > > (rr) p _ltree_root->dump() > Loop: N0/N0 has_sfpt > Loop: N425/N431 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre sfpts={ 429 } > Loop: N298/N301 profile_predicated predicated counted [0,int),+1 (4 iters) pre > Loop: N200/N179 counted [int,100),+1 (2147483648 iters) main sfpts={ 171 } > Loop: N398/N404 counted [int,100),+1 (4 iters) post sfpts={ 402 } > > > This is basically: > > 415 pre orange > 298 pre PURPLE > 200 main orange > 398 post orange > > > From `298 pre PURPLE`, we try to find its main-loop, by looking at the `_next` info in the loop-tree. > There, we find `200 main orange`, it is a main-loop that still has a pre-loop... > ...but not the same pre-loop as `cl` -> the `assert` fires. > > It seems that we assume in the code, that we can check the `_next->_head`, and if: > 1) it is a main-loop and > 2) that main-loop still has a pre-loop > then the current pre-loop "cl" must be the pre-loop of that found main-loop, located via `_pre_from_main(main_head)`. > But this is NOT generally guaranteed by "PhaseIdealLoop::build_loop_tree". > > The loop-tree is correct here, and this is how it was arrived at: > "415 CountedLoop" (pre orange) is visited, and its body traversed. "427 If" is traversed. Now the path splits. > If we first took the "428 IfFalse" path, then we would visit "200 CountedLoop" (main orange), and "398 CountedLoop" (post orange) first. 
> But we instead take "432 IfTrue" first, and hence visit "298 CountedLoop" (pre PURPLE) first. > > So depending on what turn we take at this "427 If", we either get the order: > > > 415 pre orange > 298 pre PURPLE > 200 main orange > 398 post orange > > (the one we get, and assert with) > > OR > > > 415 pre orange > 200 main orange > 398 post orange > 298 pre PURPLE > > (assert would not tr... That looks good to me, too. test/hotspot/jtreg/compiler/loopopts/TestEmptyPreLoopForDifferentMainLoop.java line 31: > 29: * -XX:CompileCommand=compileonly,compiler.loopopts.TestEmptyPreLoopForDifferentMainLoop::test > 30: * compiler.loopopts.TestEmptyPreLoopForDifferentMainLoop > 31: * @run main/othervm compiler.loopopts.TestEmptyPreLoopForDifferentMainLoop Suggestion: * @run main compiler.loopopts.TestEmptyPreLoopForDifferentMainLoop ------------- Marked as reviewed by chagedorn (Reviewer). 
> > > @Test > @IR(phase = { CompilePhase.AFTER_PARSING }, counts = { IRNode.SUBTYPE_CHECK, "1" }) > @IR(phase = { CompilePhase.AFTER_MACRO_EXPANSION }, counts = { IRNode.CMP_P, "5", IRNode.LOAD_KLASS_OR_NKLASS, "2", IRNode.PARTIAL_SUBTYPE_CHECK, "1" }) > public static void test15(Object o) { > > > This change was only for testing, so I reverted back to the original code after the test. > > #### Execution Result > > Before the change: > > $ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java" > ... > Failed IR Rules (1) of Methods (1) > ---------------------------------- > 1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, ap > plyIfAnd={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\d+(\s){2}(SubTypeCheck.*)+(\s){2}===.*)" > - Failed comparison: [found] 1 = 11 [given] > - Matched node: > * 53 SubTypeCheck === _ 44 35 [[ 58 ]] profiled at: compiler.c2.irTests.ProfileAtTypeCheck::test15:5 !jvms: ProfileAtTypeCheck::test15 @ bci:5 (line 399) > > > After the change: > > $ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java" > ... 
> Failed IR Rules (1) of Methods (1) > ---------------------------------- > 1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Cons... Looks good and trivial. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18181#pullrequestreview-1930353634 From galder at openjdk.org Tue Mar 12 08:22:14 2024 From: galder at openjdk.org (Galder Zamarreño) Date: Tue, 12 Mar 2024 08:22:14 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v6] In-Reply-To: <7-xJ8ujbaK_K90zgAFMgdWGkpnN6u8o088Wdt-YCh88=.230e0470-32e0-4da6-a185-65682d4713bb@github.com> References: <7-xJ8ujbaK_K90zgAFMgdWGkpnN6u8o088Wdt-YCh88=.230e0470-32e0-4da6-a185-65682d4713bb@github.com> Message-ID: On Fri, 8 Mar 2024 01:42:03 GMT, Dean Long wrote: > Your front-end changes require back-end changes, which are only implemented for x86 and aarch64. So you need a way to disable this for other platforms, or port the fix to all platforms. Minimizing the amount of platform-specific code required would also help. I'm struggling to understand what it is you think is missing in the PR. I have added the following 2 sections in such a way that they only trigger in x86 and aarch64. 
See [here](https://github.com/openjdk/jdk/pull/17667/files#diff-737789206706361d06d1f120e10272b62bcfdb556e8e73693f94ec87f2a6b369R238) and [here](https://github.com/openjdk/jdk/pull/17667/files#diff-e6f3ae4492965efd0d73c3f31073ec8b77e020740b009f92312658bac1e5f978R356), and as far as I understand it, that's enough to address your concerns. Please let me know if there is something I might have missed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17667#issuecomment-1991014592 From epeter at openjdk.org Tue Mar 12 08:24:29 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 08:24:29 GMT Subject: RFR: 8327423: C2 remove_main_post_loops: check if main-loop belongs to pre-loop, not just assert [v2] In-Reply-To: References: Message-ID: > The assert was added in [JDK-8085832](https://bugs.openjdk.org/browse/JDK-8085832) (JDK9), by @rwestrel . And in [JDK-8297724](https://bugs.openjdk.org/browse/JDK-8297724) (JDK21), he made more empty loops be removed, and since then the attached regression test fails. > > ---------- > > **Problem** > > By the time we get to the assert, we already have had a series of Pre-Main-Post, unroll and empty-loop removal: > the PURPLE main and post loops are already previously removed as empty-loops. > > At the time of the assert, the graph looks like this: > ![image](https://github.com/openjdk/jdk/assets/32593061/cb36eda4-0684-4b79-8557-0fdd5973ab50) > > We are in `IdealLoopTree::remove_main_post_loops` with the PURPLE `298 CountedLoop` as the `cl` pre-loop. 
> > The loop-tree looks essentially like this: > > (rr) p _ltree_root->dump() > Loop: N0/N0 has_sfpt > Loop: N425/N431 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre sfpts={ 429 } > Loop: N298/N301 profile_predicated predicated counted [0,int),+1 (4 iters) pre > Loop: N200/N179 counted [int,100),+1 (2147483648 iters) main sfpts={ 171 } > Loop: N398/N404 counted [int,100),+1 (4 iters) post sfpts={ 402 } > > > This is basically: > > 415 pre orange > 298 pre PURPLE > 200 main orange > 398 post orange > > > From `298 pre PURPLE`, we try to find its main-loop, by looking at the `_next` info in the loop-tree. > There, we find `200 main orange`, it is a main-loop that still has a pre-loop... > ...but not the same pre-loop as `cl` -> the `assert` fires. > > It seems that we assume in the code, that we can check the `_next->_head`, and if: > 1) it is a main-loop and > 2) that main-loop still has a pre-loop > then the current pre-loop "cl" must be the pre-loop of that found main-loop, located via `_pre_from_main(main_head)`. > But this is NOT generally guaranteed by "PhaseIdealLoop::build_loop_tree". > > The loop-tree is correct here, and this is how it was arrived at: > "415 CountedLoop" (pre orange) is visited, and its body traversed. "427 If" is traversed. Now the path splits. > If we first took the "428 IfFalse" path, then we would visit "200 CountedLoop" (main orange), and "398 CountedLoop" (post orange) first. > But we instead take "432 IfTrue" first, and hence visit "298 CountedLoop" (pre PURPLE) first. > > So depending on what turn we take at this "427 If", we either get the order: > > > 415 pre orange > 298 pre PURPLE > 200 main orange > 398 post orange > > (the one we get, and assert with) > > OR > > > 415 pre orange > 200 main orange > 398 post orange > 298 pre PURPLE > > (assert would not tr... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/loopopts/TestEmptyPreLoopForDifferentMainLoop.java Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18200/files - new: https://git.openjdk.org/jdk/pull/18200/files/96116022..81c84dda Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18200&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18200&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18200.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18200/head:pull/18200 PR: https://git.openjdk.org/jdk/pull/18200 From roland at openjdk.org Tue Mar 12 08:37:12 2024 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 12 Mar 2024 08:37:12 GMT Subject: RFR: 8327423: C2 remove_main_post_loops: check if main-loop belongs to pre-loop, not just assert [v2] In-Reply-To: References: Message-ID: <-XVbBb-rqIblr2ytrCprcBv7Kg_TW4lpgeE8ZACVIuw=.e77a67bd-d038-4a08-9969-4ce6d3e27309@github.com> On Tue, 12 Mar 2024 08:24:29 GMT, Emanuel Peter wrote: >> The assert was added in [JDK-8085832](https://bugs.openjdk.org/browse/JDK-8085832) (JDK9), by @rwestrel . And in [JDK-8297724](https://bugs.openjdk.org/browse/JDK-8297724) (JDK21), he made more empty loops be removed, and since then the attached regression test fails. >> >> ---------- >> >> **Problem** >> >> By the time we get to the assert, we already have had a series of Pre-Main-Post, unroll and empty-loop removal: >> the PURPLE main and post loops are already previously removed as empty-loops. >> >> At the time of the assert, the graph looks like this: >> ![image](https://github.com/openjdk/jdk/assets/32593061/cb36eda4-0684-4b79-8557-0fdd5973ab50) >> >> We are in `IdealLoopTree::remove_main_post_loops` with the PURPLE `298 CountedLoop` as the `cl` pre-loop. 
>> >> The loop-tree looks essentially like this: >> >> (rr) p _ltree_root->dump() >> Loop: N0/N0 has_sfpt >> Loop: N425/N431 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre sfpts={ 429 } >> Loop: N298/N301 profile_predicated predicated counted [0,int),+1 (4 iters) pre >> Loop: N200/N179 counted [int,100),+1 (2147483648 iters) main sfpts={ 171 } >> Loop: N398/N404 counted [int,100),+1 (4 iters) post sfpts={ 402 } >> >> >> This is basically: >> >> 415 pre orange >> 298 pre PURPLE >> 200 main orange >> 398 post orange >> >> >> From `298 pre PURPLE`, we try to find its main-loop, by looking at the `_next` info in the loop-tree. >> There, we find `200 main orange`, it is a main-loop that still has a pre-loop... >> ...but not the same pre-loop as `cl` -> the `assert` fires. >> >> It seems that we assume in the code, that we can check the `_next->_head`, and if: >> 1) it is a main-loop and >> 2) that main-loop still has a pre-loop >> then the current pre-loop "cl" must be the pre-loop of that found main-loop, located via `_pre_from_main(main_head)`. >> But this is NOT generally guaranteed by "PhaseIdealLoop::build_loop_tree". >> >> The loop-tree is correct here, and this is how it was arrived at: >> "415 CountedLoop" (pre orange) is visited, and its body traversed. "427 If" is traversed. Now the path splits. >> If we first took the "428 IfFalse" path, then we would visit "200 CountedLoop" (main orange), and "398 CountedLoop" (post orange) first. >> But we instead take "432 IfTrue" first, and hence visit "298 CountedLoop" (pre PURPLE) first. >> >> So depending on what turn we take at this "427 If", we either get the order: >> >> >> 415 pre orange >> 298 pre PURPLE >> 200 main orange >> 398 post orange >> >> (the one w... 
> > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/compiler/loopopts/TestEmptyPreLoopForDifferentMainLoop.java > > Co-authored-by: Christian Hagedorn Looks good to me. ------------- Marked as reviewed by roland (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18200#pullrequestreview-1930405522 From yzheng at openjdk.org Tue Mar 12 10:49:20 2024 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 12 Mar 2024 10:49:20 GMT Subject: RFR: 8327964: Simplify BigInteger.implMultiplyToLen intrinsic Message-ID: Moving array construction within BigInteger.implMultiplyToLen intrinsic candidate to its caller simplifies the intrinsic implementation in JIT compiler. ------------- Commit messages: - Simplify BigInteger.implMultiplyToLen intrinsic Changes: https://git.openjdk.org/jdk/pull/18226/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18226&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327964 Stats: 53 lines in 2 files changed: 4 ins; 49 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18226.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18226/head:pull/18226 PR: https://git.openjdk.org/jdk/pull/18226 From shade at openjdk.org Tue Mar 12 12:02:15 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 12 Mar 2024 12:02:15 GMT Subject: RFR: 8327361: Update some comments after JDK-8139457 [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 11:41:20 GMT, Roman Kennke wrote: >> A follow-up to [8139457: Relax alignment of array elements](https://bugs.openjdk.org/browse/JDK-8139457) ([PR](https://git.openjdk.org/jdk/pull/11044)), some comments need an update. > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > RISCV changes Marked as reviewed by shade (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/18120#pullrequestreview-1930857425 From rkennke at openjdk.org Tue Mar 12 12:10:21 2024 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 12 Mar 2024 12:10:21 GMT Subject: RFR: 8327361: Update some comments after JDK-8139457 [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 11:41:20 GMT, Roman Kennke wrote: >> A follow-up to [8139457: Relax alignment of array elements](https://bugs.openjdk.org/browse/JDK-8139457) ([PR](https://git.openjdk.org/jdk/pull/11044)), some comments need an update. > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > RISCV changes Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/18120#issuecomment-1991500115 From rkennke at openjdk.org Tue Mar 12 12:10:22 2024 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 12 Mar 2024 12:10:22 GMT Subject: Integrated: 8327361: Update some comments after JDK-8139457 In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 11:06:48 GMT, Roman Kennke wrote: > A follow-up to [8139457: Relax alignment of array elements](https://bugs.openjdk.org/browse/JDK-8139457) ([PR](https://git.openjdk.org/jdk/pull/11044)), some comments need an update. This pull request has now been integrated. 
Changeset: 5056902e Author: Roman Kennke URL: https://git.openjdk.org/jdk/commit/5056902e767d7f8485f9ff54f26df725f437fb0b Stats: 18 lines in 3 files changed: 0 ins; 0 del; 18 mod 8327361: Update some comments after JDK-8139457 Reviewed-by: galder, shade ------------- PR: https://git.openjdk.org/jdk/pull/18120 From bkilambi at openjdk.org Tue Mar 12 14:49:15 2024 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 12 Mar 2024 14:49:15 GMT Subject: RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 08:16:11 GMT, Bhavana Kilambi wrote: >> Floating-point addition is non-associative, that is, adding floating-point elements in arbitrary order may produce different values. Specifically, the Vector API intentionally does not define the order of reduction, which allows platforms to generate more efficient code [1]. So we need a node to represent non strictly-ordered add-reduction for floating-point type in C2. >> >> To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value. >> >> With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D` on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. >> >> [AArch64] >> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. >> >> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. >> >> No effects on other platforms.
>> >> [Performance] >> FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit). >> >> ADDLanes >> >> Benchmark Before After Unit >> FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms >> >> >> Final code is as below: >> >> Before: >> ` fadda z17.s, p7/m, z17.s, z16.s >> ` >> After: >> >> faddp v17.4s, v21.4s, v21.4s >> faddp s18, v17.2s >> fadd s18, s18, s19 >> >> >> >> >> [Test] >> Full jtreg passed on AArch64 and x86. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 >> [2] https://bugs.openjdk.org/browse/JDK-8275275 >> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments for changes in backend rules and code style Can I please ask for more reviews for this PR? Thank you in advance! ------------- PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-1991824113 From duke at openjdk.org Tue Mar 12 15:41:31 2024 From: duke at openjdk.org (Oussama Louati) Date: Tue, 12 Mar 2024 15:41:31 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v9] In-Reply-To: References: Message-ID: > Completion of the first version of the migration for several tests. > > These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: > > - Generate constant pool entries filled with method handles and method types. > - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. > - Produce many invokedynamic instructions with a specific constant pool entry. 
Oussama Louati has updated the pull request incrementally with one additional commit since the last revision: Use ClassFile to get AccessFlags ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17834/files - new: https://git.openjdk.org/jdk/pull/17834/files/be3f49b6..7056d444 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=07-08 Stats: 34 lines in 12 files changed: 15 ins; 0 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/17834.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834 PR: https://git.openjdk.org/jdk/pull/17834 From dchuyko at openjdk.org Tue Mar 12 15:53:39 2024 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Tue, 12 Mar 2024 15:53:39 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v29] In-Reply-To: References: Message-ID: > Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack. > > A matching directive will be applied at method compilation time when such compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug, issues such a directive but this does not affect the application behavior. 
In such a case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. > > It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypassing inlined methods). > > The natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that, we can try to re-compile the method, letting the compile broker perform it taking the new directives stack into account. Re-compilation helps to prevent hot methods from executing in the interpreter. > > A new flag `-r` has been introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and puts methods that have any active non-default matching compiler directives up for re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction as to which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. > > In addition, a new diagnostic command `Compiler.replace_directives` has been added for ... Dmitry Chuyko has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 47 commits: - Resolved master conflicts - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - ... and 37 more: https://git.openjdk.org/jdk/compare/782206bc...ff39ac12 ------------- Changes: https://git.openjdk.org/jdk/pull/14111/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14111&range=28 Stats: 381 lines in 15 files changed: 348 ins; 3 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/14111.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14111/head:pull/14111 PR: https://git.openjdk.org/jdk/pull/14111 From duke at openjdk.org Tue Mar 12 16:02:23 2024 From: duke at openjdk.org (Tom Shull) Date: Tue, 12 Mar 2024 16:02:23 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v13] In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 18:56:07 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs. >> >> This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm) >> >> This PR shows up to 19x speedup on buffer sizes of 1MB.
> > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > make vpmadd52l/hq generic src/hotspot/cpu/x86/vm_version_x86.cpp line 312: > 310: __ lea(rsi, Address(rbp, in_bytes(VM_Version::sef_cpuid7_ecx1_offset()))); > 311: __ movl(Address(rsi, 0), rax); > 312: __ movl(Address(rsi, 4), rbx); Hi @vamsi-parasa. I believe this code has a bug in it. Here you are copying back all four registers; however, within https://github.com/openjdk/jdk/blob/782206bc97dc6ae953b0c3ce01f8b6edab4ad30b/src/hotspot/cpu/x86/vm_version_x86.hpp#L468 you only created one field. Can you please open up a JBS issue to fix this? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1521736242 From epeter at openjdk.org Tue Mar 12 16:09:14 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 16:09:14 GMT Subject: RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 08:16:11 GMT, Bhavana Kilambi wrote: >> Floating-point addition is non-associative, that is, adding floating-point elements in arbitrary order may produce different values. Specifically, the Vector API intentionally does not define the order of reduction, which allows platforms to generate more efficient code [1]. So we need a node to represent non strictly-ordered add-reduction for floating-point type in C2. >> >> To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value.
>> >> With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D` on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. >> >> [AArch64] >> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. >> >> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. >> >> No effects on other platforms. >> >> [Performance] >> FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit). >> >> ADDLanes >> >> Benchmark Before After Unit >> FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms >> >> >> Final code is as below: >> >> Before: >> ` fadda z17.s, p7/m, z17.s, z16.s >> ` >> After: >> >> faddp v17.4s, v21.4s, v21.4s >> faddp s18, v17.2s >> fadd s18, s18, s19 >> >> >> >> >> [Test] >> Full jtreg passed on AArch64 and x86. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 >> [2] https://bugs.openjdk.org/browse/JDK-8275275 >> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments for changes in backend rules and code style Looks interesting! Will have a look at it.
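The non-associativity motivating this flag is easy to reproduce in plain Java. The sketch below is illustrative only (not from the patch; the class and method names are made up): it contrasts a strictly-ordered sequential sum, matching the semantics `fadda` provides, with a pairwise sum like the `faddp`-based sequence above.

```java
public class FloatReductionOrder {
    // Strictly-ordered reduction: accumulate left to right.
    static float strictSum(float[] v) {
        float acc = 0.0f;
        for (float f : v) {
            acc += f;
        }
        return acc;
    }

    // Pairwise reduction over four lanes, as a faddp-based sequence might compute it.
    static float pairwiseSum(float[] v) {
        return (v[0] + v[1]) + (v[2] + v[3]);
    }

    public static void main(String[] args) {
        float[] v = {1.0f, 1e8f, -1e8f, 1.0f};
        // Sequential: the leading 1.0f is absorbed into 1e8f, the trailing one survives.
        System.out.println(strictSum(v));   // 1.0
        // Pairwise: both 1.0f terms are absorbed, the lanes cancel exactly.
        System.out.println(pairwiseSum(v)); // 0.0
    }
}
```

A strictly-ordered reduction must reproduce the sequential result; the non-strict flag lets the compiler pick whichever association is fastest, which is exactly why the Vector API leaves the order unspecified.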
------------- PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-1992017586 From roland at openjdk.org Tue Mar 12 16:13:14 2024 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 12 Mar 2024 16:13:14 GMT Subject: RFR: 8323972: C2 compilation fails with assert(!x->as_Loop()->is_loop_nest_inner_loop()) failed: loop was transformed In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 14:50:11 GMT, Christian Hagedorn wrote: >> Thanks for reviewing this. >> >>> The fix idea looks reasonable to me. I have two questions: >>> >>> * Do we really need to pin the `CastII` here? We have not pinned the `ConvL2I` before. And here I think we just want to ensure that the type is not lost. >> >> I think it's good practice to set the control of a cast node. It probably doesn't make much of a difference here but we had so many issues with cast nodes that not setting control on cast makes me nervous now. >> >>> * Related to the first question, could we just use a normal dependency instead? >> >> The problem with a normal dependency is that initially the cast and its non transformed input have the same types. So, there is a chance the cast is processed by igvn before its input changes and if that happens, the cast would then be removed. >> >>> I was also wondering if we should try to improve the type of `ConvL2I` and of `Add/Sub` (and possibly also `Mul`) nodes in general? For `ConvL2I`, we could set a better type if we know that `(int)lo <= (int)hi` and `abs(hi - lo) <= 2^32`. We still have a problem to set a better type if we have a narrow range of inputs that includes `min` and `max` (e.g. `min+1, min, max, max-1`). In this case, `ConvL2I` just uses `int` as type. Then we could go a step further and do the same type optimization for `Add/Sub` nodes by directly looking through a convert/cast node at the input type. 
The resulting `Add/Sub` range could maybe be represented by something better than `int`: >>> >>> Example: input type to `ConvL2I`: `[2147483647L, 2147483648L]` -> type of `ConvL2I` is `int` since we cannot represent "`[max_int, min_int]`" with two intervals otherwise. `AddI` = `ConvL2I` + 2 -> type could be improved to `[min_int+1,min_int+2]`. >>> >>> But that might exceed the scope of this fix. Going with `CastII` for now seems to be the least risky. >> >> I thought about that too (I didn't go as far as you did though) and my conclusion is that the change I propose should be more robust (what if the improved type computation still misses some cases that we later find are required) and less risky. > >> I think it's good practice to set the control of a cast node. It probably doesn't make much of a difference here but we had so many issues with cast nodes that not setting control on cast makes me nervous now. > > That is indeed a general problem. The situation certainly got better by removing the code that optimized cast nodes that were pinned at If Projections (https://github.com/openjdk/jdk/commit/7766785098816cfcdae3479540cdc866c1ed18ad). By pinning the casts now, you probably want to prevent the cast nodes from being pushed through nodes such that they float "too high", causing unforeseeable data graph folding while control does not? > >> The problem with a normal dependency is that initially the cast and its non transformed input have the same types. So, there is a chance the cast is processed by igvn before its input changes, and if that happens, the cast would then be removed. > > I see, thanks for the explanation. Then it makes sense to keep the cast node no matter what. > >> I thought about that too (I didn't go as far as you did though) and my conclusion is that the change I propose should be more robust (what if the improved type computation still misses some cases that we later find are required) and less risky. > > I agree, this fix should use casts.
Would be interesting to follow this idea in a separate RFE. > That is indeed a general problem. The situation certainly got better by removing the code that optimized cast nodes that were pinned at If Projections ([7766785](https://github.com/openjdk/jdk/commit/7766785098816cfcdae3479540cdc866c1ed18ad)). By pinning the casts now, you probably want to prevent the cast nodes to be pushed through nodes such that it floats "too high" and causing unforeseenable data graph folding while control is not? Something like that. I don't see how things could go wrong in this particular case so, quite possibly, the control input is useless. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17965#discussion_r1521752781 From epeter at openjdk.org Tue Mar 12 16:25:15 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 16:25:15 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v2] In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 16:52:07 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. > > Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: > > fix test by adding the missing inversion > > also excluding negative values for unsigned comparison Looks like a reasonable idea. Running tests now. Will review afterwards. 
src/hotspot/share/opto/subnode.cpp line 1812: > 1810: int cop = cmp->Opcode(); > 1811: Node *cmp1 = cmp->in(1); > 1812: Node *cmp2 = cmp->in(2); Suggestion: Node* cmp1 = cmp->in(1); Node* cmp2 = cmp->in(2); ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18198#pullrequestreview-1931549333 PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1521763748 From epeter at openjdk.org Tue Mar 12 16:25:16 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 16:25:16 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v2] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 16:18:04 GMT, Emanuel Peter wrote: >> Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: >> >> fix test by adding the missing inversion >> >> also excluding negative values for unsigned comparison > > src/hotspot/share/opto/subnode.cpp line 1812: > >> 1810: int cop = cmp->Opcode(); >> 1811: Node *cmp1 = cmp->in(1); >> 1812: Node *cmp2 = cmp->in(2); > > Suggestion: > > Node* cmp1 = cmp->in(1); > Node* cmp2 = cmp->in(2); Ah, just like @jaskarth already said ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1521764243 From epeter at openjdk.org Tue Mar 12 16:56:19 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 16:56:19 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v3] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 16:51:27 GMT, Emanuel Peter wrote: >> Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains five additional commits since the last revision: >> >> - Merge branch 'master' into round-v-exhaustive-tests >> - fix issue >> - mv tests >> - use IR framework to construct the random tests >> - Initial commit > > test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 124: > >> 122: bits = bits | (1 << 63); >> 123: input[ei_idx*2+1] = Double.longBitsToDouble(bits); >> 124: } > > Why do all this complicated stuff, and not just pick a random `long`, and convert it to double with `Double.longBitsToDouble`? Does this ever generate things like `+0, -0, infty, NaN` etc? > test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 134: > >> 132: for (int sign = 0; sign < 2; sign++) { >> 133: int idx = ei_idx * 2 + sign; >> 134: if (res[idx] != Math.round(input[idx])) { > > Is it ok to use `Math.round` here? What if we compile it, and its computation is wrong in the compilation? This direct comparison tells me that you are not testing `NaN`s... ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521817370 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521817764 From epeter at openjdk.org Tue Mar 12 16:56:19 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 16:56:19 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v3] In-Reply-To: References: Message-ID: On Tue, 27 Feb 2024 20:59:14 GMT, Hamlin Li wrote: >> Hi, >> Can you have a look at this patch adding some tests for Math.round intrinsics? >> Thanks! >> >> ### FYI: >> During the development of RoundVF/RoundF, we faced issues which were only spotted by running tests exhaustively against the 32/64-bit range of int/long. >> It's helpful to add these exhaustive tests in the jdk for future possible usage, rather than building them every time they are needed.
>> Of course, we need to put it in `manual` mode, so it's not run when the `-automatic` jtreg option is specified, which I guess is the mode CI uses; please correct me if I'm assuming incorrectly. > > Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into round-v-exhaustive-tests > - fix issue > - mv tests > - use IR framework to construct the random tests > - Initial commit Thanks for changing to randomness. Thanks very much for your work! I have a few more requests/suggestions/questions :) test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 32: > 30: * @requires vm.compiler2.enabled > 31: * @requires (vm.cpu.features ~= ".*avx512dq.*" & os.simpleArch == "x64") | > 32: * os.simpleArch == "aarch64" We should be able to run the tests on all platforms, with any compiler. But you can add platform restrictions to the IR rules, with `applyIf...`. test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 35: > 33: * > 34: * @library /test/lib / > 35: * @run driver compiler.vectorization.TestRoundVectorDoubleRandom Suggestion: * @run main compiler.vectorization.TestRoundVectorDoubleRandom The driver setting apparently does not allow passing flags from the outside. test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 91: > 89: test_round(res, input); > 90: // skip test/verify when warming up > 91: if (runInfo.isWarmUp()) { Hmm. This means that if there is an OSR compilation during warmup, we would not verify. Are we ok with that?
test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 101: > 99: final int f_width = e_shift; > 100: final long f_bound = 1 << f_width; > 101: final int f_num = 256; Code style: you are generally not supposed to use under_score for variables, but camelCase, I think. test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 111: > 109: fis[fidx++] = 0; > 110: for (; fidx < f_num; fidx++) { > 111: fis[fidx] = ThreadLocalRandom.current().nextLong(f_bound); Why are you not using `rand`? test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 124: > 122: bits = bits | (1 << 63); > 123: input[ei_idx*2+1] = Double.longBitsToDouble(bits); > 124: } Why do all this complicated stuff, and not just pick a random `long`, and convert it to double with `Double.longBitsToDouble`? test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 134: > 132: for (int sign = 0; sign < 2; sign++) { > 133: int idx = ei_idx * 2 + sign; > 134: if (res[idx] != Math.round(input[idx])) { Is it ok to use `Math.round` here? What if we compile it, and its computation is wrong in the compilation? ------------- Changes requested by epeter (Reviewer).
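As background for the `NaN` remark above: `Math.round` is fully specified for special values, so a random or exhaustive test needs to exercise them deliberately. A small refresher (illustrative snippet, not part of the patch under review):

```java
public class RoundEdgeCases {
    public static void main(String[] args) {
        // Math.round(double) returns a long; special inputs have defined results.
        System.out.println(Math.round(Double.NaN));               // 0
        System.out.println(Math.round(Double.POSITIVE_INFINITY)); // Long.MAX_VALUE
        System.out.println(Math.round(Double.NEGATIVE_INFINITY)); // Long.MIN_VALUE
        System.out.println(Math.round(-0.5));                     // 0 (ties round toward positive infinity)
        System.out.println(Math.round(0.49999999999999994));      // 0 (a naive floor(x + 0.5) would give 1)
    }
}
```

Note that because `Math.round(double)` returns a `long`, the `res[idx] != Math.round(input[idx])` comparison in the quoted test is an exact integer comparison even for `NaN` inputs, assuming `res` holds `long` results.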
PR Review: https://git.openjdk.org/jdk/pull/17753#pullrequestreview-1931612345 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521800236 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521801642 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521807957 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521809457 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521813203 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521815343 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1521816420 From mli at openjdk.org Tue Mar 12 17:15:39 2024 From: mli at openjdk.org (Hamlin Li) Date: Tue, 12 Mar 2024 17:15:39 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v4] In-Reply-To: References: Message-ID: > Hi, > Can you help to review the patch to add support for some vector intrinsics? > Also complement various tests on riscv. > Thanks. > > ## Test > test/hotspot/jtreg/compiler/vectorapi/ > test/hotspot/jtreg/compiler/vectorization/ Hamlin Li has updated the pull request incrementally with two additional commits since the last revision: - remove ucast from i/s/b to float - revert some chnage; remove effect(TEMP_DEF dst) for non-extending intrinsics ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18040/files - new: https://git.openjdk.org/jdk/pull/18040/files/646955f0..cc43650b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=02-03 Stats: 93 lines in 1 file changed: 23 ins; 50 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/18040.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18040/head:pull/18040 PR: https://git.openjdk.org/jdk/pull/18040 From mli at openjdk.org Tue Mar 12 17:15:40 2024 From: mli at openjdk.org (Hamlin Li) Date: Tue, 12 Mar 2024 17:15:40 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X 
[v3] In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 07:42:07 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> fix typo > > src/hotspot/cpu/riscv/riscv_v.ad line 3220: > >> 3218: ins_encode %{ >> 3219: BasicType bt = Matcher::vector_element_basic_type(this); >> 3220: if (is_floating_point_type(bt)) { > > Could `bt` (the vector element basic type) be floating point type for `VectorUCastB2X` node? I see our aarch64 counterpart has this assertion: `assert(bt == T_SHORT || bt == T_INT || bt == T_LONG, "must be");` [1]. Same question for `VectorUCastS2X` and `VectorUCastI2X` nodes. > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L3752 Yeh, seems it's not, and vector api does not have this operations either. Fixed, thanks for catching. > src/hotspot/cpu/riscv/riscv_v.ad line 3397: > >> 3395: predicate(Matcher::vector_element_basic_type(n) == T_FLOAT); >> 3396: match(Set dst (VectorCastL2X src)); >> 3397: effect(TEMP_DEF dst); > > I see you added `TEMP_DEF dst` for some existing instructs like this one here. Do we really need it? > I don't see such a need when reading the overlap constraints on vector operands from the RVV spec [1]: > > > A destination vector register group can overlap a source vector register group only if one of the following holds: > > The destination EEW equals the source EEW. > > The destination EEW is smaller than the source EEW and the overlap is in the lowest-numbered part of the source register group (e.g., when LMUL=1, vnsrl.wi v0, v0, 3 is legal, but a destination of v1 is not). > > The destination EEW is greater than the source EEW, the source EMUL is at least 1, and the overlap is in the highest-numbered part of the destination register group (e.g., when LMUL=8, vzext.vf4 v0, v6 is legal, but a source of v0, v2, or v4 is not). 
> > > [1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#sec-vec-operands You're right, thanks for sharing the information. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1521844532 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1521843212 From mli at openjdk.org Tue Mar 12 17:21:24 2024 From: mli at openjdk.org (Hamlin Li) Date: Tue, 12 Mar 2024 17:21:24 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v5] In-Reply-To: References: Message-ID: <8eua4Xmcp4X6a8a8mAithQ4UOyKYV7IgE3KWlkUOHXs=.50937123-aed8-4e57-9f1c-d6927c88eb87@github.com> > Hi, > Can you help to review the patch to add support for some vector intrinsics? > Also complement various tests on riscv. > Thanks. > > ## Test > test/hotspot/jtreg/compiler/vectorapi/ > test/hotspot/jtreg/compiler/vectorization/ Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: - merge master - remove ucast from i/s/b to float - revert some chnage; remove effect(TEMP_DEF dst) for non-extending intrinsics - fix typo - modify test config - clean code - add more tests - rearrange tests layout - merge master - Initial commit ------------- Changes: https://git.openjdk.org/jdk/pull/18040/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=04 Stats: 665 lines in 6 files changed: 636 ins; 11 del; 18 mod Patch: https://git.openjdk.org/jdk/pull/18040.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18040/head:pull/18040 PR: https://git.openjdk.org/jdk/pull/18040 From duke at openjdk.org Tue Mar 12 17:34:21 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 12 Mar 2024 17:34:21 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v13] In-Reply-To: References: Message-ID: <8Awj08UkB3CpLNbWQbQOgOHUPh0PSARYgOK83JDEt0I=.0f634cfa-ec4b-4d49-a849-726fbfb64703@github.com> On Tue, 12 Mar 2024 15:59:59 GMT, Tom Shull 
wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> make vpmadd52l/hq generic > > src/hotspot/cpu/x86/vm_version_x86.cpp line 312: > >> 310: __ lea(rsi, Address(rbp, in_bytes(VM_Version::sef_cpuid7_ecx1_offset()))); >> 311: __ movl(Address(rsi, 0), rax); >> 312: __ movl(Address(rsi, 4), rbx); > > Hi @vamsi-parasa. I believe this code has a bug in it. Here you are copying back all four registers; however, within https://github.com/openjdk/jdk/blob/782206bc97dc6ae953b0c3ce01f8b6edab4ad30b/src/hotspot/cpu/x86/vm_version_x86.hpp#L468 you only created one field. > > Can you please open up a JBS issue to fix this? Hi Tom (@teshull), Thank you for identifying the issue. Please see the JBS issue filed at https://bugs.openjdk.org/browse/JDK-8327999. Will float a new PR to fix this issue soon. Thanks, Vamsi ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1521869268 From chagedorn at openjdk.org Tue Mar 12 17:36:14 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 12 Mar 2024 17:36:14 GMT Subject: RFR: 8327693: C1: LIRGenerator::_instruction_for_operand is only read by assertion code In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 14:08:57 GMT, Denghui Dong wrote: > Hi, > > Please help review this change that moves _instruction_for_operand into ASSERT block since it is only read by assertion code in c1_LinearScan.cpp. > > Thanks Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/18170#pullrequestreview-1931804552 From epeter at openjdk.org Tue Mar 12 17:45:14 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 17:45:14 GMT Subject: RFR: 8323972: C2 compilation fails with assert(!x->as_Loop()->is_loop_nest_inner_loop()) failed: loop was transformed In-Reply-To: References: Message-ID: <_Qm0japYCaj72QGczOLYyKgmkaAA4P5AhG6QPfmd3Ys=.d0e57b13-4cb2-4f05-b902-e655d7f2a123@github.com> On Thu, 22 Feb 2024 14:36:52 GMT, Roland Westrelin wrote: > Long counted loop are transformed into a loop nest of 2 "regular" > loops and in a subsequent loop opts round, the inner loop is > transformed into a counted loop. The limit for the inner loop is set, > when the loop nest is created, so it's expected there's no need for a > loop limit check when the counted loop is created. The assert fires > because, when the counted loop is created, it is found that it needs a > loop limit check. The reason for that is that the limit is > transformed, between nest creation and counted loop creation, in a way > that the range of values of the inner loop's limit becomes > unknown. The limit when the nest is created is: > > > 111 ConL === 0 [[ 112 ]] #long:-9223372034707292158 > 106 Phi === 105 20 94 [[ 112 ]] #long:9223372034707292160..9223372034707292164:www !orig=72 !jvms: TestInaccurateInnerLoopLimit::test @ bci:12 (line 40) > 112 AddL === _ 106 111 [[ 122 ]] !orig=[110] > 122 ConvL2I === _ 112 [[ ]] #int > > > The type of 122 is `2..6` but it is then transformed to: > > > 106 Phi === 105 20 154 [[ 191 130 137 ]] #long:9223372034707292160..9223372034707292164:www !orig=[72] !jvms: TestInaccurateInnerLoopLimit::test @ bci:12 (line 40) > 191 ConvL2I === _ 106 [[ 196 ]] #int > 195 ConI === 0 [[ 196 ]] #int:max-1 > 196 SubI === _ 195 191 [[ 201 127 ]] !orig=[123] > > > That is the `(ConvL2I (AddL ...))` is transformed into a `(SubI > (ConvL2I ))`. 
`ConvL2I` for an input that's out of the int range of > values returns TypeInt::INT and the bounds of the limit are lost. I > propose adding a `CastII` after the `ConvL2I` so the range of values > of the limit doesn't get lost. Looks reasonable, but these ad-hoc CastII also make me nervous. What worries me with adding such "Ad-Hoc" CastII nodes is that elsewhere a very similar computation may not have the same tight type. And then you have a tight type somewhere, and a loose type elsewhere. This is how we get the data-flow collapsing and the cfg not folding. @rwestrel please wait for our testing to complete, I just launched it. test/hotspot/jtreg/compiler/longcountedloops/TestInaccurateInnerLoopLimit.java line 40: > 38: > 39: public static void test() { > 40: for (long i = 9223372034707292164L; i > 9223372034707292158L; i += -2L) { } I'm always amazed at how such simple tests can fail. Is there any way we can improve the test coverage for Long loops? ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17965#pullrequestreview-1931732253 PR Comment: https://git.openjdk.org/jdk/pull/17965#issuecomment-1992220126 PR Review Comment: https://git.openjdk.org/jdk/pull/17965#discussion_r1521846724 From epeter at openjdk.org Tue Mar 12 17:54:15 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 17:54:15 GMT Subject: RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 08:16:11 GMT, Bhavana Kilambi wrote: >> Floating-point addition is non-associative, that is adding floating-point elements in arbitrary order may get different value. Specially, Vector API does not define the order of reduction intentionally, which allows platforms to generate more efficient codes [1]. So that needs a node to represent non strictly-ordered add-reduction for floating-point type in C2. 
>> >> To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value. >> >> With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. >> >> [AArch64] >> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. >> >> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. >> >> No effects on other platforms. >> >> [Performance] >> FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit). >> >> ADDLanes >> >> Benchmark Before After Unit >> FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms >> >> >> Final code is as below: >> >> Before: >> ` fadda z17.s, p7/m, z17.s, z16.s >> ` >> After: >> >> faddp v17.4s, v21.4s, v21.4s >> faddp s18, v17.2s >> fadd s18, s18, s19 >> >> >> >> >> [Test] >> Full jtreg passed on AArch64 and x86. 
>> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 >> [2] https://bugs.openjdk.org/browse/JDK-8275275 >> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments for changes in backend rules and code style A suggestion about naming: We now have a few synonyms: Unordered reduction non-strict order reduction associative reduction I think I introduced the "unordered" one. Not proud of it any more. I think we should probably use (non) associative everywhere. That is the technical/mathematical term. We can use synonyms in the comments to make the explanation more clear though. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-1992235715 From epeter at openjdk.org Tue Mar 12 18:02:14 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Mar 2024 18:02:14 GMT Subject: RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2] In-Reply-To: References: Message-ID: On Tue, 5 Mar 2024 08:16:11 GMT, Bhavana Kilambi wrote: >> Floating-point addition is non-associative, that is adding floating-point elements in arbitrary order may get different value. Specially, Vector API does not define the order of reduction intentionally, which allows platforms to generate more efficient codes [1]. So that needs a node to represent non strictly-ordered add-reduction for floating-point type in C2. >> >> To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. 
Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value. >> >> With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones. >> >> [AArch64] >> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2]. >> >> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`. >> >> No effects on other platforms. >> >> [Performance] >> FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit). >> >> ADDLanes >> >> Benchmark Before After Unit >> FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms >> >> >> Final code is as below: >> >> Before: >> ` fadda z17.s, p7/m, z17.s, z16.s >> ` >> After: >> >> faddp v17.4s, v21.4s, v21.4s >> faddp s18, v17.2s >> fadd s18, s18, s19 >> >> >> >> >> [Test] >> Full jtreg passed on AArch64 and x86. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 >> [2] https://bugs.openjdk.org/browse/JDK-8275275 >> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316 > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments for changes in backend rules and code style On a more visionary note: We should make sure that the actual `ReductionNode` gets moved out of the loop, when possible. [JDK-8309647](https://bugs.openjdk.org/browse/JDK-8309647) [Vector API] Move Reduction outside loop when possible We have an RFE for that, I have not yet have time or priority for it. 
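To illustrate why the strict/non-strict distinction matters at all: floating-point addition is genuinely non-associative, so reordering the lanes of a reduction can change the result. A minimal plain-Java sketch (independent of the Vector API; the values are chosen only for illustration):

```java
public class FpAssociativity {
    public static void main(String[] args) {
        double a = 1e16, b = -1e16, c = 1.0;
        // Strict left-to-right order, as a strictly-ordered reduction computes:
        double ordered = (a + b) + c;   // (1e16 - 1e16) + 1.0 == 1.0
        // Reordered, as a non-strict (associative) reduction is allowed to compute:
        double reordered = a + (b + c); // (-1e16 + 1.0) rounds back to -1e16, so the sum is 0.0
        System.out.println(ordered);    // 1.0
        System.out.println(reordered);  // 0.0
    }
}
```

This is exactly why C2 needs to know, per `AddReductionVF/D` node, whether reordering is permitted.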
Of course the user can already move it out of the loop themselves. If the `ReductionNode` is out of the loop, then you usually just have a very cheap accumulation inside the loop, a `MulVF` for example. That would certainly be cheap enough to allow vectorization. So in that case, your optimization here should not just affect SVE, but also NEON and x86. Why does you patch not do anything for x86? I guess x86 AD-files have no float/double reduction for the associative case, only the non-associative (strict order). But I think it would be easy to implement, just take the code used for int/long etc reductions. What do you think about that? I'm not saying you have to do it all, or even in this RFE. I'd just like to hear what is the bigger plan, and why you restrict things to much to SVE. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-1992247959 From dlong at openjdk.org Tue Mar 12 18:19:16 2024 From: dlong at openjdk.org (Dean Long) Date: Tue, 12 Mar 2024 18:19:16 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v6] In-Reply-To: References: Message-ID: On Mon, 4 Mar 2024 09:12:12 GMT, Galder Zamarre?o wrote: >> Adding C1 intrinsic for primitive array clone invocations for aarch64 and x86 architectures. >> >> The intrinsic includes a change to avoid zeroing the newly allocated array because its contents are copied over within the same intrinsic with arraycopy. This means that the performance of primitive array clone exceeds that of primitive array copy. As an example, here are the microbenchmark results on darwin/aarch64: >> >> >> $ make test TEST="micro:java.lang.ArrayClone" MICRO="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 3.476 ? 0.018 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 3.740 ? 0.017 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 7.124 ? 
0.010 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 39.301 ? 0.106 ns/op >> ArrayClone.byteClone 0 avgt 15 3.478 ? 0.008 ns/op >> ArrayClone.byteClone 10 avgt 15 3.562 ? 0.007 ns/op >> ArrayClone.byteClone 100 avgt 15 5.888 ? 0.206 ns/op >> ArrayClone.byteClone 1000 avgt 15 25.762 ? 0.203 ns/op >> ArrayClone.intArraycopy 0 avgt 15 3.199 ? 0.016 ns/op >> ArrayClone.intArraycopy 10 avgt 15 4.521 ? 0.008 ns/op >> ArrayClone.intArraycopy 100 avgt 15 17.429 ? 0.039 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 178.432 ? 0.777 ns/op >> ArrayClone.intClone 0 avgt 15 3.406 ? 0.016 ns/op >> ArrayClone.intClone 10 avgt 15 4.272 ? 0.006 ns/op >> ArrayClone.intClone 100 avgt 15 13.110 ? 0.122 ns/op >> ArrayClone.intClone 1000 avgt 15 113.196 ? 13.400 ns/op >> >> >> It also includes an optimization to avoid instantiating the array copy stub in scenarios like this. >> >> I run hotspot compiler tests successfully limiting them to C1 compilation darwin/aarch64, linux/x86_64 and linux/686. E.g. >> >> >> $ make test TEST="hotspot_compiler" JTREG="JAVA_OPTIONS=-XX:TieredStopAtLevel=1" >> ... >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:hotspot_compiler 1234 1234 0 0 >> >> >> One question I had is what to do about non-primitive object arrays, see my [question](https://bugs.openjdk.org/browse/JDK-8302850?focusedId=14634879&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14634879) on the issue. @cl4es any thoughts? >> >>... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits: > > - Merge branch 'master' into topic.0131.c1-array-clone > - Reserve necessary frame map space for clone use cases > - 8302850: C1 primitive array clone intrinsic in graph > > * Combine array length, new type array and arraycopy for clone in c1 graph. > * Add OmitCheckFlags to skip arraycopy checks. > * Instantiate ArrayCopyStub only if necessary. 
> * Avoid zeroing newly created arrays for clone. > * Add array null after c1 clone compilation test. > * Pass force reexecute to intrinsic via value stack. > This is needed to be able to deoptimize correctly this intrinsic. > * When new type array or array copy are used for the clone intrinsic, > their state needs to be based on the state before for deoptimization > to work as expected. > - Revert "8302850: Primitive array copy C1 intrinsic for aarch64 and x86" > > This reverts commit fe5d916724614391a685bbef58ea939c84197d07. > - 8302850: Link code emit infos for null check and alloc array > - 8302850: Null check array before getting its length > > * Added a jtreg test to verify the null check works. > Without the fix this test fails with a SEGV crash. > - 8302850: Force reexecuting clone in case of a deoptimization > > * Copy state including locals for clone > so that reexecution works as expected. > - 8302850: Avoid instantiating array copy stub for clone use cases > - 8302850: Primitive array copy C1 intrinsic for aarch64 and x86 > > * Clone calls that involve Phi nodes are not supported. > * Add unimplemented stubs for other platforms. IR expansion in append_alloc_array_copy() looks unconditional. What's going to happen on platforms with no back-end support? 
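For context on what the clone intrinsic under discussion covers: a primitive array clone is observably equivalent to allocating a same-length array and copying every element, which is why zeroing the fresh allocation can be elided. A plain-Java sketch of that equivalence (this is the semantics, not the C1 IR itself):

```java
import java.util.Arrays;

public class CloneEquivalence {
    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4};
        // What the intrinsic emits conceptually: allocate (zeroing can be elided,
        // since every element is overwritten below) + copy all elements.
        int[] viaCopy = new int[src.length];
        System.arraycopy(src, 0, viaCopy, 0, src.length);
        int[] viaClone = src.clone();
        System.out.println(Arrays.equals(viaCopy, viaClone)); // true
        System.out.println(viaClone != src);                  // true: a distinct array object
    }
}
```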
------------- PR Comment: https://git.openjdk.org/jdk/pull/17667#issuecomment-1992277510 From kxu at openjdk.org Tue Mar 12 18:43:47 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Tue, 12 Mar 2024 18:43:47 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v3] In-Reply-To: References: Message-ID: <6mb_BOei2bIRzPvulo4SkaWGa9EXjiBIFfKTIAAWdCU=.86b2b6f0-7e06-4b4d-9881-593577b43184@github.com> > This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) > > Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. > > New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. 
Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: modification per code review suggestions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18198/files - new: https://git.openjdk.org/jdk/pull/18198/files/17a9dc37..06b7da36 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=01-02 Stats: 8 lines in 2 files changed: 1 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/18198.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18198/head:pull/18198 PR: https://git.openjdk.org/jdk/pull/18198 From kxu at openjdk.org Tue Mar 12 18:43:47 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Tue, 12 Mar 2024 18:43:47 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v2] In-Reply-To: <7b1BIvQpmoLhSzWqQ7haDBTQU1NDuddEm1TK7AgWnwY=.0e5222cc-b20d-4a19-94db-9cad00c6dbff@github.com> References: <7b1BIvQpmoLhSzWqQ7haDBTQU1NDuddEm1TK7AgWnwY=.0e5222cc-b20d-4a19-94db-9cad00c6dbff@github.com> Message-ID: On Mon, 11 Mar 2024 21:13:21 GMT, Jasmine Karthikeyan wrote: >> Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: >> >> fix test by adding the missing inversion >> >> also excluding negative values for unsigned comparison > > I think the cleanup looks good! I have mostly stylistic suggestions here. Also, the copyright header in `subnode.cpp` should be updated to read 2024. Thanks @jaskarth and @eme64 for the review. 
I've pushed a new commit to address the following: - Updated license header year to 2024 - Explicit `nullptr` comparison - `Node* var` for pointer types - Test moved to `c2.irTests`, added `@bug` and `@summary` tags ------------- PR Comment: https://git.openjdk.org/jdk/pull/18198#issuecomment-1992315599 From duke at openjdk.org Tue Mar 12 19:15:21 2024 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 12 Mar 2024 19:15:21 GMT Subject: RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v13] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 15:59:59 GMT, Tom Shull wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> make vpmadd52l/hq generic > > src/hotspot/cpu/x86/vm_version_x86.cpp line 312: > >> 310: __ lea(rsi, Address(rbp, in_bytes(VM_Version::sef_cpuid7_ecx1_offset()))); >> 311: __ movl(Address(rsi, 0), rax); >> 312: __ movl(Address(rsi, 4), rbx); > > Hi @vamsi-parasa. I believe this code has a bug in it. Here you are copying back all four registers; however, within https://github.com/openjdk/jdk/blob/782206bc97dc6ae953b0c3ce01f8b6edab4ad30b/src/hotspot/cpu/x86/vm_version_x86.hpp#L468 you only created one field. > > Can you please open up a JBS issue to fix this? Hi Tom (@teshull), pls see the PR to fix this issue: https://github.com/openjdk/jdk/pull/18248 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1521996254 From shade at openjdk.org Tue Mar 12 19:38:22 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 12 Mar 2024 19:38:22 GMT Subject: RFR: 8325613: CTW: Stale method cleanup requires GC after Sweeper removal Message-ID: See more details in the bug. 
There is a double-whammy from two issues: a) Sweeper was removed, and now the cleanup work is done during GC, which does not really happen as CTW barely allocates anything; b) CTW calls for explicit deoptimization often, at which point CTW threads get mostly busy at spin-waiting-yielding for deopt epoch to move (that is why you see lots of `sys%`). (a) leads to stale methods buildup, which makes (b) progressively worse. This PR adds explicit GC calls to the CTW runner. Since CTW allocates and retains a little, those GCs are quite fast. I chose the threshold by running some CTW tests on my machines. I think we are pretty flat in the 25..100 region, so I chose the higher threshold for additional safety. This patch improves both CPU and wall times for CTW testing dramatically, as you can see from the logs below. It still does not recuperate completely to JDK 17 levels, but at least it is not regressing as badly. --- x86_64 EC2, applications/ctw/modules CTW jdk17u-dev: 4511.54s user 169.43s system 1209% cpu 6:27.07 total current mainline: 11678.13s user 8687.06s system 2299% cpu 14:45.62 total GC every 25 methods: 5050.83s user 670.38s system 1629% cpu 5:51.04 total GC every 50 methods: 4965.41s user 709.64s system 1670% cpu 5:39.77 total GC every 100 methods: 4997.34s user 782.12s system 1680% cpu 5:43.99 total GC every 200 methods: 5237.76s user 943.51s system 1788% cpu 5:45.59 total GC every 400 methods: 5851.24s user 1443.16s system 1914% cpu 6:20.99 total GC every 800 methods: 7010.06s user 2649.35s system 2079% cpu 7:44.48 total GC every 1600 methods: 9361.12s user 5616.84s system 2409% cpu 10:21.68 total --- Mac M1, applications/ctw/modules/java.base CTW jdk17u-dev: 171.93s user 25.33s system 157% cpu 2:05.34 total current mainline: 1128.69s user 349.46s system 249% cpu 9:52.51 total GC every 25 methods: 252.31s user 29.98s system 172% cpu 2:43.68 total GC every 50 methods: 232.53s user 28.49s system 170% cpu 2:32.69 total GC every 100 methods: 237.38s user 34.53s
system 169% cpu 2:40.54 total GC every 200 methods: 251.70s user 39.60s system 172% cpu 2:48.40 total GC every 400 methods: 271.50s user 42.55s system 185% cpu 2:49.66 total GC every 800 methods: 389.51s user 69.41s system 204% cpu 3:44.01 total GC every 1600 methods: 660.98s user 169.97s system 229% cpu 6:01.78 total ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/18249/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18249&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8325613 Stats: 26 lines in 2 files changed: 24 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18249.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18249/head:pull/18249 PR: https://git.openjdk.org/jdk/pull/18249 From dlong at openjdk.org Tue Mar 12 19:53:13 2024 From: dlong at openjdk.org (Dean Long) Date: Tue, 12 Mar 2024 19:53:13 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 09:26:48 GMT, Roberto Castañeda Lozano wrote: > This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`. > > Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015. > > #### Testing > > - tier1-7 (linux-aarch64 and macosx-aarch64) with `-XX:LockingMode=2`.
Marked as reviewed by dlong (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/18183#pullrequestreview-1932260072 From mli at openjdk.org Tue Mar 12 20:19:17 2024 From: mli at openjdk.org (Hamlin Li) Date: Tue, 12 Mar 2024 20:19:17 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v3] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 16:46:17 GMT, Emanuel Peter wrote: >> Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: >> >> - Merge branch 'master' into round-v-exhaustive-tests >> - fix issue >> - mv tests >> - use IR framework to construct the random tests >> - Initial commit > > test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 91: > >> 89: test_round(res, input); >> 90: // skip test/verify when warming up >> 91: if (runInfo.isWarmUp()) { > > Hmm. This means that if there is an OSR compilation during warmup, we would not verify. Are we ok with that? I'm not sure if it's necessary to verify that situation. But if we verify the result during warmup, it will take rather longer to finish the test. Please let me know if we need to verify during warmup.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1522060608 From mli at openjdk.org Tue Mar 12 20:19:19 2024 From: mli at openjdk.org (Hamlin Li) Date: Tue, 12 Mar 2024 20:19:19 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v3] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 16:52:40 GMT, Emanuel Peter wrote: >> test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 124: >> >>> 122: bits = bits | (1 << 63); >>> 123: input[ei_idx*2+1] = Double.longBitsToDouble(bits); >>> 124: } >> >> Why do all this complicated stuff, and not just pick a random `long` and convert it to a double with `Double.longBitsToDouble`? > > Does this ever generate things like `+0, -0, infty, NaN` etc? It's testing the following cases: 1. all of the `e` range, e.g. for double it's 11 bits, for float it's 8 bits 2. for `f` I add a special value `0` explicitly: `fis[fidx++] = 0;` 3. for sign, both `+` and `-` are tested. So, yes, it will test cases like `+/- 0, infty, NaN`.
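For readers following the sign/exponent/fraction discussion above: the special values fall directly out of the IEEE 754 binary64 layout, which `Double.longBitsToDouble` exposes. A small sketch (the constants are the standard field positions, not code taken from the test under review):

```java
public class DoubleEncodings {
    public static void main(String[] args) {
        // Layout: sign (1 bit) | exponent e (11 bits) | fraction f (52 bits)
        long expAllOnes = 0x7FFL << 52; // e all-ones encodes infinity (f == 0) or NaN (f != 0)
        System.out.println(Double.longBitsToDouble(expAllOnes));                    // Infinity
        System.out.println(Double.longBitsToDouble((1L << 63) | expAllOnes));       // -Infinity
        System.out.println(Double.isNaN(Double.longBitsToDouble(expAllOnes | 1L))); // true
        // Note the long shift for the sign bit: (1L << 63), not (1 << 63).
        System.out.println(Double.longBitsToDouble(1L << 63));                      // -0.0
    }
}
```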
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1522059691 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1522059293 From mli at openjdk.org Tue Mar 12 20:26:25 2024 From: mli at openjdk.org (Hamlin Li) Date: Tue, 12 Mar 2024 20:26:25 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v4] In-Reply-To: References: Message-ID: > HI, > Can you have a look at this patch adding some tests for Math.round instrinsics? > Thanks! > > ### FYI: > During the development of RoundVF/RoundF, we faced the issues which were only spotted by running test exhaustively against 32/64 bits range of int/long. > It's helpful to add these exhaustive tests in jdk for future possible usage, rather than build it everytime when needed. > Of course, we need to put it in `manual` mode, so it's not run when `-automatic` jtreg option is specified which I guess is the mode CI used, please correct me if I'm assume incorrectly. Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: refine code; fix bug ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17753/files - new: https://git.openjdk.org/jdk/pull/17753/files/7eeb3141..e1127c76 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17753&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17753&range=02-03 Stats: 51 lines in 2 files changed: 8 ins; 5 del; 38 mod Patch: https://git.openjdk.org/jdk/pull/17753.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17753/head:pull/17753 PR: https://git.openjdk.org/jdk/pull/17753 From mli at openjdk.org Tue Mar 12 20:29:13 2024 From: mli at openjdk.org (Hamlin Li) Date: Tue, 12 Mar 2024 20:29:13 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v3] In-Reply-To: References: Message-ID: <54MAbAxe9ilnn_NtGTt04-k1IOqhPlegUk4XlMjQDGc=.6762ca45-b6e6-44ab-aeb9-9e0159223df3@github.com> On Tue, 12 Mar 2024 16:54:03 GMT, Emanuel 
Peter wrote: > Thanks for changing to randomness. Thanks very much for your work! > > I have a few more requests/suggestions/questions :) Thanks for detailed reviewing and suggestion! :) I resolved some comments, and tried to answer some of your questions, please have a look again. Also have a question: currently I'm generating golden value in following way: @DontCompile long golden_round(double d) { return Math.round(d); } Will it make sure Math.round invocation here are the interpreter version? Or maybe it can be calling the intrinsic version? If that's the case, I think one way to resolve this issue is to copy the piece of library code of Math.round here, I see some existing test cases also use this way to get the golden value. How do you think about it? ------------- PR Comment: https://git.openjdk.org/jdk/pull/17753#issuecomment-1992519401 From jkarthikeyan at openjdk.org Tue Mar 12 22:13:14 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 12 Mar 2024 22:13:14 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v3] In-Reply-To: <6mb_BOei2bIRzPvulo4SkaWGa9EXjiBIFfKTIAAWdCU=.86b2b6f0-7e06-4b4d-9881-593577b43184@github.com> References: <6mb_BOei2bIRzPvulo4SkaWGa9EXjiBIFfKTIAAWdCU=.86b2b6f0-7e06-4b4d-9881-593577b43184@github.com> Message-ID: On Tue, 12 Mar 2024 18:43:47 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. 
> > Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: > > modification per code review suggestions Thanks for the update! Just one more thing from me. test/hotspot/jtreg/compiler/c2/irTests/TestBoolNodeGvn.java line 39: > 37: * @summary Refactor boolean node tautology transformations > 38: * @library /test/lib / > 39: * @run driver compiler.c2.TestBoolNodeGvn Suggestion: * @run driver compiler.c2.irTests.TestBoolNodeGvn Since the test's package changed, this'll need to be changed as well. ------------- PR Review: https://git.openjdk.org/jdk/pull/18198#pullrequestreview-1932781890 PR Review Comment: https://git.openjdk.org/jdk/pull/18198#discussion_r1522167995 From ddong at openjdk.org Wed Mar 13 00:03:17 2024 From: ddong at openjdk.org (Denghui Dong) Date: Wed, 13 Mar 2024 00:03:17 GMT Subject: RFR: 8327693: C1: LIRGenerator::_instruction_for_operand is only read by assertion code In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 14:08:57 GMT, Denghui Dong wrote: > Hi, > > Please help review this change that moves _instruction_for_operand into ASSERT block since it is only read by assertion code in c1_LinearScan.cpp. > > Thanks Thanks for the review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18170#issuecomment-1992757081 From ddong at openjdk.org Wed Mar 13 00:03:18 2024 From: ddong at openjdk.org (Denghui Dong) Date: Wed, 13 Mar 2024 00:03:18 GMT Subject: Integrated: 8327693: C1: LIRGenerator::_instruction_for_operand is only read by assertion code In-Reply-To: References: Message-ID: On Fri, 8 Mar 2024 14:08:57 GMT, Denghui Dong wrote: > Hi, > > Please help review this change that moves _instruction_for_operand into ASSERT block since it is only read by assertion code in c1_LinearScan.cpp. > > Thanks This pull request has now been integrated. 
Changeset: 5d4bfad1 Author: Denghui Dong URL: https://git.openjdk.org/jdk/commit/5d4bfad12b650b9f7c512a071830c58b8f1d020b Stats: 25 lines in 2 files changed: 12 ins; 9 del; 4 mod 8327693: C1: LIRGenerator::_instruction_for_operand is only read by assertion code Reviewed-by: gli, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/18170 From kxu at openjdk.org Wed Mar 13 02:05:39 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Wed, 13 Mar 2024 02:05:39 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v4] In-Reply-To: References: Message-ID: > This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) > > Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. > > New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. 
Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: update the package name for tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18198/files - new: https://git.openjdk.org/jdk/pull/18198/files/06b7da36..e2eb8bf9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=02-03 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18198.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18198/head:pull/18198 PR: https://git.openjdk.org/jdk/pull/18198 From kxu at openjdk.org Wed Mar 13 02:05:40 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Wed, 13 Mar 2024 02:05:40 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v3] In-Reply-To: <6mb_BOei2bIRzPvulo4SkaWGa9EXjiBIFfKTIAAWdCU=.86b2b6f0-7e06-4b4d-9881-593577b43184@github.com> References: <6mb_BOei2bIRzPvulo4SkaWGa9EXjiBIFfKTIAAWdCU=.86b2b6f0-7e06-4b4d-9881-593577b43184@github.com> Message-ID: On Tue, 12 Mar 2024 18:43:47 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. > > Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: > > modification per code review suggestions Oops. Package name updated. Sorry for such a rookie mistake! 
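The tautology that JDK-8327381 moves into `BoolNode::Value`, namely that `((x & m) u<= m)` always holds, can be cross-checked from plain Java with `Integer.compareUnsigned`. A minimal standalone sketch (not the PR's actual IR test; the class name is made up):

```java
import java.util.Random;

public class BoolTautologyCheck {
    public static void main(String[] args) {
        Random r = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int x = r.nextInt();
            int m = r.nextInt();
            // (x & m) keeps only a subset of m's bits, so as an
            // unsigned value it can never exceed m.
            if (Integer.compareUnsigned(x & m, m) > 0) {
                throw new AssertionError("counterexample: x=" + x + " m=" + m);
            }
        }
        System.out.println("ok");
    }
}
```

Because the comparison is a constant `true` for every possible `x`, C2 can fold the whole `Bool` node to `1`, which is exactly what the `Value` refactoring expresses as a range reduction.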
------------- PR Comment: https://git.openjdk.org/jdk/pull/18198#issuecomment-1993109007 From ddong at openjdk.org Wed Mar 13 02:24:40 2024 From: ddong at openjdk.org (Denghui Dong) Date: Wed, 13 Mar 2024 02:24:40 GMT Subject: RFR: 8327661: C1: Make RBP allocatable on x64 when PreserveFramePointer is disabled [v2] In-Reply-To: References: Message-ID: > Hi, > > Could I have a review of this change that makes RBP allocatable in c1 register allocation when PreserveFramePointer is not enabled. > > There seems to be no reason that RBP cannot be used. Although the performance of c1 jit code is not very critical, in my opinion, this change will not add compilation overhead. So maybe it is acceptable. > > I am not very sure if I have changed all the places that should be. > > Performance: > > I wrote a simple JMH included in this patch. > > On Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz > > Before this change: > > > Benchmark Mode Cnt Score Error Units > C1PreserveFramePointer.WithPreserveFramePointer.calculate avgt 16 15.270 ± 0.011 ns/op > C1PreserveFramePointer.WithoutPreserveFramePointer.calculate avgt 16 14.479 ± 0.012 ns/op > > > After this change: > > > Benchmark Mode Cnt Score Error Units > C1PreserveFramePointer.WithPreserveFramePointer.calculate avgt 16 15.264 ± 0.006 ns/op > C1PreserveFramePointer.WithoutPreserveFramePointer.calculate avgt 16 14.057 ± 0.005 ns/op > > > > Testing: fastdebug tier1-4 on Linux x64 Denghui Dong has updated the pull request incrementally with one additional commit since the last revision: fix: rbp should be callee saved ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18167/files - new: https://git.openjdk.org/jdk/pull/18167/files/6e8020fb..a6270736 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18167&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18167&range=00-01 Stats: 18 lines in 5 files changed: 8 ins; 3 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/18167.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18167/head:pull/18167 PR: https://git.openjdk.org/jdk/pull/18167 From ddong at openjdk.org Wed Mar 13 02:33:12 2024 From: ddong at openjdk.org (Denghui Dong) Date: Wed, 13 Mar 2024 02:33:12 GMT Subject: RFR: 8327661: C1: Make RBP allocatable on x64 when PreserveFramePointer is disabled [v2] In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 19:19:32 GMT, Dean Long wrote: >> Denghui Dong has updated the pull request incrementally with one additional commit since the last revision: >> >> fix: rbp should be callee saved > > src/hotspot/cpu/x86/c1_Defs_x86.hpp line 47: > >> 45: >> 46: #ifdef _LP64 >> 47: #define UNALLOCATED 3 // rsp, r15, r10 > > This affects pd_nof_caller_save_cpu_regs_frame_map below, but RBP is callee-saved, not caller-saved. Yes. I updated the patch. I want to confirm: if we treat RBP as caller-saved, is there any correctness problem?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18167#discussion_r1522399378 From jkarthikeyan at openjdk.org Wed Mar 13 04:08:13 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 13 Mar 2024 04:08:13 GMT Subject: RFR: 8327381 Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v4] In-Reply-To: References: Message-ID: <92BvqrZ-rwtf2tU1yJuKAVvgqVaZd8Q7Gfi4PNZBBk8=.ce7d0e76-ea10-47fc-b5c0-78ab7692b482@github.com> On Wed, 13 Mar 2024 02:05:39 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. > > Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: > > update the package name for tests No worries, looks good to me now :) ------------- Marked as reviewed by jkarthikeyan (Author). PR Review: https://git.openjdk.org/jdk/pull/18198#pullrequestreview-1933123222 From ddong at openjdk.org Wed Mar 13 06:49:30 2024 From: ddong at openjdk.org (Denghui Dong) Date: Wed, 13 Mar 2024 06:49:30 GMT Subject: RFR: 8327661: C1: Make RBP allocatable on x64 when PreserveFramePointer is disabled [v3] In-Reply-To: References: Message-ID: > Hi, > > Could I have a review of this change that makes RBP allocatable in c1 register allocation when PreserveFramePointer is not enabled. > > There seems no reason that RBP cannot be used. 
Although the performance of c1 jit code is not very critical, in my opinion, this change will not add compilation overhead. So maybe it is acceptable. > > I am not very sure if I have changed all the places that should be. > > Testing: fastdebug tier1-4 on Linux x64 Denghui Dong has updated the pull request incrementally with one additional commit since the last revision: delete jmh ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18167/files - new: https://git.openjdk.org/jdk/pull/18167/files/a6270736..972b12ee Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18167&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18167&range=01-02 Stats: 74 lines in 1 file changed: 0 ins; 74 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/18167.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18167/head:pull/18167 PR: https://git.openjdk.org/jdk/pull/18167 From epeter at openjdk.org Wed Mar 13 06:59:21 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 13 Mar 2024 06:59:21 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v4] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 20:26:25 GMT, Hamlin Li wrote: >> Hi, >> Can you have a look at this patch adding some tests for Math.round intrinsics? >> Thanks! >> >> ### FYI: >> During the development of RoundVF/RoundF, we faced issues which were only spotted by running tests exhaustively against the 32/64-bit range of int/long. >> It's helpful to add these exhaustive tests in the JDK for possible future usage, rather than building them every time they are needed. >> Of course, we need to put it in `manual` mode, so it's not run when the `-automatic` jtreg option is specified, which I guess is the mode the CI uses; please correct me if I'm assuming incorrectly. > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > refine code; fix bug Thanks for adjusting the IR rules!
I still have trouble reviewing your input value generation, and I have a few other comments. Thanks for the work you are putting in, I really appreciate it! test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 30: > 28: * @summary Test vector intrinsic for Math.round(double) in full 64 bits range. > 29: * > 30: * @requires vm.compiler2.enabled Do we really require C2? We should also run this for C1, and any other potential compiler. test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 110: > 108: fis[fidx] = 1 << fidx; > 109: } > 110: fis[fidx++] = 0; The zero is now always in the same spot. What if vectorization messes up only in a specific slot, and then never encounters that zero? We would maybe never see a zero in that bad spot. ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17753#pullrequestreview-1933271647 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1522615886 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1522626600 From epeter at openjdk.org Wed Mar 13 06:59:22 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 13 Mar 2024 06:59:22 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v3] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 20:16:02 GMT, Hamlin Li wrote: >> Does this ever generate things like `+0, -0, infty, NaN` etc? > > It's testing the following cases: > 1. all the `e` range, e.g. for double it's 11 bits, for float it's 8 bits > 2. for `f` I add a special value `0` explicitly `fis[fidx++] = 0;` > 3. for sign, both `+` and `-` are tested. > > So, yes, it will test cases like `+/- 0, infty, NaN`.
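The generation being discussed walks the sign, exponent (`e`), and fraction (`f`) fields of the IEEE 754 encoding. How such fields map to doubles, including the `+/- 0, infty, NaN` specials, can be sketched with `Double.longBitsToDouble`. The helper below is hypothetical and is not the test's actual code:

```java
public class DoubleBits {
    // Assemble a double from an explicit sign bit (1), exponent (11 bits),
    // and fraction (52 bits), mirroring the IEEE 754 binary64 layout.
    static double make(int sign, long e, long f) {
        long bits = ((long) sign << 63) | ((e & 0x7ffL) << 52) | (f & 0xfffffffffffffL);
        return Double.longBitsToDouble(bits);
    }

    public static void main(String[] args) {
        // e = 0x7ff with f == 0 is infinity; with f != 0 it is NaN.
        System.out.println(make(0, 0x7ff, 0));             // Infinity
        System.out.println(make(1, 0x7ff, 0));             // -Infinity
        System.out.println(make(0, 0x7ff, 1));             // NaN
        // e == 0, f == 0 gives signed zeros; Math.round maps both to 0.
        System.out.println(Math.round(make(1, 0, 0)));     // -0.0 rounds to 0
        System.out.println(Math.round(make(0, 0x7ff, 1))); // NaN rounds to 0
    }
}
```

Sweeping all 2^11 exponent values against a set of fraction patterns (powers of two plus an explicit zero, as in the patch under review) covers every special-value class without enumerating the full 64-bit space.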
>> This direct comparison tells me that you are not testing `NaN`s... > >> Is it ok to use Math.round here? What if we compile it, and its computation is wrong in the compilation? > > It's a bug, will fix. > >> This direct comparison tells me that you are not testing NaNs... > > this comparison is between long value, for NaN Math.round(NaN) == 0. Or maybe I misunderstood your question? Yes, you are right. I somehow thought that `Math.round` returns a float/double. But it is int/long. So exact comparison is good ? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1522626925 PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1522620080 From stuefe at openjdk.org Wed Mar 13 07:25:20 2024 From: stuefe at openjdk.org (Thomas Stuefe) Date: Wed, 13 Mar 2024 07:25:20 GMT Subject: RFR: JDK-8327986: ASAN reports use-after-free in DirectivesParserTest.empty_object_vm Message-ID: ASAN reports a use-after-free, because we feed the string we got from `setlocale` back to `setlocale`, but the libc owns this string, and the libc decided to free it in the meantime. According to POSIX, it should be valid to pass into setlocale output from setlocale. However, glibc seems to delete the old string when calling setlocale again: https://codebrowser.dev/glibc/glibc/locale/setlocale.c.html#198 Best to make a copy, and pass in the copy to setlocale. 
------------- Commit messages: - JDK-8327986-ASAN-reports-use-after-free-in-DirectivesParserTest-empty_object_vm Changes: https://git.openjdk.org/jdk/pull/18235/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18235&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327986 Stats: 3 lines in 1 file changed: 1 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18235.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18235/head:pull/18235 PR: https://git.openjdk.org/jdk/pull/18235 From sspitsyn at openjdk.org Wed Mar 13 07:46:18 2024 From: sspitsyn at openjdk.org (Serguei Spitsyn) Date: Wed, 13 Mar 2024 07:46:18 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v29] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 15:53:39 GMT, Dmitry Chuyko wrote: >> Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack. >> >> A matching directive will be applied at method compilation time when such compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug, issues such a directive but this does not affect the application behavior. In such case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. 
>> >> It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypass inlined methods). >> >> Natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that we can try to re-compile the method letting compile broker to perform it taking new directives stack into account. Re-compilation helps to prevent hot methods from execution in the interpreter. >> >> A new flag `-r` has beed introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and puts methods that have any active non-default matching compiler directives to re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. >> >> In addition, a new diagnostic command `Compiler.replace_directives... > > Dmitry Chuyko has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 47 commits: > > - Resolved master conflicts > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - ... and 37 more: https://git.openjdk.org/jdk/compare/782206bc...ff39ac12 src/hotspot/share/ci/ciEnv.cpp line 1144: > 1142: > 1143: if (entry_bci == InvocationEntryBci) { > 1144: if (TieredCompilation) { Just a naive question. Why this check has been removed? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14111#discussion_r1522682325 From sspitsyn at openjdk.org Wed Mar 13 07:51:18 2024 From: sspitsyn at openjdk.org (Serguei Spitsyn) Date: Wed, 13 Mar 2024 07:51:18 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v29] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 15:53:39 GMT, Dmitry Chuyko wrote: >> Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack. >> >> A matching directive will be applied at method compilation time when such compilation is started. 
If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug, issues such a directive but this does not affect the application behavior. In such case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. >> >> It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypass inlined methods). >> >> Natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that we can try to re-compile the method letting compile broker to perform it taking new directives stack into account. Re-compilation helps to prevent hot methods from execution in the interpreter. >> >> A new flag `-r` has beed introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and puts methods that have any active non-default matching compiler directives to re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. 
>> >> In addition, a new diagnostic command `Compiler.replace_directives... > > Dmitry Chuyko has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 47 commits: > > - Resolved master conflicts > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - Merge branch 'openjdk:master' into compiler-directives-force-update > - ... and 37 more: https://git.openjdk.org/jdk/compare/782206bc...ff39ac12 src/hotspot/share/services/diagnosticCommand.cpp line 928: > 926: DCmdWithParser(output, heap), > 927: _filename("filename", "Name of the directives file", "STRING", true), > 928: _refresh("-r", "Refresh affected methods.", "BOOLEAN", false, "false") { Nit: The dot is not needed at the end, I think. The same applies to lines: 945, 970 and 987. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14111#discussion_r1522688369 From roland at openjdk.org Wed Mar 13 08:01:12 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 13 Mar 2024 08:01:12 GMT Subject: RFR: 8325613: CTW: Stale method cleanup requires GC after Sweeper removal In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 19:11:42 GMT, Aleksey Shipilev wrote: > See more details in the bug. 
There is a double-whammy from two issues: a) Sweeper was removed, and now the cleanup work is done during GC, which does not really happen as CTW barely allocates anything; b) CTW calls for explicit deoptimization often, at which point CTW threads get mostly busy spin-waiting and yielding for the deopt epoch to move (that is why you see lots of `sys%`). (a) leads to stale method buildup, which makes (b) progressively worse. > > This PR adds explicit GC calls to CTW runner. Since CTW allocates and retains a little, those GCs are quite fast. I chose the threshold by running some CTW tests on my machines. I think we are pretty flat in the 25..100 region, so I chose the higher threshold for additional safety. > > This patch improves both CPU and wall times for CTW testing dramatically, as you can see from the logs below. It still does not recuperate completely to JDK 17 levels, but at least it is not regressing as badly. > > > --- x86_64 EC2, applications/ctw/modules CTW > > jdk17u-dev: 4511.54s user 169.43s system 1209% cpu 6:27.07 total > current mainline: 11678.13s user 8687.06s system 2299% cpu 14:45.62 total > > GC every 25 methods: 5050.83s user 670.38s system 1629% cpu 5:51.04 total > GC every 50 methods: 4965.41s user 709.64s system 1670% cpu 5:39.77
total > GC every 100 methods: 237.38s user 34.53s system 169% cpu 2:40.54 total > GC every 200 methods: 251.70s user 39.60s system 172% cpu 2:48.40 total > GC every 400 methods: 271.50s user 42.55s system 185% cpu 2:49.66 total > GC every 800 methods: 389.51s user 69.41s system 204% cpu 3:44.01 total > GC every 1600 methods: 660.98s user 169.97s system 229% cpu 6:01.78 total Looks reasonable to me. ------------- Marked as reviewed by roland (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18249#pullrequestreview-1933379637 From djelinski at openjdk.org Wed Mar 13 08:06:13 2024 From: djelinski at openjdk.org (Daniel =?UTF-8?B?SmVsacWEc2tp?=) Date: Wed, 13 Mar 2024 08:06:13 GMT Subject: RFR: JDK-8327986: ASAN reports use-after-free in DirectivesParserTest.empty_object_vm In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 13:57:53 GMT, Thomas Stuefe wrote: > ASAN reports a use-after-free, because we feed the string we got from `setlocale` back to `setlocale`, but the libc owns this string, and the libc decided to free it in the meantime. > > According to POSIX, it should be valid to pass into setlocale output from setlocale. > > However, glibc seems to delete the old string when calling setlocale again: > > https://codebrowser.dev/glibc/glibc/locale/setlocale.c.html#198 > > Best to make a copy, and pass in the copy to setlocale. test/hotspot/gtest/compiler/test_directivesParser.cpp line 39: > 37: // These tests require the "C" locale to correctly parse decimal values > 38: DirectivesParserTest() : _locale(os::strdup(setlocale(LC_NUMERIC, nullptr), mtTest)) { > 39: setlocale(LC_NUMERIC, "C"); Would it fix the issue if we did this instead? Suggestion: DirectivesParserTest() : _locale(setlocale(LC_NUMERIC, "C")) { seems to me that the string returned by setlocale is only valid until the next setlocale call, and currently we call setlocale twice in the constructor, and save the result of the first call. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18235#discussion_r1522707838 From rcastanedalo at openjdk.org Wed Mar 13 08:17:17 2024 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 13 Mar 2024 08:17:17 GMT Subject: RFR: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 09:26:48 GMT, Roberto Castañeda Lozano wrote: > This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`. > > Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015. > > #### Testing > > - tier1-7 (linux-aarch64 and macosx-aarch64) with `-XX:LockingMode=2`. Thanks for reviewing, Axel and Dean! And thanks Axel for trying out Dean's suggestion!
------------- PR Comment: https://git.openjdk.org/jdk/pull/18183#issuecomment-1993783549 From rcastanedalo at openjdk.org Wed Mar 13 08:17:18 2024 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 13 Mar 2024 08:17:18 GMT Subject: Integrated: 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 09:26:48 GMT, Roberto Casta?eda Lozano wrote: > This changeset introduces a third `TEMP` register for the intermediate computations in the `cmpFastLockLightweight` and `cmpFastUnlockLightweight` aarch64 ADL instructions, instead of using the legacy `box` register. This prevents potential overwrites, and consequent erroneous uses, of `box`. > > Introducing a new `TEMP` seems conceptually simpler (and not necessarily worse from a performance perspective) than pre-assigning `box` an arbitrary register and marking it as `USE_KILL`, an alternative also suggested in the [JBS issue description](https://bugs.openjdk.org/browse/JDK-8326385). Compared to mainline, the changeset does not lead to any statistically significant regression in a set of locking-intensive benchmarks from DaCapo, Renaissance, SPECjvm2008, and SPECjbb2015. > > #### Testing > > - tier1-7 (linux-aarch64 and macosx-aarch64) with `-XX:LockingMode=2`. This pull request has now been integrated. 
Changeset: 07acc0bb Author: Roberto Castañeda Lozano URL: https://git.openjdk.org/jdk/commit/07acc0bbad2cd5b37013d17785ca466429966a0d Stats: 8 lines in 1 file changed: 0 ins; 0 del; 8 mod 8326385: [aarch64] C2: lightweight locking nodes kill the box register without specifying this effect Reviewed-by: aboldtch, dlong ------------- PR: https://git.openjdk.org/jdk/pull/18183 From stuefe at openjdk.org Wed Mar 13 08:30:13 2024 From: stuefe at openjdk.org (Thomas Stuefe) Date: Wed, 13 Mar 2024 08:30:13 GMT Subject: RFR: JDK-8327986: ASAN reports use-after-free in DirectivesParserTest.empty_object_vm In-Reply-To: References: Message-ID: <-IZ9AM14NCXAh8wTczqvc9WM77wOt2D9JD7ACF0SGxg=.80d254b7-5b92-451f-9897-5edfccb389df@github.com> On Wed, 13 Mar 2024 08:03:49 GMT, Daniel Jeliński wrote: >> test/hotspot/gtest/compiler/test_directivesParser.cpp line 39: >> >>> 37: // These tests require the "C" locale to correctly parse decimal values >>> 38: DirectivesParserTest() : _locale(os::strdup(setlocale(LC_NUMERIC, nullptr), mtTest)) { >>> 39: setlocale(LC_NUMERIC, "C"); >> >> Would it fix the issue if we did this instead? >> >> Suggestion: >> >> DirectivesParserTest() : _locale(setlocale(LC_NUMERIC, "C")) { >> >> >> seems to me that the string returned by setlocale is only valid until the next setlocale call, and currently we call setlocale twice in the constructor, and save the result of the first call. No. The first setlocale call returns the pointer to the last locale, which becomes invalid.
Changing the input string on the first setlocale call won't change that. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18235#discussion_r1522743065 From djelinski at openjdk.org Wed Mar 13 09:19:14 2024 From: djelinski at openjdk.org (Daniel =?UTF-8?B?SmVsacWEc2tp?=) Date: Wed, 13 Mar 2024 09:19:14 GMT Subject: RFR: JDK-8327986: ASAN reports use-after-free in DirectivesParserTest.empty_object_vm In-Reply-To: <-IZ9AM14NCXAh8wTczqvc9WM77wOt2D9JD7ACF0SGxg=.80d254b7-5b92-451f-9897-5edfccb389df@github.com> References: <-IZ9AM14NCXAh8wTczqvc9WM77wOt2D9JD7ACF0SGxg=.80d254b7-5b92-451f-9897-5edfccb389df@github.com> Message-ID: <9RM-tJQcA0tEL5iOy7UOE6XJAqIphmYfJyy6Ydgkmm4=.e96b3e85-bd3c-426d-aeb7-3e868294fbb3@github.com> On Wed, 13 Mar 2024 08:27:27 GMT, Thomas Stuefe wrote: >> test/hotspot/gtest/compiler/test_directivesParser.cpp line 39: >> >>> 37: // These tests require the "C" locale to correctly parse decimal values >>> 38: DirectivesParserTest() : _locale(os::strdup(setlocale(LC_NUMERIC, nullptr), mtTest)) { >>> 39: setlocale(LC_NUMERIC, "C"); >> >> Would it fix the issue if we did this instead? >> >> Suggestion: >> >> DirectivesParserTest() : _locale(setlocale(LC_NUMERIC, "C")) { >> >> >> seems to me that the string returned by setlocale is only valid until the next setlocale call, and currently we call setlocale twice in the constructor, and save the result of the first call. > > No. The first setlocale call returns the pointer to the last locale, which becomes invalid. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18235#discussion_r1522820269 From thartmann at openjdk.org Wed Mar 13 10:21:14 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 13 Mar 2024 10:21:14 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v5] In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 14:22:11 GMT, Christian Hagedorn wrote: >> In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. >> >> #### Redo refactoring of `create_bool_from_template_assertion_predicate()` >> On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). >> >> #### Share data graph cloning code - start from existing code >> This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: >> >> 1. Collect data nodes to clone by using a node filter >> 2. Clone the collected nodes (their data and control inputs still point to the old nodes) >> 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. 
In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. >> >> #### Shared data graph cloning class >> Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: >> >> 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] >> 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to ... > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > format That looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18080#pullrequestreview-1933708125 From duke at openjdk.org Wed Mar 13 10:34:42 2024 From: duke at openjdk.org (Oussama Louati) Date: Wed, 13 Mar 2024 10:34:42 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v10] In-Reply-To: References: Message-ID: > Completion of the first version of the migration for several tests. > > These tests, which are now using the classfile API mostly, are responsible for testing various aspects including: > > - Generate constant pool entries filled with method handles and method types. > - Create numerous invokedynamic instructions with correct bootstrap methods and incorrect bootstrap methods for testing error handling and exception catching. 
> - Produce many invokedynamic instructions with a specific constant pool entry. Oussama Louati has updated the pull request incrementally with two additional commits since the last revision: - halfway through this migration, had to switch to other test group and put these aside - Use ClassFile to get AccessFlags ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17834/files - new: https://git.openjdk.org/jdk/pull/17834/files/7056d444..527384d3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17834&range=08-09 Stats: 40 lines in 8 files changed: 20 ins; 10 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/17834.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17834/head:pull/17834 PR: https://git.openjdk.org/jdk/pull/17834 From fyang at openjdk.org Wed Mar 13 13:38:15 2024 From: fyang at openjdk.org (Fei Yang) Date: Wed, 13 Mar 2024 13:38:15 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v5] In-Reply-To: <8eua4Xmcp4X6a8a8mAithQ4UOyKYV7IgE3KWlkUOHXs=.50937123-aed8-4e57-9f1c-d6927c88eb87@github.com> References: <8eua4Xmcp4X6a8a8mAithQ4UOyKYV7IgE3KWlkUOHXs=.50937123-aed8-4e57-9f1c-d6927c88eb87@github.com> Message-ID: On Tue, 12 Mar 2024 17:21:24 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review the patch to add support for some vector intrinsics? >> Also complement various tests on riscv. >> Thanks. >> >> ## Test >> test/hotspot/jtreg/compiler/vectorapi/ >> test/hotspot/jtreg/compiler/vectorization/ > > Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: > > - merge master > - remove ucast from i/s/b to float > - revert some change; remove effect(TEMP_DEF dst) for non-extending intrinsics > - fix typo > - modify test config > - clean code > - add more tests > - rearrange tests layout > - merge master > - Initial commit Thanks for the quick update. Two minor comments remain. 
Looks good otherwise. src/hotspot/cpu/riscv/assembler_riscv.hpp line 1284: > 1282: INSN(vfwcvt_f_f_v, 0b1010111, 0b001, 0b01100, 0b010010); > 1283: INSN(vfwcvt_rtz_x_f_v, 0b1010111, 0b001, 0b01111, 0b010010); > 1284: INSN(vfwcvt_rtz_xu_f_v, 0b1010111, 0b001, 0b01110, 0b010010); I see no use of these newly added assembler functions. So test coverage would be an issue. Maybe add them in the future when they are really needed? src/hotspot/cpu/riscv/riscv_v.ad line 3215: > 3213: %} > 3214: > 3215: instruct vcvtUBtoX_extend(vReg dst, vReg src) %{ Personally, I don't like the `_extend` suffix in the instruct name. I prefer names like `vzeroExtBtoX` which make it explicit that this will zero-extend the vector elements. Or simply `vcvtUBtoX`. ------------- PR Review: https://git.openjdk.org/jdk/pull/18040#pullrequestreview-1934187709 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1523264264 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1523273365 From chagedorn at openjdk.org Wed Mar 13 14:01:24 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 13 Mar 2024 14:01:24 GMT Subject: RFR: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class [v5] In-Reply-To: References: Message-ID: On Wed, 6 Mar 2024 14:22:11 GMT, Christian Hagedorn wrote: >> In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. >> >> #### Redo refactoring of `create_bool_from_template_assertion_predicate()` >> On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. 
We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). >> >> #### Share data graph cloning code - start from existing code >> This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: >> >> 1. Collect data nodes to clone by using a node filter >> 2. Clone the collected nodes (their data and control inputs still point to the old nodes) >> 3. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. >> >> #### Shared data graph cloning class >> Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: >> >> 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] >> 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to ... 
> > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > format Thanks for your review Tobias! ------------- PR Comment: https://git.openjdk.org/jdk/pull/18080#issuecomment-1994470269 From chagedorn at openjdk.org Wed Mar 13 14:01:25 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 13 Mar 2024 14:01:25 GMT Subject: Integrated: 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class In-Reply-To: References: Message-ID: On Fri, 1 Mar 2024 13:27:38 GMT, Christian Hagedorn wrote: > In the review process of https://github.com/openjdk/jdk/pull/16877, we identified an existing issue in `create_bool_from_template_assertion_predicate()` which is also still present in the refactoring of https://github.com/openjdk/jdk/pull/16877: In rare cases, we could endlessly re-process nodes in the DFS walk since a visited set is missing. This needs to be addressed. > > #### Redo refactoring of `create_bool_from_template_assertion_predicate()` > On top of that bug, the refactored version of https://github.com/openjdk/jdk/pull/16877 is still quite complicated to understand since it tries to do multiple steps simultaneously. We've decided to redo the refactoring and better separate the steps to simplify the algorithm. By doing so, we also want to fix the existing bug. This work is split into three separate RFEs (JDK-8327109, JDK-8327110, and JDK-8327111). > > #### Share data graph cloning code - start from existing code > This first PR starts with the existing code found in `clone_nodes_with_same_ctrl()` which is called by `create_new_if_for_predicate()`. `clone_nodes_with_same_ctrl()` already does the data graph cloning in 3 separate steps which can be used as foundation: > > 1. Collect data nodes to clone by using a node filter > 2. Clone the collected nodes (their data and control inputs still point to the old nodes) > 3. 
Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. In this pass, also fix the control inputs of any pinned data node from the old uncommon projection to the new one. > > #### Shared data graph cloning class > Some of these steps above are shared with the data graph cloning done in `create_bool_from_template_assertion_predicate()` (refactored in JDK-8327110 and JDK-8327111). We therefore extract them in this patch such that we can reuse it in the refactoring for `create_bool_from_template_assertion_predicate()` later. We create a new `DataNodeGraph` class which does the following (to be shared) cloning of a data graph: > > 1. Take a collection of data nodes (the collection step is different in `clone_nodes_with_same_ctrl()` compared to `create_bool_from_template_assertion_predicate()` and thus cannot be shared) and clone them. [Same as step 2 above] > 2. Fix the cloned data node inputs pointing to the old nodes to the cloned inputs by using an old->new mapping. [Same as first part of step 3 above but drop the second part of rewiring control inputs which is specific to `clone_nodes_with_same_ctrl()`] > > `... This pull request has now been integrated. Changeset: 7d8561d5 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/7d8561d56bf064e388417530b9b71755e4ac3f76 Stats: 137 lines in 5 files changed: 72 ins; 34 del; 31 mod 8327109: Refactor data graph cloning used in create_new_if_for_predicate() into separate class Reviewed-by: epeter, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/18080 From mli at openjdk.org Wed Mar 13 16:32:51 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 13 Mar 2024 16:32:51 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v5] In-Reply-To: References: Message-ID: > Hi, > Can you have a look at this patch adding some tests for Math.round intrinsics? > Thanks! 
> > ### FYI: > During the development of RoundVF/RoundF, we faced issues which were only spotted by running tests exhaustively against the 32/64-bit range of int/long. > It's helpful to add these exhaustive tests in jdk for future possible usage, rather than build them every time when needed. > Of course, we need to put them in `manual` mode, so they're not run when the `-automatic` jtreg option is specified, which I guess is the mode CI uses; please correct me if I assume incorrectly. Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: add comments; refine code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17753/files - new: https://git.openjdk.org/jdk/pull/17753/files/e1127c76..2afa8160 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17753&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17753&range=03-04 Stats: 36 lines in 2 files changed: 24 ins; 6 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/17753.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17753/head:pull/17753 PR: https://git.openjdk.org/jdk/pull/17753 From mli at openjdk.org Wed Mar 13 16:32:51 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 13 Mar 2024 16:32:51 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v3] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 06:55:55 GMT, Emanuel Peter wrote: >> It's testing the following cases: >> 1. all the `e` range, e.g. for double it's 11 bits, for float it's 8 bits >> 2. for `f` I add a special value `0` explicitly `fis[fidx++] = 0;` >> 3. for sign, both `+` and `-` are tested. >> >> So, yes, it will test cases like `+/- 0, infty, NaN`. > > Can you refactor or at least comment the code a little better, or use more expressive variable names? > I'd have to spend a bit of time to understand your generation method here, and if I think that it is exhaustive and covers the special cases with enough frequency. 
Sure, I will add some comments to illustrate how it works. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1523575330 From mli at openjdk.org Wed Mar 13 16:32:51 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 13 Mar 2024 16:32:51 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v4] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 06:55:38 GMT, Emanuel Peter wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> refine code; fix bug > > test/hotspot/jtreg/compiler/vectorization/TestRoundVectorDoubleRandom.java line 110: > >> 108: fis[fidx] = 1 << fidx; >> 109: } >> 110: fis[fidx++] = 0; > > The zero is now always in the same spot. What if vectorization messes up only in a specific slot, and then never encounters that zero? We would maybe never see a zero in that bad spot. Good point, will make it random, hope this will resolve the issue. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17753#discussion_r1523577070 From mli at openjdk.org Wed Mar 13 16:39:15 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 13 Mar 2024 16:39:15 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v4] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 06:56:58 GMT, Emanuel Peter wrote: > Thanks for adjusting the IR rules! > > I still have trouble reviewing your input value generation, and a few other comments. > > Thanks for the work you are putting in, I really appreciate it! Thanks for your suggestion and patient reviewing! :) I just updated the patch, please have a look again. 
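For readers following the discussion, the generation scheme being reviewed here (exhaustive over the exponent field, a zero and some random fraction patterns, both signs) can be sketched as a standalone program. This is an illustrative sketch only, not the actual jtreg test; the class name, method, and constants are invented for the example:

```java
import java.util.Random;

// Sketch of the discussed float input generation: cover every 8-bit exponent
// pattern exhaustively, pair it with a zero fraction, a fixed non-zero
// fraction, and a couple of random 23-bit fractions, and emit both signs.
// The zero fraction at the extreme exponents yields +/-0.0 and +/-infinity;
// a non-zero fraction with the all-ones exponent yields NaN.
public class RoundInputSketch {
    static float[] generate(long seed) {
        Random rnd = new Random(seed);
        int fracPatterns = 4; // fraction 0, fraction 1, two random fractions
        float[] out = new float[256 * fracPatterns * 2];
        int idx = 0;
        for (int e = 0; e < 256; e++) {            // exhaustive exponent field
            for (int p = 0; p < fracPatterns; p++) {
                int f = (p == 0) ? 0 : (p == 1) ? 1 : rnd.nextInt(1 << 23);
                for (int sign = 0; sign <= 1; sign++) {
                    int bits = (sign << 31) | (e << 23) | f;
                    out[idx++] = Float.intBitsToFloat(bits);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        float[] vals = generate(42L);
        boolean sawNaN = false, sawInf = false, sawZero = false;
        for (float v : vals) {
            sawNaN |= Float.isNaN(v);       // e == 255, f != 0
            sawInf |= Float.isInfinite(v);  // e == 255, f == 0
            sawZero |= (v == 0.0f);         // e == 0,   f == 0
        }
        System.out.println(vals.length + " " + sawNaN + " " + sawInf + " " + sawZero);
        // prints: 2048 true true true
    }
}
```

The fixed fraction `1` guarantees that NaN is covered regardless of the random draws, which matches the reviewer's concern about special cases appearing with enough frequency.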
------------- PR Comment: https://git.openjdk.org/jdk/pull/17753#issuecomment-1994917255 From mli at openjdk.org Wed Mar 13 16:45:19 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 13 Mar 2024 16:45:19 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v5] In-Reply-To: References: <8eua4Xmcp4X6a8a8mAithQ4UOyKYV7IgE3KWlkUOHXs=.50937123-aed8-4e57-9f1c-d6927c88eb87@github.com> Message-ID: On Wed, 13 Mar 2024 13:29:32 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: >> >> - merge master >> - remove ucast from i/s/b to float >> - revert some chnage; remove effect(TEMP_DEF dst) for non-extending intrinsics >> - fix typo >> - modify test config >> - clean code >> - add more tests >> - rearrange tests layout >> - merge master >> - Initial commit > > src/hotspot/cpu/riscv/assembler_riscv.hpp line 1284: > >> 1282: INSN(vfwcvt_f_f_v, 0b1010111, 0b001, 0b01100, 0b010010); >> 1283: INSN(vfwcvt_rtz_x_f_v, 0b1010111, 0b001, 0b01111, 0b010010); >> 1284: INSN(vfwcvt_rtz_xu_f_v, 0b1010111, 0b001, 0b01110, 0b010010); > > I see no use of these newly added assembler functions. So test coverage would be an issue. Maybe add them in the future when they are really needed? Sure, will fix. > src/hotspot/cpu/riscv/riscv_v.ad line 3215: > >> 3213: %} >> 3214: >> 3215: instruct vcvtUBtoX_extend(vReg dst, vReg src) %{ > > Personally, I don't like the `_extend` suffix in the instruct name. I prefer names like `vzeroExtBtoX` which make it explicit that this will zero-extend the vector elements. Or simply `vcvtUBtoX`. 
Agree ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1523595888 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1523599044 From dchuyko at openjdk.org Wed Mar 13 16:58:23 2024 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Wed, 13 Mar 2024 16:58:23 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v29] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 07:43:35 GMT, Serguei Spitsyn wrote: >> Dmitry Chuyko has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 47 commits: >> >> - Resolved master conflicts >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - Merge branch 'openjdk:master' into compiler-directives-force-update >> - ... and 37 more: https://git.openjdk.org/jdk/compare/782206bc...ff39ac12 > > src/hotspot/share/ci/ciEnv.cpp line 1144: > >> 1142: >> 1143: if (entry_bci == InvocationEntryBci) { >> 1144: if (TieredCompilation) { > > Just a naive question. Why this check has been removed? 
We want to allow replacing a C2 version of a method with another C2 version of the same method in both tiered and non-tiered modes, which was not allowed before. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14111#discussion_r1523618332 From mli at openjdk.org Wed Mar 13 17:05:41 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 13 Mar 2024 17:05:41 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v6] In-Reply-To: References: Message-ID: > Hi, > Can you help to review the patch to add support for some vector intrinsics? > Also complement various tests on riscv. > Thanks. > > ## Test > test/hotspot/jtreg/compiler/vectorapi/ > test/hotspot/jtreg/compiler/vectorization/ Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: remove unused instructions; rename instructions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18040/files - new: https://git.openjdk.org/jdk/pull/18040/files/179046b3..3fb61768 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=04-05 Stats: 12 lines in 2 files changed: 0 ins; 4 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/18040.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18040/head:pull/18040 PR: https://git.openjdk.org/jdk/pull/18040 From dchuyko at openjdk.org Wed Mar 13 17:14:28 2024 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Wed, 13 Mar 2024 17:14:28 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v30] In-Reply-To: References: Message-ID: > Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. 
It is also possible to clear all directives or remove the top from the stack. > > A matching directive will be applied at method compilation time when such compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug, and issues such a directive, but this does not affect the application behavior. In such a case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. > > It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypass inlined methods). > > The natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that we can try to re-compile the method, letting the compile broker perform it taking the new directives stack into account. Re-compilation helps to prevent hot methods from executing in the interpreter. > > A new flag `-r` has been introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and puts methods that have any active non-default matching compiler directives to re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. 
On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. > > In addition, a new diagnostic command `Compiler.replace_directives` has been added for ... Dmitry Chuyko has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 48 commits: - Merge branch 'openjdk:master' into compiler-directives-force-update - Resolved master conflicts - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - ... 
and 38 more: https://git.openjdk.org/jdk/compare/5cae7d20...22b42347 ------------- Changes: https://git.openjdk.org/jdk/pull/14111/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14111&range=29 Stats: 381 lines in 15 files changed: 348 ins; 3 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/14111.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14111/head:pull/14111 PR: https://git.openjdk.org/jdk/pull/14111 From bkilambi at openjdk.org Wed Mar 13 17:20:16 2024 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 13 Mar 2024 17:20:16 GMT Subject: RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 17:59:17 GMT, Emanuel Peter wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments for changes in backend rules and code style > > On a more visionary note: > > We should make sure that the actual `ReductionNode` gets moved out of the loop, when possible. > [JDK-8309647](https://bugs.openjdk.org/browse/JDK-8309647) [Vector API] Move Reduction outside loop when possible > We have an RFE for that, I have not yet had time or priority for it. > Of course the user can already move it out of the loop themselves. > > If the `ReductionNode` is out of the loop, then you usually just have a very cheap accumulation inside the loop, a `MulVF` for example. That would certainly be cheap enough to allow vectorization. > > So in that case, your optimization here should not just affect SVE, but also NEON and x86. > > Why does your patch not do anything for x86? I guess x86 AD-files have no float/double reduction for the associative case, only the non-associative (strict order). But I think it would be easy to implement, just take the code used for int/long etc reductions. > > What do you think about that? > I'm not saying you have to do it all, or even in this RFE. 
I'd just like to hear what is the bigger plan, and why you restrict things so much to SVE. @eme64 Thank you so much for your review and feedback comments. Here are my responses to your questions - **> So in that case, your optimization here should not just affect SVE, but also NEON and x86.** From what I understand, even when the reduction nodes are hoisted out of the loop, it would still generate AddReductionVF/VD nodes (fewer, as we accumulate inside the loop now) and based on the choice of order the corresponding backend match rules (as included in this patch) should generate the expected instruction sequence. I don't think we would need any code changes for Neon/SVE after hoisting the reductions out of the loop. Please let me know if my understanding is incorrect. **> Why does your patch not do anything for x86? I guess x86 AD-files have no float/double reduction for the associative case, only the non-associative (strict order). But I think it would be easy to implement, just take the code used for int/long etc reductions.** Well, what I meant was that the changes in this patch (specifically the mid-end part) do not break/change anything in x86 (or any other platform). Yes, the current *.ad files might have rules only for the strict order case and more rules can be added for the non-associative case if that benefits x86 (same for other platforms). So for aarch64, we have different instruction(s) for floating-point strict order/non-strict order and we know which ones are beneficial to be generated on which aarch64 machines. However, I am not well versed with the x86 ISA and would request anyone from Intel or someone who has the expertise with the x86 ISA to make the x86 related changes please (if required). **> What do you think about that? I'm not saying you have to do it all, or even in this RFE. 
I'd just like to hear what is the bigger plan, and why you restrict things so much to SVE.** To give a background: The motivation for this patch was a significant performance degradation with SVE instructions compared to Neon for this testcase - FloatMaxVector.ADDLanes on a 128-bit SVE machine. It generates the SVE "fadda" instruction which is a strictly-ordered floating-point add reduction instruction. As it has a higher cost compared to the Neon implementation for FP add reduction, the performance with "fadda" was ~66% worse compared to Neon. As VectorAPI does not impose any rules on FP ordering, it could have generated the faster non-strict Neon instructions instead (on a 128-bit SVE machine). This is the reason we included a flag "requires_strict_order": to mark a reduction node as strictly-ordered or non-strictly ordered and generate the corresponding backend instructions. On aarch64, this patch only affects the 128-bit SVE machines. On SVE machines >128 bits, the "fadda" instruction is generated as it was before this patch. There's no change on Neon as well - the non-strict Neon instructions are generated with VectorAPI and no auto-vectorization is allowed for FP reduction nodes. Although this change was done keeping SVE in mind, this patch can help generate strictly ordered or non-strictly ordered code on other platforms as well (if they have different implementations for both) and also simplifies the IdealGraph a bit by removing the UnorderedReductionNode. 
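For readers, the strict-order versus reassociated distinction discussed above can be demonstrated in plain Java. This is a minimal illustrative sketch (class and method names are invented for the example): float addition is not associative, so an in-order accumulation and a reassociated (pairwise) accumulation of the same array can produce different results, which is exactly why a reduction must advertise whether it requires strict order.

```java
public class FpReductionOrderSketch {
    // In-order accumulation, as a strictly-ordered reduction performs it.
    static float strictSum(float[] a) {
        float s = 0f;
        for (float v : a) s += v;
        return s;
    }

    // Pairwise (reassociated) accumulation, one possible order a lane-wise
    // vector reduction may use.
    static float pairwiseSum(float[] a, int lo, int hi) {
        if (hi - lo == 1) return a[lo];
        int mid = (lo + hi) / 2;
        return pairwiseSum(a, lo, mid) + pairwiseSum(a, mid, hi);
    }

    public static void main(String[] args) {
        // 1e-8f is absorbed when added directly to +/-1e8f, so the two
        // summation orders disagree on this input.
        float[] a = {1e8f, 1e-8f, -1e8f, 1e-8f};
        System.out.println(strictSum(a) == 1e-8f);             // true
        System.out.println(pairwiseSum(a, 0, a.length) == 0f); // true
    }
}
```

In the strict order the large terms cancel before the final small term is added; in the pairwise order each small term is absorbed by a large neighbor first, so the result is exactly zero.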
------------- PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-1995054812 From jbhateja at openjdk.org Wed Mar 13 17:25:23 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 13 Mar 2024 17:25:23 GMT Subject: RFR: 8319889: Vector API tests trigger VM crashes with -XX:+StressIncrementalInlining Message-ID: <25mULcFaZsaLw0z88ywNaq3virkrimZa-rpgWO8d3Dc=.04909c6c-42f4-4981-895d-2d7f07b027d5@github.com>

This bug fix patch fixes the following two issues:

1) Removing memory-operand-based masked shift instruction selection patterns. As per Java Language Specification section 15.19, the shift count is masked to fit within the valid shift range by performing a bitwise AND with shift_mask; this results in the creation of an AndV IR node after loading the original mask into a vector. The existing patterns will not be able to match this graph shape. Extending the pattern to cover the AndV IR would associate the memory operand with the And operation, and we would need to emit an additional vector AND instruction before the shift instruction; the existing memory operand pattern for AndV already handles such a graph shape.

2) The crash occurs due to the combined effect of bi-morphic inlining, exception handling, and randomized incremental inlining. In this case, the top-level slice API is invoked using a concrete 256-bit vector. Due to randomized IncrementalInlining, some of the intermediate APIs within sliceTemplate are marked for lazy inlining; these APIs return an abstract vector which, when used for virtual dispatch of subsequent APIs, results in bi-morphic inlining on account of multiple profile-based receiver types. Consider the following code snippet.
ByteVector sliceTemplate(int origin, Vector v1) {
    ByteVector that = (ByteVector) v1;
    that.check(this);
    Objects.checkIndex(origin, length() + 1);
    VectorShuffle iota = iotaShuffle();
    VectorMask blendMask = iota.toVector().compare(VectorOperators.LT, (broadcast((byte)(length() - origin))));   [A]
    iota = iotaShuffle(origin, 1, true);                                                                         [B]
    return that.rearrange(iota).blend(this.rearrange(iota), blendMask);                                          [C]
}

The receiver of sliceTemplate is a 256-bit vector. The parser defers inlining of the toVector() API (see line A) and generates a Call IR node returning an abstract vector. This abstract vector then virtually dispatches the compare API. The compiler observes multiple profile-based receiver types (128- and 256-bit byte vectors) for the compare API, and the parser generates a chain of PredictedCallGenerators to bi-morphically inline it.

PredictedCallGenerators (Vector.compare)
  PredictedCallGenerators (Byte256Vector.compare)
    ParseGenerator (Byte256Vector.compare)              [D]
    UncommonTrap (receiver other than Byte256Vector)
  PredictedCallGenerators (Byte128Vector.compare)
    ParseGenerator (Byte128Vector.compare)              [E]
    UncommonTrap (receiver other than Byte128Vector)    [F]
  PredictedCallGenerators (UncommonTrap)
[converged state] = Merge JVM State originating from C and E  [G]

Since the top-level receiver of sliceTemplate is Byte256Vector, while executing the call generator for Byte128Vector.compare (see line E) the compiler observes a mismatch between the incoming argument species, i.e. one argument is a 256-bit vector while the other is a 128-bit vector, and throws an exception. At the state convergence point (see line G), since one of the control paths resulted in an exception, the compiler propagates the JVM state of the other control path, comprising a Byte256Mask, to the downstream graph after bookkeeping the pending exception state. Similar to the toVector API, iotaShuffle (see line B) is also lazily inlined and returns an abstract vector, which results in bi-morphic inlining of rearrange.
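As a rough, self-contained illustration of what a bi-morphic call site with two profile-based receiver types looks like (the classes here are hypothetical stand-ins, not the Vector API itself):

```java
// A call site that only ever sees two receiver types is "bi-morphic":
// a profiling JIT can inline both implementations behind type checks.
abstract class Vec {
    abstract int lanes();
}

class Vec128 extends Vec {
    int lanes() { return 16; } // 128 bits of byte lanes
}

class Vec256 extends Vec {
    int lanes() { return 32; } // 256 bits of byte lanes
}

public class Bimorphic {
    static int total(Vec[] vs) {
        int sum = 0;
        for (Vec v : vs) {
            sum += v.lanes(); // bi-morphic virtual dispatch
        }
        return sum;
    }

    public static void main(String[] args) {
        Vec[] vs = { new Vec128(), new Vec256() };
        System.out.println(total(vs)); // 48
    }
}
```

In the crash scenario described above, the deferred inlining makes the receiver look abstract, so the compiler falls back to exactly this kind of guarded two-way inlining (PredictedCallGenerators) instead of a single monomorphic inline.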
State convergence due to bi-morphic inlining of rearrange results in the generation of an abstract ByteVector (Phi of Byte128Vector and Byte256Vector), which further causes bi-morphic inlining of the blend API due to multiple profile-based receiver types. The Byte128Vector.blend [Java implementation](https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java#L412) explicitly casts the incoming mask (Byte256Mask) to the Byte128Mask type, and this leads to the creation of a [null value](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/graphKit.cpp#L1417), which causes a crash while unboxing the mask during inline expansion of blend. To be safe, we relax the null-checking constraint during unboxing here to disable intrinsification.

All existing Vector API JTREG tests are passing with -XX:+StressIncrementalInlining at various AVX levels.

Please review and share your feedback.

Best Regards, Jatin

------------- Commit messages: - 8319889: Vector API tests trigger VM crashes with -XX:+StressIncrementalInlining Changes: https://git.openjdk.org/jdk/pull/18282/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18282&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8319889 Stats: 51 lines in 2 files changed: 3 ins; 46 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18282.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18282/head:pull/18282 PR: https://git.openjdk.org/jdk/pull/18282 From dchuyko at openjdk.org Wed Mar 13 17:31:32 2024 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Wed, 13 Mar 2024 17:31:32 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v31] In-Reply-To: References: Message-ID: > Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2).
The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack.
>
> A matching directive will be applied at method compilation time when such a compilation is started. If directives are added or changed but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long-running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug and issues such a directive, but this does not affect the application behavior. In such a case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers.
>
> It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypassing inlined methods).
>
> The natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that, we can try to re-compile the method, letting the compile broker perform it taking the new directives stack into account. Re-compilation helps to prevent hot methods from executing in the interpreter.
>
> A new flag `-r` has been introduced for some directives-related compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and puts methods that have any active non-default matching compiler directives to re-compilation if possible, otherwise marks them for deoptimization.
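For illustration, a directives file that such a stack might be built from could look like this (the match pattern is hypothetical; see JEP 165 for the exact syntax):

```
[
  {
    match: "com/example/Foo.hotMethod",
    c2: {
      Exclude: true
    }
  }
]
```

With this patch, an invocation along the lines of `jcmd <pid> Compiler.add_directives directives.json -r` would additionally refresh already compiled methods matched by the directive; the file name and exact command line here are a sketch, not taken from the patch.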
There is currently no distinction as to which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them.
>
> In addition, a new diagnostic command `Compiler.replace_directives` has been added for ... Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: No dots in -r descriptions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14111/files - new: https://git.openjdk.org/jdk/pull/14111/files/22b42347..36c30367 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14111&range=30 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14111&range=29-30 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/14111.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14111/head:pull/14111 PR: https://git.openjdk.org/jdk/pull/14111 From dchuyko at openjdk.org Wed Mar 13 17:34:24 2024 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Wed, 13 Mar 2024 17:34:24 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v29] In-Reply-To: References: Message-ID: <1AM7yClR1fxPAHevEwcSaNI8hP-KM2oVTwYT41pyEo0=.06098a55-7e89-4214-bcbe-faef2965f4df@github.com> On Wed, 13 Mar 2024 07:48:35 GMT, Serguei Spitsyn wrote: >> Dmitry Chuyko has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 47 commits:
>>
>> - Resolved master conflicts
>> - Merge branch 'openjdk:master' into compiler-directives-force-update
>> - Merge branch 'openjdk:master' into compiler-directives-force-update
>> - Merge branch 'openjdk:master' into compiler-directives-force-update
>> - Merge branch 'openjdk:master' into compiler-directives-force-update
>> - Merge branch 'openjdk:master' into compiler-directives-force-update
>> - Merge branch 'openjdk:master' into compiler-directives-force-update
>> - Merge branch 'openjdk:master' into compiler-directives-force-update
>> - Merge branch 'openjdk:master' into compiler-directives-force-update
>> - Merge branch 'openjdk:master' into compiler-directives-force-update
>> - ... and 37 more: https://git.openjdk.org/jdk/compare/782206bc...ff39ac12
>
> src/hotspot/share/services/diagnosticCommand.cpp line 928:
>
>> 926: DCmdWithParser(output, heap),
>> 927: _filename("filename", "Name of the directives file", "STRING", true),
>> 928: _refresh("-r", "Refresh affected methods.", "BOOLEAN", false, "false") {
>
> Nit: The dot is not needed at the end, I think. The same applies to lines: 945, 970 and 987.

Thanks, the dots were removed.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14111#discussion_r1523664980 From vlivanov at openjdk.org Wed Mar 13 19:45:14 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 13 Mar 2024 19:45:14 GMT Subject: RFR: 8319889: Vector API tests trigger VM crashes with -XX:+StressIncrementalInlining In-Reply-To: <25mULcFaZsaLw0z88ywNaq3virkrimZa-rpgWO8d3Dc=.04909c6c-42f4-4981-895d-2d7f07b027d5@github.com> References: <25mULcFaZsaLw0z88ywNaq3virkrimZa-rpgWO8d3Dc=.04909c6c-42f4-4981-895d-2d7f07b027d5@github.com> Message-ID: On Wed, 13 Mar 2024 17:19:40 GMT, Jatin Bhateja wrote:

> This bug fix patch fixes following two issues:-
>
> 1) Removing memory operand based masked shift instruction selection patterns.
As per Java specification section 15.19 shift count is rounded to fit within valid shift range by performing a bitwise AND with shift_mask this results into creation of an AndV IR after loading original mask into vector. Existing patterns will not be able to match this graph shape, extending the patten to cover AndV IR will associate memory operand with And operation and we will need to emit additional vectorAND instruction before shift instruction, existing memory operand patten for AndV already handle such a graph shape.
>
> 2) Crash occurs due to combined effect of bi-morphic inlining, exception handling, randomized incremental inlining. In this case top level slice API is invoked using concrete 256 bit vector, some of the intermediate APIs within sliceTemplate are marked for lazy inlining because due to randomized IncrementalInlining, these APIs returns an abstract vector which when used for virtual dispatch of subsequent APIs results into bi-morphic inlining on account of multiple profile based receiver types. Consider following code snippet.
>
>
> ByteVector sliceTemplate(int origin, Vector v1) {
>     ByteVector that = (ByteVector) v1;
>     that.check(this);
>     Objects.checkIndex(origin, length() + 1);
>     VectorShuffle iota = iotaShuffle();
>     VectorMask blendMask = iota.toVector().compare(VectorOperators.LT, (broadcast((byte)(length() - origin)))); [A]
>     iota = iotaShuffle(origin, 1, true); [B]
>     return that.rearrange(iota).blend(this.rearrange(iota), blendMask); [C]
> }
>
>
> Receiver for sliceTemplate is a 256 bit vector, parser defers inlining of toVector() API (see code at line A) and generates a Call IR returning an abstract vector. This abstract vector then virtually dispatches compare API. Compiler observes multiple profile based receiver types (128 and 256 bit byte vectors) for compare API and parser generates a chain of PredictedCallGenerators for bi-morphically inlining it.
>
> PredictedCallGenerators (Vector.compare)
> PredictedCallGenerators (Byte256Vector.compare)
> ParseGenerator (Byte256Vector.compare) [D]
> UncommonTrap (receiver other than Byte256Vector)
> PredictedCallGenerators (Byte128Vector.compare)
> ParseGenerator (Byte128Vector.compare) [E...

Overall, both fixes look good. I suggest handling the bugs separately (as two bug fixes).

src/hotspot/share/opto/vectorIntrinsics.cpp line 164:

> 162: Node* GraphKit::unbox_vector(Node* v, const TypeInstPtr* vbox_type, BasicType elem_bt, int num_elem, bool shuffle_to_vector) {
> 163:   assert(EnableVectorSupport, "");
> 164:   const TypePtr* vbox_type_v = gvn().type(v)->isa_ptr();

You can use `isa_instptr()` and check for `nullptr` instead.

    const TypeInstPtr* vbox_type_v = gvn().type(v)->isa_instptr();
    if (vbox_type_v == nullptr || vbox_type->instance_klass() != vbox_type_v->instance_klass()) {
      return nullptr; // arguments don't agree on vector shapes
    }

------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18282#pullrequestreview-1935065932 PR Review Comment: https://git.openjdk.org/jdk/pull/18282#discussion_r1523825589 From sspitsyn at openjdk.org Wed Mar 13 20:45:47 2024 From: sspitsyn at openjdk.org (Serguei Spitsyn) Date: Wed, 13 Mar 2024 20:45:47 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v31] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 16:55:28 GMT, Dmitry Chuyko wrote:

>> src/hotspot/share/ci/ciEnv.cpp line 1144:
>>
>>> 1142:
>>> 1143: if (entry_bci == InvocationEntryBci) {
>>> 1144: if (TieredCompilation) {
>>
>> Just a naive question. Why has this check been removed?
>
> We want to allow replacement of a C2 method version by another C2 version of the same method in both tiered and non-tiered mode, which was not allowed

Okay, thanks.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14111#discussion_r1523889164 From sspitsyn at openjdk.org Wed Mar 13 20:56:45 2024 From: sspitsyn at openjdk.org (Serguei Spitsyn) Date: Wed, 13 Mar 2024 20:56:45 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v31] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 17:31:32 GMT, Dmitry Chuyko wrote:

>> Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack.
>>
>> A matching directive will be applied at method compilation time when such a compilation is started. If directives are added or changed but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long-running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug and issues such a directive, but this does not affect the application behavior. In such a case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers.
>>
>> It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypassing inlined methods).
>>
>> The natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization.
Prior to that, we can try to re-compile the method, letting the compile broker perform it taking the new directives stack into account. Re-compilation helps to prevent hot methods from executing in the interpreter.
>>
>> A new flag `-r` has been introduced for some directives-related compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and puts methods that have any active non-default matching compiler directives to re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction as to which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them.
>>
>> In addition, a new diagnostic command `Compiler.replace_directives...
>
> Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision:
>
> No dots in -r descriptions

The fix looks good. But I do not have expertise in the compiler-specific part. So, a review from the Compiler team is still required.

------------- Marked as reviewed by sspitsyn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/14111#pullrequestreview-1935188358 From dholmes at openjdk.org Wed Mar 13 22:56:41 2024 From: dholmes at openjdk.org (David Holmes) Date: Wed, 13 Mar 2024 22:56:41 GMT Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v5] In-Reply-To: References: Message-ID: <8vGIupwyAcYvKCUiWiQJpBIWNsHL3b0kjo4miCNiM4g=.ddc38eec-05ff-42d4-8007-37dca0a7169f@github.com> On Thu, 7 Mar 2024 14:05:18 GMT, Oussama Louati wrote:

>> Oussama Louati has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Fix typo in error message in GenManyIndyIncorrectBootstrap.java
>
> I ran the JTreg test on this PR head after the full conversion of these tests, and nothing unusual happened; the failures aren't explicitly related to anything else.

@OssamaLouati thanks for the work you have put into doing this upgrade of the tests. That said, I do have a few concerns about this change, but let me start by asking what testing you have performed using the Oracle CI infrastructure. We need to see a full tier 1 - 8 test run on all platforms to ensure this switch is not introducing new timeout failures or OOM conditions due to the use of this new API. Our `-Xcomp` runs in particular may be adversely affected, depending on the number of classes involved compared to ASM. This is difficult to review because we lack HotSpot engineers who know the new ClassFile API.

------------- PR Comment: https://git.openjdk.org/jdk/pull/17834#issuecomment-1996042042 From ddong at openjdk.org Thu Mar 14 05:16:39 2024 From: ddong at openjdk.org (Denghui Dong) Date: Thu, 14 Mar 2024 05:16:39 GMT Subject: RFR: 8327661: C1: Make RBP allocatable on x64 when PreserveFramePointer is disabled [v3] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 06:49:30 GMT, Denghui Dong wrote:

>> Hi,
>>
>> Could I have a review of this change that makes RBP allocatable in c1 register allocation when PreserveFramePointer is not enabled?
>>
>> There seems to be no reason that RBP cannot be used. Although the performance of c1 jit code is not very critical, in my opinion, this change will not add overhead to compilation. So maybe it is acceptable.
>>
>> I am not very sure if I have changed all the places that should be changed.
>>
>> Testing: fastdebug tier1-4 on Linux x64
>
> Denghui Dong has updated the pull request incrementally with one additional commit since the last revision:
>
> delete jmh

src/hotspot/share/c1/c1_LinearScan.cpp line 5755:

> 5753: bool LinearScanWalker::no_allocation_possible(Interval* cur) {
> 5754: #ifdef X86
> 5755: #ifndef _LP64

rbp is callee-saved, so the following logic doesn't work. That'll slow down the allocation.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18167#discussion_r1524251207 From ksakata at openjdk.org Thu Mar 14 06:05:37 2024 From: ksakata at openjdk.org (Koichi Sakata) Date: Thu, 14 Mar 2024 06:05:37 GMT Subject: RFR: 8320404: Double whitespace in SubTypeCheckNode::dump_spec output In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 07:41:29 GMT, Koichi Sakata wrote:

> This is a trivial change to remove an extra whitespace.
>
> A double whitespace is printed because method->print_short_name already adds a whitespace before the name.
>
> ### Test
>
> For testing, I modified the ProfileAtTypeCheck class to fail a test case and display the message. Specifically, I changed the number of the count element in the IR annotation below.
>
>
> @Test
> @IR(phase = { CompilePhase.AFTER_PARSING }, counts = { IRNode.SUBTYPE_CHECK, "1" })
> @IR(phase = { CompilePhase.AFTER_MACRO_EXPANSION }, counts = { IRNode.CMP_P, "5", IRNode.LOAD_KLASS_OR_NKLASS, "2", IRNode.PARTIAL_SUBTYPE_CHECK, "1" })
> public static void test15(Object o) {
>
>
> This change was only for testing, so I reverted back to the original code after the test.
> > #### Execution Result > > Before the change: > > $ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java" > ... > Failed IR Rules (1) of Methods (1) > ---------------------------------- > 1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, ap > plyIfAnd={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\d+(\s){2}(SubTypeCheck.*)+(\s){2}===.*)" > - Failed comparison: [found] 1 = 11 [given] > - Matched node: > * 53 SubTypeCheck === _ 44 35 [[ 58 ]] profiled at: compiler.c2.irTests.ProfileAtTypeCheck::test15:5 !jvms: ProfileAtTypeCheck::test15 @ bci:5 (line 399) > > > After the change: > > $ make test TEST="test/hotspot/jtreg/compiler/c2/irTests/ProfileAtTypeCheck.java" > ... > Failed IR Rules (1) of Methods (1) > ---------------------------------- > 1) Method "public static void compiler.c2.irTests.ProfileAtTypeCheck.test15(java.lang.Object)" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#SUBTYPE_CHECK#_", "11"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, ap > plyIfAnd={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Cons... Could someone please review this pull request? I'd like to have another reviewer. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/18181#issuecomment-1996588423 From fyang at openjdk.org Thu Mar 14 06:55:38 2024 From: fyang at openjdk.org (Fei Yang) Date: Thu, 14 Mar 2024 06:55:38 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v6] In-Reply-To: References: Message-ID: <9165g_CT_MsS8iXGsBQyyqaHcfcLZB38Ivziz4Ix3TI=.3887b0ad-eec7-4456-9bb7-fb4a3e8802b1@github.com> On Wed, 13 Mar 2024 17:05:41 GMT, Hamlin Li wrote:

>> Hi,
>> Can you help to review the patch to add support for some vector intrinsics?
>> Also complement various tests on riscv.
>> Thanks.
>>
>> ## Test
>> test/hotspot/jtreg/compiler/vectorapi/
>> test/hotspot/jtreg/compiler/vectorization/
>
> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision:
>
> remove unused instructions; rename instructions

test/hotspot/jtreg/compiler/vectorapi/reshape/TestVectorCastRVV.java line 32:

> 30: /*
> 31: * @test
> 32: * @bug 8259610

You might want to change this bug id.

test/hotspot/jtreg/compiler/vectorapi/reshape/utils/TestCastMethods.java line 373:

> 371: // to X 64
> 372: makePair(FSPEC64, ISPEC64),
> 373: makePair(FSPEC64, ISPEC64, true),

Does it make sense to specify `unsignedCast` to true when one of the operands is of type VectorSpecies? I don't see test items like this for other targets like aarch64 neon/sve.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1524314068 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1524313821 From chagedorn at openjdk.org Thu Mar 14 07:14:02 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 14 Mar 2024 07:14:02 GMT Subject: RFR: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If Message-ID: This is a follow-up to the previous refactoring done in https://github.com/openjdk/jdk/pull/18080.
The patch starts to replace the usages of `create_bool_from_template_assertion_predicate()` by providing a refactored and fixed cloning algorithm.

#### How `create_bool_from_template_assertion_predicate()` Works

Currently, the algorithm in `create_bool_from_template_assertion_predicate()` uses an iterative DFS walk to find all nodes of a Template Assertion Predicate Expression in order to clone them. We do the following:

1. Follow all inputs if they could be a node that's part of a Template Assertion Predicate (compares opcodes): https://github.com/openjdk/jdk/blob/326c91e1a28ec70822ef927ee9ab17f79aa6d35c/src/hotspot/share/opto/loopTransform.cpp#L1513
2. Once we find an `OpaqueLoopInit` or `OpaqueLoopStride` node, we start backtracking in the DFS. While doing so, we start to clone all nodes on the path from the `OpaqueLoop*` node to the start node and already update the graph.

This logic is quite complex and difficult to understand since we do everything simultaneously. This was one of the reasons I originally tried to refactor this method in https://github.com/openjdk/jdk/pull/16877, because I needed to extend it for the full fix of Assertion Predicates in JDK-8288981.

#### Missing Visited Set

The current implementation of `create_bool_from_template_assertion_predicate()` does not use a visited set. This means that whenever we find a diamond shape, we could visit a node twice and re-discover all paths above this diamond again:

        ...
         |
         E
         |
         D
        / \
       B   C
        \ /
         A

    DFS walk: A -> B -> D -> E -> ... -> C -> D -> E -> ...

With each diamond, the number of revisits of each node above doubles.

#### Endless DFS in Edge-Cases

In most cases, we would normally just stop quite quickly once we follow a data node that is not part of a Template Assertion Predicate Expression because the node opcode is different. However, in the test cases, we create a long chain of data nodes with many diamonds that could all be part of a Template Assertion Predicate Expression (i.e.
`is_part_of_template_assertion_predicate_bool()` would return true to follow the inputs in a DFS walk). As a result, the DFS revisits a lot of nodes, especially higher up in the graph, exponentially many times, and compilation is stuck for a long time (running the test cases results in a test timeout because background compilation is disabled).

#### New DFS Implementation

The new algorithm again uses an iterative DFS walk but maintains a visited set to avoid this problem. The implementation is found in the new class `DataNodesOnPathToTargets`. It is written in a generic way such that it could potentially be reused at some point (i.e. using "source" and "target" instead of "opaque4" and "opaque loop nodes").

#### New Template Assertion Predicate Expression Cloning Algorithm

There is now a new class `TemplateAssertionPredicateExpression` that does the cloning of the Template Assertion Predicate Expression in the following way:

1. Collect the nodes to clone with `DataNodesOnPathToTargets`.
2. Clone the collected nodes by reusing and extending `DataNodeGraph`.

#### Only Replacing Usages in Loop Unswitching and Split If

This patch only replaces the usages of `create_bool_from_template_assertion_predicate()` in Loop Unswitching and Split If, which need an identical copy of Template Assertion Predicate Expressions. In JDK-8327111, I will replace the remaining usages, which require a transformation of the `OpaqueLoop*Nodes`, by adding additional strategies which implement the `TransformStrategyForOpaqueLoopNodes` interface.
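The effect of the visited set can be sketched in a few lines of plain Java (this is an illustrative simplification, not the HotSpot implementation; it assumes every reached node belongs to the expression, so all reached nodes count as "on a path" to the targets):

```java
import java.util.*;

// Iterative DFS that collects every node between a source and a set of
// target nodes. The visited set guarantees each node is expanded at most
// once, so a diamond (two users sharing one input) no longer causes the
// subgraph above it to be re-walked exponentially many times.
class DataNodeCollector {
    static List<String> collect(Map<String, List<String>> inputs,
                                String source, Set<String> targets) {
        Set<String> visited = new HashSet<>();
        List<String> collected = new ArrayList<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(source);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            if (!visited.add(node)) {
                continue; // diamond merge point: already expanded
            }
            collected.add(node);
            for (String in : inputs.getOrDefault(node, List.of())) {
                if (!targets.contains(in)) {
                    stack.push(in); // targets themselves are not cloned
                }
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        // Diamond: A has inputs B and C, which share input D; D feeds E,
        // and E's input plays the role of an OpaqueLoopInit target.
        Map<String, List<String>> inputs = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("D"),
            "C", List.of("D"),
            "D", List.of("E"),
            "E", List.of("init"));
        System.out.println(collect(inputs, "A", Set.of("init")).size()); // 5
    }
}
```

Without the `visited.add` check, D and E would be reached once per diamond path, which is exactly the exponential revisit behavior described above.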
#### Other Work Left for https://github.com/openjdk/jdk/pull/16877

- Clean up the `Split If` code to clone down Template Assertion Predicate Expressions
- Remove `is_part_of_template_assertion_predicate_bool()` and `subgraph_has_opaque()`
- More renaming and small refactoring

Thanks, Christian

------------- Commit messages: - 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix pure cloning cases used for Loop Unswitching and Split If Changes: https://git.openjdk.org/jdk/pull/18293/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18293&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8327110 Stats: 418 lines in 9 files changed: 407 ins; 0 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/18293.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18293/head:pull/18293 PR: https://git.openjdk.org/jdk/pull/18293 From epeter at openjdk.org Thu Mar 14 07:14:43 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 14 Mar 2024 07:14:43 GMT Subject: RFR: 8327423: C2 remove_main_post_loops: check if main-loop belongs to pre-loop, not just assert [v2] In-Reply-To: <-XVbBb-rqIblr2ytrCprcBv7Kg_TW4lpgeE8ZACVIuw=.e77a67bd-d038-4a08-9969-4ce6d3e27309@github.com> References: <-XVbBb-rqIblr2ytrCprcBv7Kg_TW4lpgeE8ZACVIuw=.e77a67bd-d038-4a08-9969-4ce6d3e27309@github.com> Message-ID: <9MmOxsPH9fmeyU5VCwyxfTSSVVTNuIzJhv3IYZ6zET8=.269f6889-7c0e-4503-a516-5f9d16c63015@github.com> On Tue, 12 Mar 2024 08:35:02 GMT, Roland Westrelin wrote:

>> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Update test/hotspot/jtreg/compiler/loopopts/TestEmptyPreLoopForDifferentMainLoop.java
>>
>> Co-authored-by: Christian Hagedorn
>
> Looks good to me.

@rwestrel @chhagedorn @vnkozlov thanks for the reviews!
------------- PR Comment: https://git.openjdk.org/jdk/pull/18200#issuecomment-1996707419 From epeter at openjdk.org Thu Mar 14 07:14:44 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 14 Mar 2024 07:14:44 GMT Subject: Integrated: 8327423: C2 remove_main_post_loops: check if main-loop belongs to pre-loop, not just assert In-Reply-To: References: Message-ID: <_ImYIDVeQUIpMQQyzdCT-Fkfe4J67_mWZhQ9_hiJ8Xw=.6af27b31-b29b-405d-aba1-54c7a3120ee5@github.com> On Mon, 11 Mar 2024 15:56:38 GMT, Emanuel Peter wrote:

> The assert was added in [JDK-8085832](https://bugs.openjdk.org/browse/JDK-8085832) (JDK9), by @rwestrel. And in [JDK-8297724](https://bugs.openjdk.org/browse/JDK-8297724) (JDK21), he made more empty loops be removed, and since then the attached regression test fails.
>
> ----------
>
> **Problem**
>
> By the time we get to the assert, we have already had a series of Pre-Main-Post, unroll, and empty-loop removal transformations:
> the PURPLE main and post loops were already removed earlier as empty loops.
>
> At the time of the assert, the graph looks like this:
> ![image](https://github.com/openjdk/jdk/assets/32593061/cb36eda4-0684-4b79-8557-0fdd5973ab50)
>
> We are in `IdealLoopTree::remove_main_post_loops` with the PURPLE `298 CountedLoop` as the `cl` pre-loop.
>
> The loop-tree looks essentially like this:
>
> (rr) p _ltree_root->dump()
> Loop: N0/N0 has_sfpt
> Loop: N425/N431 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre sfpts={ 429 }
> Loop: N298/N301 profile_predicated predicated counted [0,int),+1 (4 iters) pre
> Loop: N200/N179 counted [int,100),+1 (2147483648 iters) main sfpts={ 171 }
> Loop: N398/N404 counted [int,100),+1 (4 iters) post sfpts={ 402 }
>
>
> This is basically:
>
> 415 pre orange
> 298 pre PURPLE
> 200 main orange
> 398 post orange
>
>
> From `298 pre PURPLE`, we try to find its main-loop, by looking at the `_next` info in the loop-tree.
> There, we find `200 main orange`: it is a main-loop that still has a pre-loop... ...but not the same pre-loop as `cl` -> the `assert` fires. > > It seems that we assume in the code that we can check the `_next->_head`, and if: > 1) it is a main-loop and > 2) that main-loop still has a pre-loop > then the current pre-loop "cl" must be the pre-loop `_pre_from_main(main_head)` of that found main-loop. > But this is NOT generally guaranteed by "PhaseIdealLoop::build_loop_tree". > > The loop-tree is correct here, and this is how it was arrived at: > "415 CountedLoop" (pre orange) is visited, and its body traversed. "427 If" is traversed. Now the path splits. > If we first took the "428 IfFalse" path, then we would visit "200 CountedLoop" (main orange), and "398 CountedLoop" (post orange) first. > But we instead take "432 IfTrue" first, and hence visit "298 CountedLoop" (pre PURPLE) first. > > So depending on what turn we take at this "427 If", we either get the order: > > > 415 pre orange > 298 pre PURPLE > 200 main orange > 398 post orange > > (the one we get, and assert with) > > OR > > > 415 pre orange > 200 main orange > 398 post orange > 298 pre PURPLE > > (assert would not tr... This pull request has now been integrated.
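The check-instead-of-assert idea can be mocked up outside HotSpot like this (illustrative only; the struct and function names are hypothetical, not the actual loop-tree code):

```cpp
#include <cassert>
#include <cstddef>

// Minimal stand-in for a loop-tree sibling list: a main-loop records which
// pre-loop it belongs to. Instead of asserting that the next main-loop we
// stumble upon belongs to `cl`, we positively verify ownership and return
// nullptr when no matching main-loop exists, so the caller simply skips
// the optimization.
struct LoopNode {
  const char* name;
  bool        is_main;
  LoopNode*   pre;   // for a main-loop: its pre-loop; otherwise nullptr
  LoopNode*   next;  // sibling order as produced by the tree builder
};

LoopNode* find_main_for(LoopNode* cl) {
  for (LoopNode* n = cl->next; n != nullptr; n = n->next) {
    if (n->is_main && n->pre == cl) {
      return n;    // verified: this main-loop really belongs to cl
    }
  }
  return nullptr;  // no matching main-loop: bail out instead of asserting
}
```

Modeling the order from the report above (purple pre sitting between orange pre and orange main), the naive "next main-loop is mine" assumption fails for the purple pre-loop, while the explicit ownership check correctly finds nothing for it and still finds the orange main-loop for the orange pre-loop.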
Changeset: fadc4b19 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/fadc4b197e927cfa1814fe6cb65ee04b3bd4b0c2 Stats: 64 lines in 2 files changed: 61 ins; 2 del; 1 mod 8327423: C2 remove_main_post_loops: check if main-loop belongs to pre-loop, not just assert Reviewed-by: kvn, chagedorn, roland ------------- PR: https://git.openjdk.org/jdk/pull/18200 From tholenstein at openjdk.org Thu Mar 14 08:41:50 2024 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 14 Mar 2024 08:41:50 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v31] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 17:31:32 GMT, Dmitry Chuyko wrote: >> Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack. >> >> A matching directive will be applied at method compilation time when such compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug, issues such a directive but this does not affect the application behavior. In such case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. >> >> It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypass inlined methods). 
>> >> The natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that we can try to re-compile the method, letting the compile broker perform it taking the new directives stack into account. Re-compilation helps to prevent hot methods from executing in the interpreter. >> >> A new flag `-r` has been introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and submits methods that have any active non-default matching compiler directives for re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction as to which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. >> >> In addition, a new diagnostic command `Compiler.replace_directives... > > Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: > > No dots in -r descriptions Looks good to me too (Compiler Team) ------------- Marked as reviewed by tholenstein (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/14111#pullrequestreview-1936031298 From shade at openjdk.org Thu Mar 14 09:21:38 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 14 Mar 2024 09:21:38 GMT Subject: RFR: 8325613: CTW: Stale method cleanup requires GC after Sweeper removal In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 19:11:42 GMT, Aleksey Shipilev wrote: > See more details in the bug.
There is a double-whammy from two issues: a) Sweeper was removed, and now the cleanup work is done during GC, which does not really happen as CTW barely allocates anything; b) CTW calls for explicit deoptimization often, at which point CTW threads get mostly busy at spin-waiting-yielding for deopt epoch to move (that is why you see lots of `sys%`). (a) leads to stale methods buildup, which makes (b) progressively worse. > > This PR adds explicit GC calls to the CTW runner. Since CTW allocates and retains only a little, those GCs are quite fast. I chose the threshold by running some CTW tests on my machines. I think we are pretty flat in the 25..100 region, so I chose the higher threshold for additional safety. > > This patch improves both CPU and wall times for CTW testing dramatically, as you can see from the logs below. It still does not recuperate completely to JDK 17 levels, but at least it is not regressing as badly. > > > --- x86_64 EC2, applications/ctw/modules CTW > > jdk17u-dev: 4511.54s user 169.43s system 1209% cpu 6:27.07 total > current mainline: 11678.13s user 8687.06s system 2299% cpu 14:45.62 total > > GC every 25 methods: 5050.83s user 670.38s system 1629% cpu 5:51.04 total > GC every 50 methods: 4965.41s user 709.64s system 1670% cpu 5:39.77 total > GC every 100 methods: 4997.34s user 782.12s system 1680% cpu 5:43.99 total > GC every 200 methods: 5237.76s user 943.51s system 1788% cpu 5:45.59 total > GC every 400 methods: 5851.24s user 1443.16s system 1914% cpu 6:20.99 total > GC every 800 methods: 7010.06s user 2649.35s system 2079% cpu 7:44.48 total > GC every 1600 methods: 9361.12s user 5616.84s system 2409% cpu 10:21.68 total > > --- Mac M1, applications/ctw/modules/java.base CTW > > jdk17u-dev: 171.93s user 25.33s system 157% cpu 2:05.34 total > current mainline: 1128.69s user 349.46s system 249% cpu 9:52.51 total > > GC every 25 methods: 252.31s user 29.98s system 172% cpu 2:43.68 total > GC every 50 methods: 232.53s user 28.49s system 170% cpu 2:32.69
total > GC every 100 methods: 237.38s user 34.53s system 169% cpu 2:40.54 total > GC every 200 methods: 251.70s user 39.60s system 172% cpu 2:48.40 total > GC every 400 methods: 271.50s user 42.55s system 185% cpu 2:49.66 total > GC every 800 methods: 389.51s user 69.41s system 204% cpu 3:44.01 total > GC every 1600 methods: 660.98s user 169.97s system 229% cpu 6:01.78 total Thanks! Any additional reviews, maybe @TobiHartmann or @chhagedorn ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/18249#issuecomment-1996994554 From dchuyko at openjdk.org Thu Mar 14 09:22:47 2024 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Thu, 14 Mar 2024 09:22:47 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v31] In-Reply-To: References: Message-ID: On Wed, 13 Mar 2024 17:31:32 GMT, Dmitry Chuyko wrote: >> Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack. >> >> A matching directive will be applied at method compilation time when such compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug, issues such a directive but this does not affect the application behavior. In such case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. 
>> >> It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypass inlined methods). >> >> The natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that we can try to re-compile the method, letting the compile broker perform it taking the new directives stack into account. Re-compilation helps to prevent hot methods from executing in the interpreter. >> >> A new flag `-r` has been introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and submits methods that have any active non-default matching compiler directives for re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction as to which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. >> >> In addition, a new diagnostic command `Compiler.replace_directives... > Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: > > No dots in -r descriptions Thank you, Serguei and Tobias.
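For reference, the directive stack these commands operate on is built from directives files like the minimal sketch below (the match pattern and option are illustrative, not taken from this PR):

```text
[
  {
    match: "java/lang/String.indexOf",
    c2: {
      Exclude: true
    }
  }
]
```

With the new flag from this change, such a file could presumably be applied to a running JVM with something like `jcmd <pid> Compiler.add_directives -r /path/to/directives.json`, asking the VM to also reconcile already compiled methods rather than only affecting future compilations.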
------------- PR Comment: https://git.openjdk.org/jdk/pull/14111#issuecomment-1996995661 From dchuyko at openjdk.org Thu Mar 14 09:26:00 2024 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Thu, 14 Mar 2024 09:26:00 GMT Subject: RFR: 8309271: A way to align already compiled methods with compiler directives [v32] In-Reply-To: References: Message-ID: > Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack. > > A matching directive will be applied at method compilation time when such compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug, issues such a directive but this does not affect the application behavior. In such case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers. > > It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypass inlined methods). > > The natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that we can try to re-compile the method, letting the compile broker perform it taking the new directives stack into account.
Re-compilation helps to prevent hot methods from executing in the interpreter. > > A new flag `-r` has been introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and submits methods that have any active non-default matching compiler directives for re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction as to which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them. > > In addition, a new diagnostic command `Compiler.replace_directives` has been added for ... Dmitry Chuyko has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 50 commits: - Merge branch 'openjdk:master' into compiler-directives-force-update - No dots in -r descriptions - Merge branch 'openjdk:master' into compiler-directives-force-update - Resolved master conflicts - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - Merge branch 'openjdk:master' into compiler-directives-force-update - ...
and 40 more: https://git.openjdk.org/jdk/compare/49ce85fa...eb4ed2ea ------------- Changes: https://git.openjdk.org/jdk/pull/14111/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14111&range=31 Stats: 381 lines in 15 files changed: 348 ins; 3 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/14111.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14111/head:pull/14111 PR: https://git.openjdk.org/jdk/pull/14111 From mli at openjdk.org Thu Mar 14 09:30:10 2024 From: mli at openjdk.org (Hamlin Li) Date: Thu, 14 Mar 2024 09:30:10 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v6] In-Reply-To: <9165g_CT_MsS8iXGsBQyyqaHcfcLZB38Ivziz4Ix3TI=.3887b0ad-eec7-4456-9bb7-fb4a3e8802b1@github.com> References: <9165g_CT_MsS8iXGsBQyyqaHcfcLZB38Ivziz4Ix3TI=.3887b0ad-eec7-4456-9bb7-fb4a3e8802b1@github.com> Message-ID: On Thu, 14 Mar 2024 06:47:42 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> remove unused instructions; rename instructions > test/hotspot/jtreg/compiler/vectorapi/reshape/TestVectorCastRVV.java line 32: > >> 30: /* >> 31: * @test >> 32: * @bug 8259610 > > You might want to change this bug id. fixed. > test/hotspot/jtreg/compiler/vectorapi/reshape/utils/TestCastMethods.java line 373: > >> 371: // to X 64 >> 372: makePair(FSPEC64, ISPEC64), >> 373: makePair(FSPEC64, ISPEC64, true), > > Does it make sense to specify `unsignedCast` as true when one of the operands is of type VectorSpecies? I don't see test items like this for other targets like aarch64 neon/sve. Thanks for catching, fixed.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1524521620 PR Review Comment: https://git.openjdk.org/jdk/pull/18040#discussion_r1524521562 From mli at openjdk.org Thu Mar 14 09:30:09 2024 From: mli at openjdk.org (Hamlin Li) Date: Thu, 14 Mar 2024 09:30:09 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v7] In-Reply-To: References: Message-ID: > Hi, > Can you help to review the patch to add support for some vector intrinsics? > Also complement various tests on riscv. > Thanks. > > ## Test > test/hotspot/jtreg/compiler/vectorapi/ > test/hotspot/jtreg/compiler/vectorization/ Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: remove unused test cases; fix bug id ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18040/files - new: https://git.openjdk.org/jdk/pull/18040/files/3fb61768..7844a987 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18040&range=05-06 Stats: 29 lines in 2 files changed: 0 ins; 27 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/18040.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18040/head:pull/18040 PR: https://git.openjdk.org/jdk/pull/18040 From fyang at openjdk.org Thu Mar 14 09:37:42 2024 From: fyang at openjdk.org (Fei Yang) Date: Thu, 14 Mar 2024 09:37:42 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v7] In-Reply-To: References: Message-ID: On Thu, 14 Mar 2024 09:30:09 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review the patch to add support for some vector intrinsics? >> Also complement various tests on riscv. >> Thanks. >> >> ## Test >> test/hotspot/jtreg/compiler/vectorapi/ >> test/hotspot/jtreg/compiler/vectorization/ > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > remove unused test cases; fix bug id Updated change LGTM. Thanks. 
------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18040#pullrequestreview-1936159000 From chagedorn at openjdk.org Thu Mar 14 09:48:39 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 14 Mar 2024 09:48:39 GMT Subject: RFR: 8325613: CTW: Stale method cleanup requires GC after Sweeper removal In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 19:11:42 GMT, Aleksey Shipilev wrote: > See more details in the bug. There is a double-whammy from two issues: a) Sweeper was removed, and now the cleanup work is done during GC, which does not really happen as CTW barely allocates anything; b) CTW calls for explicit deoptimization often, at which point CTW threads get mostly busy at spin-waiting-yielding for deopt epoch to move (that is why you see lots of `sys%`). (a) leads to stale methods buildup, which makes (b) progressively worse. > > This PR adds explicit GC calls to the CTW runner. Since CTW allocates and retains only a little, those GCs are quite fast. I chose the threshold by running some CTW tests on my machines. I think we are pretty flat in the 25..100 region, so I chose the higher threshold for additional safety. > > This patch improves both CPU and wall times for CTW testing dramatically, as you can see from the logs below. It still does not recuperate completely to JDK 17 levels, but at least it is not regressing as badly.
> > > --- x86_64 EC2, applications/ctw/modules CTW > > jdk17u-dev: 4511.54s user 169.43s system 1209% cpu 6:27.07 total > current mainline: 11678.13s user 8687.06s system 2299% cpu 14:45.62 total > > GC every 25 methods: 5050.83s user 670.38s system 1629% cpu 5:51.04 total > GC every 50 methods: 4965.41s user 709.64s system 1670% cpu 5:39.77 total > GC every 100 methods: 4997.34s user 782.12s system 1680% cpu 5:43.99 total > GC every 200 methods: 5237.76s user 943.51s system 1788% cpu 5:45.59 total > GC every 400 methods: 5851.24s user 1443.16s system 1914% cpu 6:20.99 total > GC every 800 methods: 7010.06s user 2649.35s system 2079% cpu 7:44.48 total > GC every 1600 methods: 9361.12s user 5616.84s system 2409% cpu 10:21.68 total > > --- Mac M1, applications/ctw/modules/java.base CTW > > jdk17u-dev: 171.93s user 25.33s system 157% cpu 2:05.34 total > current mainline: 1128.69s user 349.46s system 249% cpu 9:52.51 total > > GC every 25 methods: 252.31s user 29.98s system 172% cpu 2:43.68 total > GC every 50 methods: 232.53s user 28.49s system 170% cpu 2:32.69 total > GC every 100 methods: 237.38s user 34.53s system 169% cpu 2:40.54 total > GC every 200 methods: 251.70s user 39.60s system 172% cpu 2:48.40 total > GC every 400 methods: 271.50s user 42.55s system 185% cpu 2:49.66 total > GC every 800 methods: 389.51s user 69.41s system 204% cpu 3:44.01 total > GC every 1600 methods: 660.98s user 169.97s system 229% cpu 6:01.78 total Looks reasonable to me, too. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18249#pullrequestreview-1936183471 From shade at openjdk.org Thu Mar 14 10:29:49 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 14 Mar 2024 10:29:49 GMT Subject: RFR: 8325613: CTW: Stale method cleanup requires GC after Sweeper removal In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 19:11:42 GMT, Aleksey Shipilev wrote: > See more details in the bug. 
There is a double-whammy from two issues: a) Sweeper was removed, and now the cleanup work is done during GC, which does not really happen as CTW barely allocates anything; b) CTW calls for explicit deoptimization often, at which point CTW threads get mostly busy at spin-waiting-yielding for deopt epoch to move (that is why you see lots of `sys%`). (a) leads to stale methods buildup, which makes (b) progressively worse. > > This PR adds explicit GC calls to the CTW runner. Since CTW allocates and retains only a little, those GCs are quite fast. I chose the threshold by running some CTW tests on my machines. I think we are pretty flat in the 25..100 region, so I chose the higher threshold for additional safety. > > This patch improves both CPU and wall times for CTW testing dramatically, as you can see from the logs below. It still does not recuperate completely to JDK 17 levels, but at least it is not regressing as badly. > > > --- x86_64 EC2, applications/ctw/modules CTW > > jdk17u-dev: 4511.54s user 169.43s system 1209% cpu 6:27.07 total > current mainline: 11678.13s user 8687.06s system 2299% cpu 14:45.62 total > > GC every 25 methods: 5050.83s user 670.38s system 1629% cpu 5:51.04 total > GC every 50 methods: 4965.41s user 709.64s system 1670% cpu 5:39.77 total > GC every 100 methods: 4997.34s user 782.12s system 1680% cpu 5:43.99 total > GC every 200 methods: 5237.76s user 943.51s system 1788% cpu 5:45.59 total > GC every 400 methods: 5851.24s user 1443.16s system 1914% cpu 6:20.99 total > GC every 800 methods: 7010.06s user 2649.35s system 2079% cpu 7:44.48 total > GC every 1600 methods: 9361.12s user 5616.84s system 2409% cpu 10:21.68 total > > --- Mac M1, applications/ctw/modules/java.base CTW > > jdk17u-dev: 171.93s user 25.33s system 157% cpu 2:05.34 total > current mainline: 1128.69s user 349.46s system 249% cpu 9:52.51 total > > GC every 25 methods: 252.31s user 29.98s system 172% cpu 2:43.68 total > GC every 50 methods: 232.53s user 28.49s system 170% cpu 2:32.69
total > GC every 100 methods: 237.38s user 34.53s system 169% cpu 2:40.54 total > GC every 200 methods: 251.70s user 39.60s system 172% cpu 2:48.40 total > GC every 400 methods: 271.50s user 42.55s system 185% cpu 2:49.66 total > GC every 800 methods: 389.51s user 69.41s system 204% cpu 3:44.01 total > GC every 1600 methods: 660.98s user 169.97s system 229% cpu 6:01.78 total All right, thanks! I checked that both fastdebug and release binaries work well with java.base tests too. It also improves large CTW run times significantly. We are able to CTW 130K JARs in 24 hours now, about a 3x improvement. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18249#issuecomment-1997118203 From shade at openjdk.org Thu Mar 14 10:29:49 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 14 Mar 2024 10:29:49 GMT Subject: Integrated: 8325613: CTW: Stale method cleanup requires GC after Sweeper removal In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 19:11:42 GMT, Aleksey Shipilev wrote: > See more details in the bug. There is a double-whammy from two issues: a) Sweeper was removed, and now the cleanup work is done during GC, which does not really happen as CTW barely allocates anything; b) CTW calls for explicit deoptimization often, at which point CTW threads get mostly busy at spin-waiting-yielding for deopt epoch to move (that is why you see lots of `sys%`). (a) leads to stale methods buildup, which makes (b) progressively worse. > > This PR adds explicit GC calls to the CTW runner. Since CTW allocates and retains only a little, those GCs are quite fast. I chose the threshold by running some CTW tests on my machines. I think we are pretty flat in the 25..100 region, so I chose the higher threshold for additional safety. > > This patch improves both CPU and wall times for CTW testing dramatically, as you can see from the logs below. It still does not recuperate completely to JDK 17 levels, but at least it is not regressing as badly.
> > > --- x86_64 EC2, applications/ctw/modules CTW > > jdk17u-dev: 4511.54s user 169.43s system 1209% cpu 6:27.07 total > current mainline: 11678.13s user 8687.06s system 2299% cpu 14:45.62 total > > GC every 25 methods: 5050.83s user 670.38s system 1629% cpu 5:51.04 total > GC every 50 methods: 4965.41s user 709.64s system 1670% cpu 5:39.77 total > GC every 100 methods: 4997.34s user 782.12s system 1680% cpu 5:43.99 total > GC every 200 methods: 5237.76s user 943.51s system 1788% cpu 5:45.59 total > GC every 400 methods: 5851.24s user 1443.16s system 1914% cpu 6:20.99 total > GC every 800 methods: 7010.06s user 2649.35s system 2079% cpu 7:44.48 total > GC every 1600 methods: 9361.12s user 5616.84s system 2409% cpu 10:21.68 total > > --- Mac M1, applications/ctw/modules/java.base CTW > > jdk17u-dev: 171.93s user 25.33s system 157% cpu 2:05.34 total > current mainline: 1128.69s user 349.46s system 249% cpu 9:52.51 total > > GC every 25 methods: 252.31s user 29.98s system 172% cpu 2:43.68 total > GC every 50 methods: 232.53s user 28.49s system 170% cpu 2:32.69 total > GC every 100 methods: 237.38s user 34.53s system 169% cpu 2:40.54 total > GC every 200 methods: 251.70s user 39.60s system 172% cpu 2:48.40 total > GC every 400 methods: 271.50s user 42.55s system 185% cpu 2:49.66 total > GC every 800 methods: 389.51s user 69.41s system 204% cpu 3:44.01 total > GC every 1600 methods: 660.98s user 169.97s system 229% cpu 6:01.78 total This pull request has now been integrated. 
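The throttling scheme behind these numbers — request an explicit GC every N processed methods so stale compiled methods get flushed promptly — is generic enough to sketch outside the JDK. This is illustrative only; the real change lives in the CTW runner, and the `GcThrottle` name and callback are made up:

```cpp
#include <functional>
#include <utility>

// Illustrative throttle: fire a GC request every `threshold` processed
// methods, since without a Sweeper the cleanup only happens during GC.
class GcThrottle {
  int _count = 0;
  const int _threshold;
  std::function<void()> _request_gc;
public:
  GcThrottle(int threshold, std::function<void()> request_gc)
      : _threshold(threshold), _request_gc(std::move(request_gc)) {}

  // Call once per compiled method; returns true when a GC was requested.
  bool method_processed() {
    if (++_count % _threshold == 0) {
      _request_gc();
      return true;
    }
    return false;
  }
};
```

The measurements above show why the threshold matters: too small and the GC overhead itself dominates, too large and stale methods pile up again; the 25..100 region was flat, so the higher end was chosen for safety.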
Changeset: 1281e18f Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/1281e18f1447848d7eb5e3bde508ac002b4c390d Stats: 26 lines in 2 files changed: 24 ins; 1 del; 1 mod 8325613: CTW: Stale method cleanup requires GC after Sweeper removal Reviewed-by: roland, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/18249 From mli at openjdk.org Thu Mar 14 11:23:56 2024 From: mli at openjdk.org (Hamlin Li) Date: Thu, 14 Mar 2024 11:23:56 GMT Subject: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v6] In-Reply-To: References: Message-ID: > Hi, > Can you have a look at this patch adding some tests for Math.round intrinsics? > Thanks! > > ### FYI: > During the development of RoundVF/RoundF, we faced issues which were only spotted by running tests exhaustively against the full 32/64-bit range of int/long. > It's helpful to add these exhaustive tests in jdk for future possible usage, rather than building them every time they are needed. > Of course, we need to put it in `manual` mode, so it's not run when the `-automatic` jtreg option is specified, which I guess is the mode the CI uses; please correct me if I'm assuming incorrectly. Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
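The exhaustive-testing idea — sweep every raw bit pattern of the floating-point type and compare an optimized rounding routine against a slow but exact reference — can be sketched generically. This is an illustrative C++ mock, not the jtreg test; Java's `Math.round(float)` rounds half toward positive infinity, which the double-precision reference below implements exactly for small finite floats:

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Exact reference: float -> double conversion is lossless, and adding 0.5
// in double is exact for this magnitude range, so floor(x + 0.5) computes
// "round half toward +infinity" without any intermediate rounding error.
long ref_round(float x) {
  if (std::isnan(x)) return 0;
  return (long)std::floor((double)x + 0.5);
}

// Naive single-precision variant: the float addition of 0.5 can itself
// round upward -- exactly the corner case an exhaustive sweep catches.
long naive_round(float x) {
  if (std::isnan(x)) return 0;
  return (long)std::floor(x + 0.5f);
}

// Sweep a contiguous window of raw bit patterns; return the first
// mismatching pattern, or `to` when the window is clean. (Windows here
// stay in small finite ranges; full-range sweeps would need the clamping
// that Math.round specifies for NaN and out-of-range values.)
uint32_t first_mismatch(uint32_t from, uint32_t to) {
  for (uint32_t bits = from; bits < to; bits++) {
    float x;
    std::memcpy(&x, &bits, sizeof x);
    if (ref_round(x) != naive_round(x)) return bits;
  }
  return to;
}
```

A sweep of all 2^32 float patterns in this style runs in seconds, and even a tiny window just below 0.5f exposes the naive variant: for the bit pattern 0x3EFFFFFF (the largest float below 0.5), the float sum rounds up to exactly 1.0f, so the naive version answers 1 while the exact reference answers 0.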
The pull request contains ten additional commits since the last revision: - Fix test failure in TestRoundVectorDoubleRandom.java - Merge branch 'master' into round-v-exhaustive-tests - add comments; refine code - refine code; fix bug - Merge branch 'master' into round-v-exhaustive-tests - fix issue - mv tests - use IR framework to construct the random tests - Initial commit ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17753/files - new: https://git.openjdk.org/jdk/pull/17753/files/2afa8160..3f50c062 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17753&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17753&range=04-05 Stats: 98690 lines in 2066 files changed: 16605 ins; 76003 del; 6082 mod Patch: https://git.openjdk.org/jdk/pull/17753.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17753/head:pull/17753 PR: https://git.openjdk.org/jdk/pull/17753 From mli at openjdk.org Thu Mar 14 11:25:51 2024 From: mli at openjdk.org (Hamlin Li) Date: Thu, 14 Mar 2024 11:25:51 GMT Subject: Integrated: 8321021: RISC-V: C2 VectorUCastB2X In-Reply-To: References: Message-ID: <9R07HBfLtn-1M6gLoa-_-a1fWEYjrXWvyZjM_2mNIp0=.507eeabf-4b5a-4364-8326-1e05fdc481a2@github.com> On Wed, 28 Feb 2024 11:07:39 GMT, Hamlin Li wrote: > Hi, > Can you help to review the patch to add support for some vector intrinsics? > Also complement various tests on riscv. > Thanks. > > ## Test > test/hotspot/jtreg/compiler/vectorapi/ > test/hotspot/jtreg/compiler/vectorization/ This pull request has now been integrated. 
Changeset: 1d34b74a Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/1d34b74a64fba8d0d58dcbccc416379a4c915738 Stats: 639 lines in 5 files changed: 605 ins; 11 del; 23 mod 8321021: RISC-V: C2 VectorUCastB2X 8321023: RISC-V: C2 VectorUCastS2X 8321024: RISC-V: C2 VectorUCastI2X Reviewed-by: fyang ------------- PR: https://git.openjdk.org/jdk/pull/18040 From mli at openjdk.org Thu Mar 14 11:25:49 2024 From: mli at openjdk.org (Hamlin Li) Date: Thu, 14 Mar 2024 11:25:49 GMT Subject: RFR: 8321021: RISC-V: C2 VectorUCastB2X [v7] In-Reply-To: References: Message-ID: <9tNyag92zj8kR4JWbk3KCRMMqfTNwzEFu51vumlQPUI=.85c89672-a6bc-4ef2-a4b9-3faeea6e82d0@github.com> On Thu, 14 Mar 2024 09:35:02 GMT, Fei Yang wrote: > Updated change LGTM. Thanks. Thanks @RealFYang @zifeihan for your reviewing! ------------- PR Comment: https://git.openjdk.org/jdk/pull/18040#issuecomment-1997222178 From mli at openjdk.org Thu Mar 14 11:41:52 2024 From: mli at openjdk.org (Hamlin Li) Date: Thu, 14 Mar 2024 11:41:52 GMT Subject: RFR: 8321010: RISC-V: C2 RoundVF [v3] In-Reply-To: References: Message-ID: > Hi, > Can you have a review on this patch to add RoundVF/RoundDF intrinsics? > Thanks! > > ## Tests > > test/hotspot/jtreg/compiler/vectorization/TestRoundVectRiscv64.java test/hotspot/jtreg/compiler/c2/cr6340864/TestFloatVect.java test/hotspot/jtreg/compiler/c2/cr6340864/TestDoubleVect.java test/hotspot/jtreg/compiler/floatingpoint/TestRound.java > > test/jdk/java/lang/Math/RoundTests.java Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains ten commits: - merge master - fix space - add tests - add test cases - v2: (src + 0.5) + rdn - Fix corner cases - Merge branch 'master' into round-F+D-v - refine code - RoundVF/D: Initial commit ------------- Changes: https://git.openjdk.org/jdk/pull/17745/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17745&range=02 Stats: 234 lines in 7 files changed: 230 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/17745.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17745/head:pull/17745 PR: https://git.openjdk.org/jdk/pull/17745 From galder at openjdk.org Thu Mar 14 12:22:58 2024 From: galder at openjdk.org (Galder Zamarreño) Date: Thu, 14 Mar 2024 12:22:58 GMT Subject: RFR: 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays [v6] In-Reply-To: References: Message-ID: On Tue, 12 Mar 2024 18:16:24 GMT, Dean Long wrote: > IR expansion in append_alloc_array_copy() looks unconditional. What's going to happen on platforms with no back-end support? I might be wrong, but the way I understood the code, other platforms will have no issue here, given the way the current code works: The check I added in `Compiler::is_intrinsic_supported()` means that for clone calls on other platforms it would return false. If that returns false, `AbstractCompiler::is_intrinsic_available` will return false. Then this means that in `GraphBuilder::try_inline_intrinsics` `is_available` would be false, in which case the method will always return false and `build_graph_for_intrinsic` will not be called. `GraphBuilder::append_alloc_array_copy` is called from `build_graph_for_intrinsic`, so I don't see a danger of it being called for non-supported platforms.
-------------

PR Comment: https://git.openjdk.org/jdk/pull/17667#issuecomment-1997319836

From dchuyko at openjdk.org  Thu Mar 14 12:41:53 2024
From: dchuyko at openjdk.org (Dmitry Chuyko)
Date: Thu, 14 Mar 2024 12:41:53 GMT
Subject: Integrated: 8309271: A way to align already compiled methods with compiler directives
In-Reply-To: 
References: 
Message-ID: 

On Wed, 24 May 2023 00:38:27 GMT, Dmitry Chuyko wrote:

> Compiler Control (https://openjdk.org/jeps/165) provides method-context dependent control of the JVM compilers (C1 and C2). The active directive stack is built from the directive files passed with the `-XX:CompilerDirectivesFile` diagnostic command-line option and the Compiler.add_directives diagnostic command. It is also possible to clear all directives or remove the top from the stack.
> 
> A matching directive will be applied at method compilation time when such compilation is started. If directives are added or changed, but compilation does not start, then the state of compiled methods doesn't correspond to the rules. This is not an error, and it happens in long-running applications when directives are added or removed after compilation of methods that could be matched. For example, the user decides that C2 compilation needs to be disabled for some method due to a compiler bug, and issues such a directive, but this does not affect the application behavior. In such a case, the target application needs to be restarted, and such an operation can have high costs and risks. Another goal is testing/debugging compilers.
> 
> It would be convenient to optionally reconcile at least existing matching nmethods to the current stack of compiler directives (so bypass inlined methods).
> 
> A natural way to eliminate the discrepancy between the result of compilation and the broken rule is to discard the compilation result, i.e. deoptimization. Prior to that we can try to re-compile the method, letting the compile broker perform it taking the new directives stack into account.
> Re-compilation helps to prevent hot methods from executing in the interpreter.
> 
> A new flag `-r` has been introduced for some directives related to compile commands: `Compiler.add_directives`, `Compiler.remove_directives`, `Compiler.clear_directives`. The default behavior has not changed (no flag). If the new flag is present, the command scans already compiled methods and puts methods that have any active non-default matching compiler directives to re-compilation if possible, otherwise marks them for deoptimization. There is currently no distinction as to which directives are found. In particular, this means that if there are rules for inlining into some method, it will be refreshed. On the other hand, if there are rules for a method and it was inlined, top-level methods won't be refreshed, but this can be achieved by having rules for them.
> 
> In addition, a new diagnostic command `Compiler.replace_directives` has been added for ...

This pull request has now been integrated.

Changeset: c879627d
Author:    Dmitry Chuyko
URL:       https://git.openjdk.org/jdk/commit/c879627dbd7e9295d44f19ef237edb5de10805d5
Stats:     381 lines in 15 files changed: 348 ins; 3 del; 30 mod

8309271: A way to align already compiled methods with compiler directives

Reviewed-by: apangin, sspitsyn, tholenstein

-------------

PR: https://git.openjdk.org/jdk/pull/14111

From mbaesken at openjdk.org  Thu Mar 14 12:50:46 2024
From: mbaesken at openjdk.org (Matthias Baesken)
Date: Thu, 14 Mar 2024 12:50:46 GMT
Subject: RFR: JDK-8328165: improve assert(idx < _maxlrg) failed: oob
Message-ID: 

The assert in chaitin.hpp

assert(idx < _maxlrg) failed: oob

could be improved; it should show more information.
-------------

Commit messages:
 - JDK-8328165

Changes: https://git.openjdk.org/jdk/pull/18302/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18302&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8328165
Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/18302.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/18302/head:pull/18302

PR: https://git.openjdk.org/jdk/pull/18302

From mdoerr at openjdk.org  Thu Mar 14 13:08:38 2024
From: mdoerr at openjdk.org (Martin Doerr)
Date: Thu, 14 Mar 2024 13:08:38 GMT
Subject: RFR: JDK-8328165: improve assert(idx < _maxlrg) failed: oob
In-Reply-To: 
References: 
Message-ID: 

On Thu, 14 Mar 2024 12:45:20 GMT, Matthias Baesken wrote:

> The assert in chaitin.hpp
> 
> assert(idx < _maxlrg) failed: oob
> 
> could be improved; it should show more information.

LGTM.

-------------

Marked as reviewed by mdoerr (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/18302#pullrequestreview-1936648175

From roland at openjdk.org  Thu Mar 14 14:28:01 2024
From: roland at openjdk.org (Roland Westrelin)
Date: Thu, 14 Mar 2024 14:28:01 GMT
Subject: RFR: 8308660: C2 compilation hits 'node must be dead' assert
Message-ID: 

In `IfNode::fold_compares_helper()`, `adjusted_val` is:

    (SubI (AddI top constant) 0)

which is then transformed to the `top` node. The code next tries to destroy the `adjusted_val` node, i.e. the `top` node. That results in the assert failure. Given we're trying to fold 2 ifs in a dying part of the graph, the fix is straightforward: test `adjusted_val` for top and bail out from the transformation if that's the case.
-------------

Commit messages:
 - whitespaces
 - fix & test

Changes: https://git.openjdk.org/jdk/pull/18305/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18305&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8308660
Stats: 70 lines in 2 files changed: 70 ins; 0 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/18305.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/18305/head:pull/18305

PR: https://git.openjdk.org/jdk/pull/18305

From duke at openjdk.org  Thu Mar 14 15:13:42 2024
From: duke at openjdk.org (Oussama Louati)
Date: Thu, 14 Mar 2024 15:13:42 GMT
Subject: RFR: 8294976: test/hotspot 183 test classes use ASM [v5]
In-Reply-To: 
References: 
Message-ID: 

On Thu, 7 Mar 2024 14:05:18 GMT, Oussama Louati wrote:

>> Oussama Louati has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Fix typo in error message in GenManyIndyIncorrectBootstrap.java
>
> I ran the JTreg tests on this PR head after the full conversion of these tests, and nothing unusual happened; those aren't explicitly related to something else.

> @OssamaLouati thanks for the work you have put into doing this upgrade of the tests. That said, I do have a few concerns about this change, but let me start by asking you what testing you have performed using the Oracle CI infrastructure? We need to see a full tier 1 - 8 test run on all platforms to ensure this switch is not introducing new timeout failures or OOM conditions due to the use of this new API. Our `-Xcomp` runs in particular may be adversely affected depending on the number of classes involved compared to ASM.
>
> This is difficult to review because we lack Hotspot engineers who know the new ClassFile API.

I started running the full tier1-8 tests on mach5. I will wait until the jobs finish and update the OpenJDK bug with a confidential comment containing the link.
-------------

PR Comment: https://git.openjdk.org/jdk/pull/17834#issuecomment-1997685719

From chagedorn at openjdk.org  Thu Mar 14 15:16:40 2024
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Thu, 14 Mar 2024 15:16:40 GMT
Subject: RFR: 8308660: C2 compilation hits 'node must be dead' assert
In-Reply-To: 
References: 
Message-ID: 

On Thu, 14 Mar 2024 14:12:09 GMT, Roland Westrelin wrote:

> In `IfNode::fold_compares_helper()`, `adjusted_val` is:
>
>     (SubI (AddI top constant) 0)
>
> which is then transformed to the `top` node. The code next tries to
> destroy the `adjusted_val` node, i.e. the `top` node. That results in
> the assert failure. Given we're trying to fold 2 ifs in a dying part
> of the graph, the fix is straightforward: test `adjusted_val` for top
> and bail out from the transformation if that's the case.

Otherwise, looks good!

test/hotspot/jtreg/compiler/c2/TestFoldIfRemovesTopNode.java line 29:

> 27:  * @summary C2 compilation hits 'node must be dead' assert
> 28:  * @run main/othervm -XX:-BackgroundCompilation -XX:-TieredCompilation -XX:-UseOnStackReplacement -XX:+StressIGVN -XX:StressSeed=242006623 TestFoldIfRemovesTopNode
> 29:  * @run main/othervm -XX:-BackgroundCompilation -XX:-TieredCompilation -XX:-UseOnStackReplacement -XX:+StressIGVN TestFoldIfRemovesTopNode

You should add `-XX:+UnlockDiagnosticVMOptions` to run with product builds, and either add `-XX:+IgnoreUnrecognizedVMOptions` or `@requires vm.compiler2.enabled`, since `StressIGVN` is a C2 flag.

-------------

Marked as reviewed by chagedorn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/18305#pullrequestreview-1937005031
PR Review Comment: https://git.openjdk.org/jdk/pull/18305#discussion_r1525053127

From chagedorn at openjdk.org  Thu Mar 14 15:18:38 2024
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Thu, 14 Mar 2024 15:18:38 GMT
Subject: RFR: JDK-8328165: improve assert(idx < _maxlrg) failed: oob
In-Reply-To: 
References: 
Message-ID: 

On Thu, 14 Mar 2024 12:45:20 GMT, Matthias Baesken wrote:

> The assert in chaitin.hpp
> 
> assert(idx < _maxlrg) failed: oob
> 
> could be improved; it should show more information.

Looks good.

-------------

Marked as reviewed by chagedorn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/18302#pullrequestreview-1937017655

From roland at openjdk.org  Thu Mar 14 16:02:12 2024
From: roland at openjdk.org (Roland Westrelin)
Date: Thu, 14 Mar 2024 16:02:12 GMT
Subject: RFR: 8308660: C2 compilation hits 'node must be dead' assert [v2]
In-Reply-To: